Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Overview


This repository contains the implementation of the following paper:

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

Yuanxun Lu, Jinxiang Chai, Xun Cao (SIGGRAPH Asia 2021)

Abstract: To the best of our knowledge, we first present a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper body motions, where the former is generated by an autoregressive probabilistic model which models the head pose distribution of the target person. Upper body motions are deduced from head poses. In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.

[Project Page] [Paper] [Arxiv]
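The three stages described in the abstract map naturally onto a per-frame streaming loop. The outline below is purely illustrative: every function in it is a stub standing in for the corresponding stage, not code from this repository.

    # Illustrative outline of the three-stage pipeline described in the abstract.
    # Every function here is a stub/placeholder, not this repository's API.
    import numpy as np

    def extract_and_project_audio_features(audio_window):
        # Stage 1: deep audio features plus manifold projection into the target
        # person's speech space (stubbed as a random 512-d feature).
        return np.random.randn(512)

    def predict_motions(audio_feature, prev_pose):
        # Stage 2: mouth-related motion, autoregressive head pose, and upper-body
        # motion deduced from the head pose (all stubbed).
        mouth = np.random.randn(75)
        head_pose = prev_pose + 0.01 * np.random.randn(6)
        upper_body = 0.5 * head_pose[:3]
        return mouth, head_pose, upper_body

    def render_frame(mouth, head_pose, upper_body, candidate_images):
        # Stage 3: rasterize conditional feature maps and run the image-to-image
        # translation network (stubbed as a blank frame).
        return np.zeros((512, 512, 3), dtype=np.uint8)

    # Streaming loop over (placeholder) audio windows at video rate.
    pose = np.zeros(6)
    candidates = [np.zeros((512, 512, 3), dtype=np.uint8)] * 4
    for audio_window in range(3):
        feat = extract_and_project_audio_features(audio_window)
        mouth, pose, body = predict_motions(feat, pose)
        frame = render_frame(mouth, pose, body, candidates)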

Teaser

Figure 1. Given an arbitrary input audio stream, our system generates personalized and photorealistic talking-head animation in real-time. Right: May and Obama are driven by the same utterance but present different speaking characteristics.

Requirements

  • This project was trained and tested successfully on Windows 10 with PyTorch 1.7 (Python 3.6). Linux and earlier PyTorch versions should also work (not tested). We recommend creating a new environment:
conda create -n LSP python=3.6
conda activate LSP
  • Clone the repository:
git clone https://github.com/YuanxunLu/LiveSpeechPortraits.git
cd LiveSpeechPortraits
  • FFmpeg is required to combine the audio with the silent generated videos. Please check FFmpeg for installation instructions. Linux users can also run:
sudo apt-get install ffmpeg
  • Install the dependencies:
pip install -r requirements.txt

Demo

  • Download the pre-trained models and data from Google Drive into the data folder. Data for five subjects are released (May, Obama1, Obama2, Nadella, and McStay).

  • Run the demo:

    python demo.py --id May --driving_audio ./data/input/00083.wav
    

    Results can be found under the results folder.
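    If you ever need to combine a silent rendering with its driving audio yourself (the FFmpeg step mentioned in the requirements), a call along the following lines works; the file names are placeholders, not the demo's actual output paths:

    # Hedged example: mux a driving audio track onto a silent rendered video with
    # FFmpeg. The input/output file names below are placeholders.
    import subprocess

    subprocess.run([
        "ffmpeg", "-y",
        "-i", "silent_render.avi",          # silent video produced by the pipeline (placeholder name)
        "-i", "./data/input/00083.wav",     # driving audio
        "-c:v", "libx264", "-c:a", "aac", "-shortest",
        "result_with_audio.mp4",
    ], check=True)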

Citation

If you find this project useful for your research, please consider citing:

@article{LiveSpeechPortraits_SIGGRAPH_ASIA_2021,
 author = {Lu, Yuanxun and Chai, Jinxiang and Cao, Xun},
 title = {{Live Speech Portraits}: Real-Time Photorealistic Talking-Head Animation},
 journal = {ACM Transactions on Graphics},
 numpages = {17},
 volume = {40},
 number = {6},
 month = dec,
 year = {2021},
 doi = {10.1145/3478513.3480484}
}

Acknowledgment

Comments
  • Questions about training audio2feature model


    Hello, I am trying to reconstruct the training code and I have several questions:

    1. From what I saw in audio2feature_model.py, in the forward method the size of self.audio_feats is [b, 1, nfeats, nwins], while in audio2feature.py the dimension of audio_features is [b, T, ndim]. From my understanding (correct me if I am wrong), for batch_size=32, T=240*2, ndim=512 (the APC feature dimension), the input batch for the Audio2Feature model should be [32, 480, 512] (480 because mel_frame is n_frames * 2) and the output size [32, 240, 75]. Is that right? (See the shape sketch after this comment.)

    2. Furthermore, in Section 3.2 of your paper a delay d=18 is added during training, but it is not reflected in the code. How does that work in training? For example, is m0 inferred from h0, h1, ..., h18?

    3. In audiovisual_dataset.py, you seem to clip the audio into many pieces and extract APC features for each clip. How many clips would there be for a given dataset, e.g. a 4-minute 60 fps video?

    Some of these may be naive questions since I am not very familiar with audio processing; please correct me if I made mistakes, thanks!
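    The shapes and look-ahead delay asked about in points 1 and 2 can be sanity-checked with dummy tensors. The sketch below only illustrates the questioner's assumptions (batch 32, T = 240 output frames, 2 mel frames per video frame, 512-d APC features, 75-d mouth parameters, d = 18); it is not the repository's training code:

        # Minimal shape/delay sanity check under the assumptions stated in the
        # question above; this is NOT the repository's training code.
        import torch

        b, T, ndim, out_dim, d = 32, 240, 512, 75, 18

        audio_feats = torch.randn(b, 2 * T, ndim)   # 2 mel frames per video frame -> [32, 480, 512]
        mouth_params = torch.randn(b, T, out_dim)   # mouth-related parameters      -> [32, 240, 75]

        # A look-ahead delay of d frames means the motion at frame t may attend to
        # audio features up to frame t + d, i.e. m_0 depends on h_0 ... h_d.
        t = 0
        context = audio_feats[:, : 2 * (t + d + 1)]  # audio context available when predicting m_0
        print(audio_feats.shape, mouth_params.shape, context.shape)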

    opened by TimmmYang 13
  • Audio to Mouth-related Motion


    Hello, regarding the "Audio to Mouth-related Motion" stage mentioned in the paper: the audio features come from an APC model trained on a Chinese dataset (the Mandarin Chinese part of the Common Voice dataset), and the audio2feature part then has to be retrained for every target person, is that right? But each target person may only have a few minutes of data (3-5 minutes); does that hurt generalization a lot, for example if training uses a female speaker and testing uses a male speaker? The approach I am currently trying is based on ATnet: train the mapping from audio to facial landmarks on a large dataset, then use the audio features extracted by ATnet and, similarly to your paper, fine-tune the mapping from audio features to facial-expression parameters with 3-5 minutes of video. However, the generalization in my tests does not look good, so I would like to hear your thoughts.

    opened by DWCTOD 9
  • Input Discriminator channels


    Hi, I noticed that the discriminator input has 23 channels.

    Why is this? I would have thought the input would have 3 channels, one for each of the RGB channels in an image.

    Thanks
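    Not an official answer, but in conditional image-to-image GANs the discriminator usually sees the conditioning maps concatenated channel-wise with the real or generated RGB frame, so its input channel count is the sum of both rather than just 3. A generic sketch of that convention follows; the 20-channel condition is only illustrative, not the repository's exact 23-channel layout:

        # Generic conditional-discriminator input: conditioning maps concatenated
        # channel-wise with the RGB image, so input channels = cond_channels + 3.
        # The 20-channel condition is illustrative, not the exact breakdown.
        import torch

        cond_channels, batch, size = 20, 2, 512
        condition = torch.randn(batch, cond_channels, size, size)  # feature maps / candidate images
        image = torch.randn(batch, 3, size, size)                  # real or generated frame

        disc_input = torch.cat([condition, image], dim=1)
        print(disc_input.shape)                                     # torch.Size([2, 23, 512, 512])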

    opened by torphix 5
  • I have a question about mean_pts3d.npy and tracked3D_normalized_pts_fix_contour.npy, can you help me?


    1. Can you explain the difference between the two files? My understanding is that mean_pts3d.npy is the average of the 3D landmarks, while tracked3D_normalized_pts_fix_contour.npy contains the per-frame 3D landmarks. Sorry if this is a basic question. Thanks!
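    A hedged sketch of the relationship described above, assuming tracked3D_normalized_pts_fix_contour.npy stores per-frame landmarks of shape [n_frames, n_points, 3] and mean_pts3d.npy is simply their temporal average (an assumption, not verified against the released data):

        # Hedged sketch: per-frame 3D landmarks vs. their temporal mean.
        # The shapes are assumptions, not verified against the released .npy files.
        import numpy as np

        tracked = np.load("tracked3D_normalized_pts_fix_contour.npy")  # [n_frames, n_points, 3]
        mean_pts3d = tracked.mean(axis=0)                              # [n_points, 3]
        print(tracked.shape, mean_pts3d.shape)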
    opened by 1615070057 5
  • About data preprocessing


    Impressive job!

    I wonder how to preprocess the images. Specifically, could you please share the scripts for choosing the four candidate images from the sequences and for drawing the shoulder edges, since the landmark detectors I have found are all face landmark detectors?

    Thanks !

    opened by X-niper 5
  • I want to try to train a model for a personal speaker


    Hi, I want to try to train a model for a personal speaker. Can I train only the Audio2Feature and Audio2Headpose models on top of your APC model weights? (In other words, is the APC model generalizable?) Can you give me some advice? Thanks a lot.

    opened by VERMANs 4
  • Target Speech Representation Database


    Hi, thank you for the amazing library and open-source code.

    It is helping me learn a lot. One question I had is about the target speech representation database. Is it simply the embeddings of several utterances from the target speaker, with the input speech then essentially mapped to the closest point among those embeddings?

    E.g.: extract embeddings from 50 Obama utterances -> input an arbitrary speech sample -> map the embedding of the arbitrary sample to the closest Obama representation.

    Thank you
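    For illustration, the mapping described above can be sketched as a nearest-neighbour projection onto a database of the target speaker's APC features. The inverse-distance weighting below is only a stand-in for the paper's manifold projection, not its exact solver:

        # Illustrative nearest-neighbour projection onto a target speaker's feature
        # database; the inverse-distance weighting is a stand-in, not the paper's
        # exact projection.
        import numpy as np

        def project_to_database(feat, database, k=10, eps=1e-8):
            """Replace `feat` [ndim] with a weighted mix of its k nearest database entries."""
            dists = np.linalg.norm(database - feat, axis=1)
            idx = np.argsort(dists)[:k]
            weights = 1.0 / (dists[idx] + eps)
            weights /= weights.sum()
            return weights @ database[idx]

        database = np.random.randn(5000, 512).astype(np.float32)  # e.g. 5000 stored APC features
        query = np.random.randn(512).astype(np.float32)           # feature from arbitrary input speech
        print(project_to_database(query, database).shape)         # (512,)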

    opened by torphix 3
  • How to generate APC_feature_base.npy for each person?


    Dear fellow, I found that the manifold projection uses an APC_feature_base.npy for each person, but it is not clear how to generate this file. Is it produced by training audio2feature_model on the target person's voice?
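    Not an official answer, but the file name suggests it is a database of APC features extracted from the target person's own audio with the pretrained APC encoder, rather than a by-product of training audio2feature_model. A hedged sketch of building such a base; extract_apc_features is a hypothetical stand-in, not this repository's API:

        # Hedged sketch: build an APC feature base from the target person's audio.
        # `extract_apc_features` is a hypothetical stand-in for the pretrained APC
        # encoder; the file names are placeholders.
        import numpy as np

        def extract_apc_features(mel_frames):
            # Placeholder: in practice this would run the pretrained APC RNN and
            # return one 512-d feature per input mel frame.
            return np.random.randn(len(mel_frames), 512).astype(np.float32)

        mels = np.load("target_person_mel80.npy")   # [n_frames, 80] log-mel features (placeholder file)
        feature_base = extract_apc_features(mels)   # [n_frames, 512]
        np.save("APC_feature_base.npy", feature_base)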

    opened by vicdxxx 3
  • Error when running the demo


    cv2.error: OpenCV(4.5.4) :-1: error: (-5:Bad argument) in function 'line'

    Overload resolution failed:

    • Can't parse 'pt1'. Sequence item with index 0 has a wrong type
    • Can't parse 'pt1'. Sequence item with index 0 has a wrong type

    I got this error when running the demo. Any solutions, please?
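    This error typically comes from newer OpenCV builds rejecting floating-point (or NumPy-typed) point coordinates in cv2.line. A hedged workaround is to cast the landmark coordinates to plain Python ints before drawing; the standalone sketch below mirrors the failing call but is not a verified patch for face_dataset.py:

        # Hedged workaround: cast landmark coordinates to plain Python ints before
        # passing them to cv2.line; newer OpenCV versions reject float/NumPy types.
        import cv2
        import numpy as np

        im_edges = np.zeros((512, 512), dtype=np.uint8)
        pt1 = np.array([100.7, 200.2])   # e.g. a predicted landmark with float coordinates
        pt2 = np.array([300.1, 400.9])

        def to_int_point(p):
            return tuple(int(round(v)) for v in p)

        im_edges = cv2.line(im_edges, to_int_point(pt1), to_int_point(pt2), 255, 2)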
    opened by muxiddin19 3
  • About an Image2Image translation inference issue


    1. Image2Image translation & Saving results...
       Image2Image translation inference: 0%| | 0/672 [00:00<?, ?it/s]

       Traceback (most recent call last):
         File "demo.py", line 264, in <module>
           facedataset.dataset.image_pad)
         File "C:\Users\23046\LiveSpeechPortraits\datasets\face_dataset.py", line 280, in get_data_test_mode
           feature_map = torch.from_numpy(self.get_feature_image(landmarks, (self.opt.loadSize, self.opt.loadSize), shoulder, pad)[np.newaxis, :].astype(np.float32)/255.)
         File "C:\Users\23046\LiveSpeechPortraits\datasets\face_dataset.py", line 287, in get_feature_image
           im_edges = self.draw_face_feature_maps(landmarks, size)
         File "C:\Users\23046\LiveSpeechPortraits\datasets\face_dataset.py", line 317, in draw_face_feature_maps
           im_edges = cv2.line(im_edges, tuple(keypoints[edge[i]]), tuple(keypoints[edge[i+1]]), 255, 2)
       cv2.error: OpenCV(4.5.4) :-1: error: (-5:Bad argument) in function 'line'

    Overload resolution failed:

    • Can't parse 'pt1'. Sequence item with index 0 has a wrong type
    • Can't parse 'pt1'. Sequence item with index 0 has a wrong type
    opened by huaishuiweizhu 3
  • Generate live speech portraits for an arbitrary person


    Thanks for sharing such nice work. Can I run this code for an arbitrary person? How is the information in the data folder obtained for each person?

    Thank you for your help!

    opened by tylersky1993 3
  • Personalized data generation


    Hey, I cloned the repo. It is working fine with the pretrained models and the provided data. But I wish to train the model on a new face; how do I generate data for the new face? Is there any source code available, and if so, is it included in this repo? If not, could you please provide it?

    If anyone has worked on personalized datasets, please do comment. Your help is appreciated. Thanks.

    opened by arvind-kr7 0
  • ModuleNotFoundError: No module named 'numba.decorators'


    I get the following error when trying to execute the demo. Searching on the error, I found suggestions to change the numba/librosa versions, but I would appreciate guidance for this project.

    (LSP) C:\Users\Kris\source\repos\LiveSpeechPortraits>python demo.py --id May --driving_audio .\data\Input\00083.wav --device cuda
    Traceback (most recent call last):
      File "demo.py", line 8, in <module>
        import librosa
      File "C:\Users\Kris\anaconda3\envs\LSP\lib\site-packages\librosa\__init__.py", line 13, in <module>
        from . import core
      File "C:\Users\Kris\anaconda3\envs\LSP\lib\site-packages\librosa\core\__init__.py", line 114, in <module>
        from .time_frequency import *  # pylint: disable=wildcard-import
      File "C:\Users\Kris\anaconda3\envs\LSP\lib\site-packages\librosa\core\time_frequency.py", line 10, in <module>
        from ..util.exceptions import ParameterError
      File "C:\Users\Kris\anaconda3\envs\LSP\lib\site-packages\librosa\util\__init__.py", line 73, in <module>
        from . import decorators
      File "C:\Users\Kris\anaconda3\envs\LSP\lib\site-packages\librosa\util\decorators.py", line 9, in <module>
        from numba.decorators import jit as optional_jit
    ModuleNotFoundError: No module named 'numba.decorators'

    We followed the README instructions for the LSP environment, installed PyTorch 1.7, and the requirements like so:

    conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch
    pip install -r requirements.txt

    Here are the versions of the packages:

    All requirements were reported by pip as already satisfied; the installed versions are:
    tqdm 4.64.0, librosa 0.7.0, scikit-image 0.17.2, opencv-python 4.4.0.40, scipy 1.5.4, dominate 2.7.0,
    albumentations 0.5.2, numpy 1.19.2, beautifulsoup4 4.11.1, decorator 4.4.2, soundfile 0.10.3.post1,
    six 1.16.0, numba 0.53.1, scikit-learn 0.24.2, resampy 0.4.0, joblib 1.1.0, audioread 3.0.0,
    opencv-python-headless 4.6.0.66, PyYAML 6.0, imgaug 0.4.0, importlib-resources 5.4.0, colorama 0.4.5,
    imageio 2.15.0, pillow 8.4.0, tifffile 2020.9.3, matplotlib 3.3.4, PyWavelets 1.1.1, networkx 2.5.1,
    soupsieve 2.3.2.post1, Shapely 1.8.2, pyparsing 3.0.9, python-dateutil 2.8.2, kiwisolver 1.3.1,
    cycler 0.11.0, setuptools 58.0.4, llvmlite 0.36.0, threadpoolctl 3.1.0, cffi 1.15.1, pycparser 2.21,
    zipp 3.6.0
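    For context, librosa 0.7.0 imports numba.decorators, a module that was removed in numba 0.50, and the environment above has numba 0.53.1. A commonly reported workaround is to pin an older numba (e.g. numba==0.48.0, which also pulls in an older llvmlite) or to move to a newer librosa. A small, hedged check of the installed combination:

        # Hedged compatibility check: librosa 0.7.0 needs `numba.decorators`,
        # which was removed in numba 0.50. Pinning numba below 0.50 (or upgrading
        # librosa past 0.7.x) is the commonly reported fix.
        import numba

        major, minor = (int(x) for x in numba.__version__.split(".")[:2])
        if (major, minor) >= (0, 50):
            print(f"numba {numba.__version__}: numba.decorators is gone; "
                  "downgrade numba below 0.50 or upgrade librosa.")
        else:
            print(f"numba {numba.__version__} should still expose numba.decorators.")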

    opened by Robertsmania 1
  • How to align the 3D landmarks tracked from the input video to get fixed 3D landmarks (fixed by the contour or center point)?


    Is there some resource, or some pseudocode? Some references I found:
    https://www.mathworks.com/help/vision/ref/estimategeometrictransform3d.html
    https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.transform.Rotation.align_vectors.html
    http://learning.aols.org/aols/3D_Affine_Coordinate_Transformations.pdf
    https://en.wikipedia.org/wiki/Kabsch_algorithm
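    One standard approach to this kind of rigid alignment (the Kabsch algorithm linked above) is an SVD-based Procrustes fit between a frame's tracked landmarks and a fixed reference set. A minimal sketch, assuming both are [n_points, 3] arrays in point-to-point correspondence (scale is ignored here):

        # Minimal rigid (Kabsch/Procrustes) alignment sketch: find R, t that best map
        # the tracked landmarks of one frame onto a fixed reference landmark set.
        # Assumes both arrays are [n_points, 3] and in point-to-point correspondence.
        import numpy as np

        def kabsch_align(src, ref):
            src_c, ref_c = src.mean(axis=0), ref.mean(axis=0)
            H = (src - src_c).T @ (ref - ref_c)              # 3x3 cross-covariance
            U, _, Vt = np.linalg.svd(H)
            d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
            R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
            t = ref_c - R @ src_c
            return R, t

        reference = np.random.randn(73, 3)                    # fixed landmark template (illustrative)
        theta = np.deg2rad(10.0)
        Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0,            0.0,           1.0]])
        tracked = reference @ Rz.T + np.array([0.1, -0.2, 0.3])  # rotated + translated copy
        R, t = kabsch_align(tracked, reference)
        aligned = tracked @ R.T + t                           # tracked landmarks in the reference frame
        print(np.abs(aligned - reference).max())              # ~0 up to numerical error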

    opened by foocker 0
  • A new LSP project with complete training and deployment is available here


    Considering the security/licence of the original project, I recreated the algorithm based on DECA and LiveSpeechPortraits, and the other blocks are coming. There are still some problems; discussion and PRs are welcome. Here: lsp

    opened by foocker 0
  • About the camera model


    In inference, your camera model is fixed, which means camera_intrinsic, scale, etc. are constant. But when I use DECA, every image has its own camera parameters, so in the training data the camera parameters differ for every frame. How do I fix this problem?

    opened by foocker 0