Code for Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

Overview

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

We propose Disentangled Audio-Visual System (DAVS) to address arbitrary-subject talking face generation in this work, which aims to synthesize a sequence of face images that correspond to given speech semantics, conditioning on either an unconstrained speech audio or video.

[Project] [Paper] [Demo]

Recommendation of our CVPR 2021 repo

This repo is barely maintained, as this version of the code is out of date. If you are interested in the topic of Talking Face Generation, feel free to try the CODE of our CVPR 2021 PAPER!

Requirements

Generating test results

Create the default folder "checkpoints" and put the checkpoint in it, or note its CHECKPOINT_PATH.
  • Samples for testing can be found in the folder named 0572_0019_0003. This is a pre-processed sample from the VoxCeleb dataset.

  • Run the testing script to generate videos from video:

```bash
python test_all.py --test_root ./0572_0019_0003/video --test_type video --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
```

  • Run the testing script to generate videos from audio:

```bash
python test_all.py --test_root ./0572_0019_0003/audio --test_type audio --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
```

Sample Results

  • Talking Effect on Human Characters

  • Talking Effect on Non-human Characters (Trained on Human Faces Only)

Create more samples

  • The face detection tool used in the demo videos can be found at RSA. It returns a .mat file with five key-point locations in a row for each image. Other face-alignment methods, such as dlib, are also applicable. The key points we use for face alignment are the two eye centers and the average of the two mouth corners. With each image's PATH and the face POINTS, you can find our way of face alignment at preprocess/face_align.py; a minimal sketch of the idea is shown after this list.

  • Our preprocessing of the audio files is the same as, and borrowed from, the Matlab code of SyncNet. We then save the MFCC features into .bin files; a loading sketch is also shown below.
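For reference, here is a minimal sketch of the alignment step described above, assuming the five key points arrive in the common (left-eye, right-eye, nose, mouth-left, mouth-right) order. The target coordinates are illustrative; preprocess/face_align.py remains the authoritative implementation.

```python
import cv2
import numpy as np

def align_face(image, pts5, out_size=256):
    # `pts5`: five detected landmarks, assumed here to be ordered as
    # (left-eye, right-eye, nose, mouth-left, mouth-right).
    pts5 = np.asarray(pts5, dtype=np.float32)
    # Anchor points per this README: the two eye centers and the
    # midpoint of the two mouth corners.
    src = np.float32([pts5[0], pts5[1], (pts5[3] + pts5[4]) / 2.0])
    # Illustrative target positions in the aligned crop; see
    # preprocess/face_align.py for the values the repo actually uses.
    dst = np.float32([[70, 112], [110, 112], [90, 150]])
    M = cv2.getAffineTransform(src, dst)  # exact affine from 3 point pairs
    return cv2.warpAffine(image, M, (out_size, out_size))
```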
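To inspect one of the saved .bin files, something like the following works, under the (unverified) assumption that each file holds raw floats with 20 coefficients per audio frame; check preprocess/savemfcc.m for the authoritative layout.

```python
import numpy as np

def load_mfcc_bin(path, n_coeff=20, dtype=np.float32):
    # Hedged assumption: each .bin under mfcc20/ stores raw MFCC values
    # ("mfcc20" suggests 20 coefficients per audio frame). If the Matlab
    # code wrote doubles, pass dtype=np.float64 instead.
    data = np.fromfile(path, dtype=dtype)
    if data.size % n_coeff == 0:
        return data.reshape(-1, n_coeff)  # (time, n_coeff)
    return data  # fall back to the flat array if the layout differs
```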

Preparing Training Data

  • We used the LRW dataset for training.
  • The directories are arranged like this:
```
data
├── train, val, test
│   ├── 0, 1, 2 ... 499 (one folder for each class)
│   │   ├── 0, 1, 2 ... #videos per class
│   │   │   ├── align_face256
│   │   │   │   ├── 0, 1, ... 28.jpg
│   │   │   └── mfcc20
│   │   │       ├── 2, 3 ... 26.bin
```

where each video is extracted to frames and aligned using our protocol, and each audio file is processed and saved using Matlab. A minimal sketch for enumerating the resulting samples is shown below.
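The following hypothetical helper walks the layout above and pairs each video's aligned frames with its MFCC bins; folder names follow the tree shown in this README, so adjust if your local layout differs.

```python
import os

def list_samples(root, split="train"):
    # Walk data/<split>/<class>/<video>/ and collect the frame and MFCC
    # directories for each video.
    samples = []
    split_dir = os.path.join(root, split)
    for cls in sorted(os.listdir(split_dir)):          # 0 ... 499
        for vid in sorted(os.listdir(os.path.join(split_dir, cls))):
            frame_dir = os.path.join(split_dir, cls, vid, "align_face256")
            mfcc_dir = os.path.join(split_dir, cls, vid, "mfcc20")
            if os.path.isdir(frame_dir) and os.path.isdir(mfcc_dir):
                samples.append((frame_dir, mfcc_dir))
    return samples

# e.g. list_samples("data", "train")[0]
#   -> ('data/train/0/0/align_face256', 'data/train/0/0/mfcc20')
```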

Training

python train.py
  • This is still a beta version of the training code, which only disentangles word-identity (wid) information from the person-identity (pid) space. Running train.py alone may not fully reproduce the paper; however, it can serve as a reference for how we implement the whole training process.
  • In our own implementation, the classification part (without generation and disentanglement) is pretrained first. The pretraining code is temporarily not provided.

Postprocessing Details (Optional)

  • The directly generated results may suffer from a "zoom-in-and-out" effect, which we assume is caused by our alignment of the training set. In the demos we mitigate this instability with Subspace Video Stabilization; a simpler smoothing sketch is shown below.
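If you would rather stay in Python, a rough approximation of this smoothing can be sketched with OpenCV. This is a generic moving-average trajectory stabilizer, not the Subspace Video Stabilization used for the demo videos, and it assumes reasonably textured frames for feature tracking.

```python
import cv2
import numpy as np

def stabilize(frames, radius=15):
    # Estimate per-step similarity motion between consecutive frames,
    # smooth the accumulated trajectory with a moving average, and warp
    # each frame toward the smoothed path.
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    steps = []
    for prev, cur in zip(grays[:-1], grays[1:]):
        pts = cv2.goodFeaturesToTrack(prev, 200, 0.01, 20)
        nxt, ok, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
        good = ok.ravel() == 1
        m, _ = cv2.estimateAffinePartial2D(pts[good], nxt[good])
        steps.append([m[0, 2], m[1, 2], np.arctan2(m[1, 0], m[0, 0])])
    traj = np.cumsum(steps, axis=0)  # cumulative (dx, dy, d_angle)
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    smooth = np.column_stack(
        [np.convolve(traj[:, i], kernel, "same") for i in range(3)])
    h, w = frames[0].shape[:2]
    out = [frames[0]]
    # Warp each frame by the correction that moves its actual position
    # onto the smoothed trajectory.
    for frame, (dx, dy, da) in zip(frames[1:], smooth - traj):
        m = np.float32([[np.cos(da), -np.sin(da), dx],
                        [np.sin(da),  np.cos(da), dy]])
        out.append(cv2.warpAffine(frame, m, (w, h)))
    return out
```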

License and Citation

The use of this software is RESTRICTED to non-commercial research and educational purposes.

@inproceedings{zhou2019talking,
  title     = {Talking Face Generation by Adversarially Disentangled Audio-Visual Representation},
  author    = {Zhou, Hang and Liu, Yu and Liu, Ziwei and Luo, Ping and Wang, Xiaogang},
  booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
  year      = {2019},
}

Acknowledgement

The structure of this codebase is borrowed from pix2pix.

Comments
  • Pre-Processing Data

    Hey @Hangz-nju-cuhk @liuziwei7 @liuyuisanai! I am trying to understand and reproduce the results of this repository end-to-end so that I can create a Dockerfile and contribute. I have already read the paper thoroughly and am analyzing the code now, but I am having trouble pre-processing the data. Could you please guide me through the process step by step? Looking forward to hearing from you. Thank you.

    opened by MrAsimZahid 5
  • Chinese characters are spoken faster than English words, will this model work on Chinese?

    I want to build a dataset of Chinese characters to train this model. I applied speech recognition to some Chinese news videos (by CCTV). The recognition part was fine, but I found that Chinese characters are too short in terms of pronunciation time because each of them has only one syllable. The average number of video frames it takes to show the lip movement of a single Chinese character is only 5 (at 25 fps), and it can be as low as 2 frames. This is much less than the required 29 frames, and interpolation obviously won't work well in this case. So I would like to know whether you have considered Chinese. Will this model work? Is there any workaround?

    opened by zwfcrazy 4
  • Can I add new images to the demo_images folder for testing?

    Hi Hang Zhou, I just added some new images to demo_images for testing and found that the variation of the generated fake frames is not like that of the four demo images. Does this repo's code support testing on other images, or should I do some preprocessing on my own images first?

    opened by jianglingling007 4
  • Issues about training and some training errors

    Hello, I am an undergraduate at the University of Science and Technology of China. Recently some classmates and I wanted to learn about talking face generation, so we chose to reproduce your code. The preprocessing went fairly smoothly following your approach, but the training code had many errors; after modifying several places we finally got train.py to run. We then noticed that you committed a change on October 5 that matches the modifications we had made. Have you also been revising the training code recently? Do you have a new version of the code? We would very much like to discuss this with you. If you are willing, my email is [email protected]; we would be most grateful.

    opened by chenshi3 2
  • Strange filterbank parameter value

    https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS/blob/c0233ace95be15fb1665dfcd056d82117822a797/preprocess/savemfcc.m#L7

    In the Readme it is suggested that you use audio pre-processing similar to Zisserman et al. However, they use 40 filterbank channels across their code (e.g. in the yousaidthat repository https://github.com/joonson/yousaidthat/blob/98b51812894497cb6c2b65a7ae147067609fc6ca/run_demo.m#L22). I was wondering if there was a reason for choosing 13, or if it had just been mixed up with the number of cepstral coefficients.

    Thanks,

    opened by roodrallec 2
  • Undefined function or variable 'vec2frames'.

    Hello, I used savemfcc.m to generate *.bin files, but when I executed the code, an error occurred:

    savemfcc('~/talkingface//20180619_1_M.wav','~/talkingface/tlkface/wav')
    Undefined function or variable 'vec2frames'.

    Error in mfcc (line 151) frames = vec2frames( speech, Nw, Ns, 'cols', window, false );

    Error in runmfcc (line 5) [ CC, FBE, frames ] = mfcc( speech, opt.fs, opt.Tw, opt.Ts, opt.alpha, hamming, opt.R, opt.M, N, opt.L );

    Error in savemfcc (line 17) [ MFCCs, ~, ~ ] = runmfcc( Speech, opt );

    Could you please tell me where to find 'vec2frames'?

    opened by zzzzhuque 2
  • What's the meaning of the parameter --test_audio_video_length?

    In the test command python test_all.py --test_root ./0572_0019_0003/video --test_type video --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH, what is the meaning of the parameter --test_audio_video_length?

    opened by ZhengMengbin 1
  • Questions about pretraining process and small errors in train.py

    Hi, firstly I want to thank you for sharing such a great project. However, I noticed that you wrote 'The pretraining code is temporarily not provided.' in README.md, so I was wondering if my understanding of the classification part is right. Here is my own assumption:

    1. Use the subset of the MS-Celeb-1M dataset to train the ID_encoder part.
    2. Use the optimize_parameters_no_generation() function in Gen_final_v1.py and the LRW dataset to train the lip_feature_encoder, mfcc_encoder, and model_fusion parts.

    Moreover, when I read and tried to train the model using train.py, I found some small errors. For example, opt.isTrain and opt.eval_freq are not defined in Options.py, and pair in lip_reading_loader() should be (2, 25), since there are only 24 files in /mfcc20. So I would like to know whether you will update the project later, which would be of great help to me.
    opened by jixinya 1
  • Can I use some other audio for testing except for the example 0572_0019_0003.wav?

    I have just tried some other audio files for testing (I used the Matlab code to turn the *.wav files into MFCC .bin files) and found that most of the wav files did not make the model generate the images; only a few did. Does the model support this kind of test?

    opened by jianglingling007 1
  • Is audio video offset considered in LRW?

    @Hangz-nju-cuhk LRW is used to train the model according to your paper, but there are audio-video offsets in LRW videos, and [11] used SyncNet when pre-processing the dataset to correct the offset. Did you consider this problem when preparing the dataset? Thank you!

    [11] Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? arXiv preprint arXiv:1705.02966, 2017.

    opened by leeeeeeo 1
  • Generation from audio

    Hi~ Thanks for your code, but I encountered some problems when running the testing script to generate videos from audio.

    1. In Test_load_audio.py, config seems to have no require_audio attribute: when I run python test_all.py --test_root './0572_0019_0003/audio' --test_type 'audio' --test_audio_video_length 99 --test_resume_path, I get the error AttributeError: 'Namespace' object has no attribute 'require_audio'.
    2. When will you release the complete code?
    opened by 7aughing 1
  • question about computing contrastive loss

    Hi, why do you use Variable(self.lip_embedding_norm.data) here instead of using self.lip_embedding_norm directly? And why is the denominator 2*batch_size instead of batch_size^2?

    opened by Pixie412 0
  • Pretrained checkpoint tarball does not look like a tar file

    I downloaded the pretrained checkpoint from the link https://drive.google.com/file/d/1UW22xm4r9AewNoySyPd2fyUab0nqymBR/view, but when I tried to extract it, it reported the following error:

    tar: This does not look like a tar archive
    tar: Skipping to next header
    tar: Exiting with failure status due to previous errors

    Would you please check the tarball?

    opened by la0216 0
  • A question about how the mouth.txt data is processed

    Hi everyone, I would like to ask about the processing of the mouth.txt data: what do the values in mouth.txt represent, and how many dimensions do they have? I have not downloaded the dataset the authors used; I want to use footage of news anchors instead, but I do not know how to generate the mouth.txt file. If anyone has produced mouth.txt data, please tell me its dimensionality and what each value means. Thanks @liuziwei7 @liuyuisanai @Hangz-nju-cuhk

    opened by love112358 0
  • How to turn the output result of test_all into a video (image + audio) form

    Thank you for the answers and solutions to earlier questions; they have been a great help. I have now successfully obtained the test_all (video and audio) output. The question now is how to combine the output with the original audio to form a complete video, and how to ensure that the mouth shapes and the audio in the video are aligned. @Hangz-nju-cuhk

    opened by love112358 0
  • face_align

    Hi, thank you for sharing; it has been helpful for me. In face_align.py, I do not understand how these points are generated: points = [[70, 112], [110, 112], [90, 150]]. I am looking forward to your reply, thank you.

    opened by HITxyer 1