Code for Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Overview

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu.

Project | Paper | Demo

We propose Pose-Controllable Audio-Visual System (PC-AVS), which achieves free pose control when driving arbitrary talking faces with audio. Instead of learning pose motions from audio, we leverage another pose source video to compensate only for head motions. The key is to devise an implicit low-dimensional pose code that is free of mouth shape or identity information. In this way, audio-visual representations are modularized into the spaces of three key factors: speech content, head pose, and identity information.

Requirements

  • Python 3.6 and PyTorch 1.3.0 are used. Basic requirements are listed in requirements.txt:
pip install -r requirements.txt
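
For instance, a fresh environment matching the documented versions can be set up as follows (a minimal sketch assuming conda; any Python 3.6 virtual environment works equally well, and the exact PyTorch install command may vary with your platform and CUDA version):

# create an isolated environment with the documented Python version
conda create -n pc-avs python=3.6 -y
conda activate pc-avs
# install PyTorch 1.3.0 as documented above, then the remaining requirements
pip install torch==1.3.0
pip install -r requirements.txt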

Quick Start: Generate Demo Results

  • Download the pre-trained checkpoints.

  • Create the default folder ./checkpoints and unzip demo.zip into ./checkpoints/demo. There should be 5 .pth files in it.

  • Unzip all *.zip files within the misc folder.

  • Run the demo scripts:

bash experiments/demo_vox.sh
  • The --gen_video argument is on by default; ffmpeg >= 4.2.0 is required to use this flag on Linux systems. All frames, along with an avconcat.mp4 video file, will be saved in the ./id_517600055_pose_517600078_audio_681600002/results folder in the following form:

From left to right are the reference input, the generated results, the pose source video and the synced original video with the driving audio.
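
If no avconcat.mp4 is produced, a quick sanity check (assuming ffmpeg was installed system-wide) is to confirm the version meets the requirement above; on older Ubuntu releases a newer build can be obtained from a PPA such as ppa:savoury1/ffmpeg4, as mentioned in the ffmpeg comment below.

# print the installed ffmpeg version; it should report 4.2.0 or newer
ffmpeg -version | head -n 1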

Prepare Testing Meta Data

  • Automatic VoxCeleb2 Data Formulation

The inference script experiments/demo_vox.sh refers to ./misc/demo.csv for testing data paths. On Linux systems, an applicable csv file can be created automatically by running:

python scripts/prepare_testing_files.py

Then modify the meta_path_vox in experiments/demo_vox.sh to './misc/demo2.csv' and run

bash experiments/demo_vox.sh

An additional result should be saved.
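
For reference, the change is a one-line edit: wherever experiments/demo_vox.sh sets meta_path_vox (whether as a shell variable or as a --meta_path_vox argument passed to the inference script; the script itself is not reproduced here), point it at the new file, e.g.:

--meta_path_vox ./misc/demo2.csv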

  • Metadata Details

In detail, scripts/prepare_testing_files.py provides several flags that offer great flexibility when formulating the metadata (an example invocation is given after this list):

  1. --src_pose_path denotes the driving pose source path. It can be an mp4 file or a folder containing frames in the form of %06d.jpg starting from 0.

  2. --src_audio_path denotes the path to the audio source. It can be an mp3 audio file or an mp4 video file. If a video is given, its frames will be automatically saved in ./misc/Mouth_Source/video_name and the --src_mouth_frame_path flag will be disabled.

  3. --src_mouth_frame_path. When --src_audio_path is not a video path, this flag provides the folder containing the video frames synced with the source audio.

  4. --src_input_path is the path to the input reference image. When the path is a video file, we will convert it to frames.

  5. --csv_path denotes the path where the metadata will be saved.
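
For example, a hypothetical invocation that rebuilds the demo entry from the files shipped in the misc folder might look like the following (the exact flag behavior and defaults are defined in scripts/prepare_testing_files.py; the paths mirror the demo entry, and --src_input_path may need to point at a single image file instead of a folder):

python scripts/prepare_testing_files.py \
    --src_input_path misc/Input/517600055 \
    --src_pose_path misc/Pose_Source/517600078 \
    --src_audio_path misc/Audio_Source/681600002.mp3 \
    --src_mouth_frame_path misc/Mouth_Source/681600002 \
    --csv_path ./misc/demo2.csv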

You can manually modify the metadata csv file or add lines to it according to the rules defined in the scripts/prepare_testing_files.py file or the dataloader data/voxtest_dataset.py.
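
For reference, the demo entry in misc/demo.csv (also quoted in the comments below) is shown here. Read against the flags above, the columns appear to be the input folder and its frame count, the pose source folder and its frame count, the audio file, the mouth-frame folder and its frame count, and a trailing placeholder; the authoritative parsing lives in data/voxtest_dataset.py.

misc/Input/517600055 1 misc/Pose_Source/517600078 160 misc/Audio_Source/681600002.mp3 misc/Mouth_Source/681600002 363 dummy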

We provide a number of demo choices in the misc folder, including several used in our video. Feel free to rearrange them, even across folders, and you are welcome to record audio files yourself.

  • Self-Prepared Data Processing

Our model handles only VoxCeleb2-like cropped data, thus pre-processing is needed for self-prepared data.

  • Coming soon

Train Your Own Model

  • Coming soon

License and Citation

This software is released under the CC-BY-4.0 license.

@InProceedings{zhou2021pose,
author = {Zhou, Hang and Sun, Yasheng and Wu, Wayne and Loy, Chen Change and Wang, Xiaogang and Liu, Ziwei},
title = {Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Acknowledgement

Comments
  • Talking Face with just audio input

    Talking Face with just audio input

    Hi, thank you for your amazing work.

    I am just wondering if it's possible to render without a mouth frame, based on the audio alone (similar to what Wav2Lip does)?

    If so, can you tell me how to do it? I've been trying to figure out whether it's possible, and I keep running into an Exception: None Image error when I set the mouth frame paths to None and the number of frames to 0 in demo.csv.

    opened by bycloudai 7
  • About contrastive learning

    About contrastive learning

    Thanks for sharing your work! Where is the script for contrastive learning between the image features and the audio features? I found a class in models/networks/loss.py, SoftmaxContrastiveLoss; is this the realization of the contrastive learning?

    opened by NNNNAI 5
  • Question on pose space training

    Question on pose space training

    Thanks so much for the great work and codes. When I read the paper and codes, I get confused about the pose space learning part.

    As stated in the paper's training strategy, the identity encoder and speech content space are pre-trained first and then loaded into the overall framework to train the generator and learn the pose space. I understand the training procedure; however, for learning the pose space, I am confused about whether you use the loss (compute_diff_loss)

    https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L401

    when training the whole generator. If so, it would be consistent with the code to compute the L1 loss between the pose differences and pose_feature_audio.

    Last but not least, congratulations on the research progress, I think it is a great breakthrough to disentangle the sync information and head pose in the feature representation. Looking forward to your reply!

    opened by eddiecong 4
  • Training code will be released?

    Training code will be released?

    Hi,

    Congratulations on this great work and thanks for releasing the codes! The results are mighty impressive! Any plans on releasing the training code too?

    opened by Rudrabha 3
  • How to calculate the file "demo.csv" parameters (160, 363)?

    How to calculate the file "demo.csv" parameters (160, 363)?

    Hi, thanks for your great work. What do the parameters (160, 363) in the file "demo.csv" mean?

    misc/Input/517600055 1 misc/Pose_Source/517600078 160 misc/Audio_Source/681600002.mp3 misc/Mouth_Source/681600002 363 dummy

    opened by DWCTOD 3
  • Questions about the demo video in project page

    Questions about the demo video in project page

    In the demo video on the project page, there are generated faces of Obama and Biden at time stamp 2:26. It seems there is a slight identity mismatch problem, which may be a potential direction for improvement. Could you please tell us the source videos you used in this case, or their location in VoxCeleb2 (maybe), for better reference? Thanks in advance, and I appreciate the great work on PC-AVS!

    opened by DaddyJin 2
  • Can you share the details of the augmentation generating the non-id space?

    Can you share the details of the augmentation generating the non-id space?

    I notice that you mentioned the paper "Neural Head Reenactment with Latent Pose Descriptors" in an issue about the non-id space. I tried the augmentation used in their code, but it seems different from yours, with fewer changes. This part seems vital, since the non-id space is what enables the model to disentangle the identity feature from the pose feature, so I would like to know the augmentation details in your paper. Can you share them with us?

    opened by makpia 2
  • About pretrained speech content space

    About pretrained speech content space

    Thank you for the great work. In Equation 2, you use F_c^v and F_c^a to calculate the loss function L_c^v2a. However, in the code av_model.py, when mode == 'sync', you use the function sync(self, augmented, spectrogram) to train the speech content space. If I am not wrong, in that function you use F_n of the non-identity space and F_c^a to calculate the loss function L_c^v2a.

    Does it play the same role as equation 2? Looking for your reply and best wishes!

    opened by DaddyJin 2
  • Running code on own image?

    Running code on own image?

    Hello, I have configured the project as suggested here. I am able to run it on the demo images placed inside ./misc/input/some_id. My question is: how can I run the project on my own image?

    For example, when I create a folder named 123456 inside ./misc/input/, place my own 224x224 image inside it as 000000.jpg (the complete path is ./misc/input/123456/000000.jpg), modify the demo.csv file accordingly, and run the code, I do not get the desired results.

    Please help me.

    opened by mayanktiwariiiitdmj 2
  • LRW + VoxCeleb

    LRW + VoxCeleb

    Hi Hang, did you combine the two datasets for training the lip-sync and test separately on each, or did you separate the training datasets? I wonder whether VoxCeleb2 alone can achieve good lip-sync performance. Thanks.

    opened by Yingying6 1
  • Is it possible to do lip sync without audio?

    Is it possible to do lip sync without audio?

    I'm just curious whether it's possible to generate a video from only mouth_source and input, ignoring the audio source (setting it to None), similar to how lip sync works. I tested this myself and it didn't work.

    Sorry to bother you again. I tried to figure out by looking through the codes but that didn't help.

    opened by bycloudai 1
  • ffmpeg: not found

    ffmpeg: not found

    Some trouble during "sudo apt install ffmpeg". The following PPAs helped:

    sudo add-apt-repository ppa:savoury1/ffmpeg4
    sudo add-apt-repository ppa:savoury1/graphics

    https://blog.csdn.net/jn10010537/article/details/124078608

    opened by yfszzx 0
  • Why start from 2?

    Why start from 2?

    self.target_frame_inds = np.arange(2, len(self.spectrogram) // self.audio.num_bins_per_frame - 2)

    In voxtest_dataset.py L107, the target frame index starts from 2. But in the paper, it starts from 1 (which means starting from 0 in Python). I don't understand why this is. Thanks.

    opened by 9B8DY6 0
  • Does training procedure need any other py file or module that I have to make on my own?

    Does training procedure need any other py file or module that I have to make on my own?

    The author Hangz answered someone else's question about the training code, saying that the losses and the different modules needed for training are available in av_model.py. But is there any difference between training and inference that I would have to implement on my own?

    opened by 9B8DY6 0
  • why embedding the audio features

    why embedding the audio features

    Hi, thanks for sharing this great work!

    I understand the main pipeline, i.e., encoding the speech content features, id features, and pose features respectively, then feeding them to the generator to obtain the driven results. But I am a little bit confused after reading the inference code. https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L473-L484

    As can be seen, the mel-spectrogram is encoded by the audio encoder first in Line 473 and is ready to be fused with the pose feature in Line 483. However, in the merge_mouthpose() function: https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L454-L461

    I found that the audio features are further embedded; what is the intuition behind that? In my view, netE.mouth_embed would be used to embed the mouth features from the video but NOT from the audio. If anything is wrong, please correct me. Thanks in advance.

    opened by e4s2022 0