Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss (ATVGnet)

Overview


By Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu.

University of Rochester.

Table of Contents

  1. Introduction
  2. Citation
  3. Running
  4. Model
  5. Results
  6. Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. This code can be applied directly to LRW and GRID. The outputs of the model are visualized as follows: the first image is the landmark synthesized by ATnet; the rest are the attention map, the motion map, and the final results from VGnet.


Citation

If you use any code, models, or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

  1. This code is tested under Python 2.7. The model we provide is trained on LRW; however, it also works well on GRID, VoxCeleb, and other datasets, so you can compare it directly against your own model on other datasets. We consider this a fair comparison.

  2. PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch)

  3. Install the dependencies (pip install -r requirement.txt)

  4. Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder. A quick load check is sketched below.
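
    Once downloaded, a simple sanity check is that the files load as plain state dicts (the file names here follow the released weights mentioned in the issues below):

      import torch

      at_state = torch.load('model/atnet_lstm_18.pth', map_location='cpu')
      vg_state = torch.load('model/generator_23.pth', map_location='cpu')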

  5. Run the demo code: python demo.py

    • -device_ids: gpu id
    • -cuda: using cuda or not
    • -vg_model: pretrained VGnet weight
    • -at_model: pretrained ATnet weight
    • -lstm: use lstm or not
    • -p: input example image
    • -i: input audio file
    • -sample_dir: folder to save the outputs
    • ...
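
    For example, a typical invocation might look like the following (the paths are placeholders, the weight file names follow the released files mentioned in the issues below, and the exact flag syntax should be checked against the argparse setup in demo.py):

      python demo.py -cuda True -device_ids 0 -at_model ../model/atnet_lstm_18.pth -vg_model ../model/generator_23.pth -lstm True -p ../image/example.jpg -i ../audio/example.wav -sample_dir ../results/
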
  6. Download and unzip the training data from LRW

  7. Preprocess the data: extract landmarks and crop the images with dlib (a rough sketch follows below).
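
    A minimal sketch of this step, assuming the standard dlib 68-point predictor (shape_predictor_68_face_landmarks.dat); the crop margins and target resolution used by the authors' unreleased script are unknown, so the values below are placeholders:

      import cv2
      import dlib
      import numpy as np

      detector = dlib.get_frontal_face_detector()
      predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

      img = cv2.imread('example.jpg')
      gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
      rect = detector(gray, 1)[0]  # assume a single face per image

      # 68 (x, y) landmark coordinates for the detected face
      shape = predictor(gray, rect)
      lmark = np.array([[p.x, p.y] for p in shape.parts()])

      # crop the detected face region and save both the crop and the landmarks
      x1, y1, x2, y2 = rect.left(), rect.top(), rect.right(), rect.bottom()
      crop = cv2.resize(img[max(y1, 0):y2, max(x1, 0):x2], (128, 128))  # target size is a guess
      cv2.imwrite('example_crop.jpg', crop)
      np.save('example_lmark.npy', lmark)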

  8. Train the ATnet model: python atnet.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -lstm: use lstm or not
    • -sample_dir: folder to save visualized images during training
    • ...
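
    A typical run might look like this (the batch size and directories are illustrative, not the paper's settings):

      python atnet.py -device_ids 0 -batch_size 16 -lstm True -model_dir ../model/atnet/ -sample_dir ../sample/atnet/
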
  9. Test the model: python atnet_test.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • -lstm: use lstm or not
    • ...
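
    For example (illustrative values; the weight file name follows the released ATnet checkpoint):

      python atnet_test.py -device_ids 0 -batch_size 8 -model_name ../model/atnet_lstm_18.pth -lstm True -sample_dir ../results/atnet_test/
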
  10. Train the VGnet: python vgnet.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -sample_dir: folder to save visualized images during training
    • ...
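
    For example (illustrative values):

      python vgnet.py -device_ids 0 -batch_size 16 -model_dir ../model/vgnet/ -sample_dir ../sample/vgnet/
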
  11. Test the VGnet: python vgnet_test.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • ...
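
    For example (illustrative values; the weight file name follows the released VGnet checkpoint):

      python vgnet_test.py -device_ids 0 -batch_size 8 -model_name ../model/generator_23.pth -sample_dir ../results/vgnet_test/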

Model

  1. Overall ATVGnet model

  2. Regression-based discriminator network


Results

  1. Result visualization on different datasets.

  2. Results compared with other SOTA methods.

  3. Studies on image robustness with respect to landmark accuracy.

  4. Quantitative results.

Disclaimer and known issues

  1. These codes are implemented in PyTorch.
  2. In this paper, we train on LRW and GRID separately.
  3. The models are sensitive to input images; please use the correct preprocessing code.
  4. The data-processing code is not finished yet; it will be released soon. In the meantime, you can try the model with your own images.
  5. If you want to train these models with this version of PyTorch without modifications, please note:
    • You need at least 12 GB of GPU memory.
    • There might be some other untested issues.
  6. There is other interesting and useful research on audio-to-landmark generation. Please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.

Todos

  • Release training data

License

MIT

Comments
  • Segmentation fault when running program

    Hi @lelechen63. When I ran the demo, I encountered a segmentation fault. After debugging, it looks like torch and dlib are the cause. Can you share the version of dlib?

    opened by HallidayReadyOne 11
  • is there some processed example data for reference?

    I am trying to create data for this project. Are there any images or landmark*.npy files that can be downloaded, just one processed landmark .npy and image region? Thank you.

    opened by lianDaniel 10
  • Wrong images generated

    Hi, I cloned the code and ran demo.py with no errors, but the generated video and images are completely wrong, like this:

    [attached images: img, motion, attention]

    By the way, I ran in CPU mode. Thanks in advance.

    opened by taylorlu 8
  • Using Pre-trained Model on GRID dataset

    Dear Authors, thanks for sharing the code. I just wanted to know whether the pre-trained model you have released can be used on the GRID dataset. I am only interested in running your demo.py. For example, I want to:

    1. give a frame from the GRID dataset as a target frame
    2. provide a .wav audio file from GRID

    Do you think that should work, or do I need to take any special care for the GRID dataset demo?

    opened by avisekiit 4
  • Where is the network architecture?

    We are trying to deploy this project in an Android application. To do so, we need to convert the pretrained PyTorch models (atnet_lstm_18.pth and generator_23.pth) to TensorFlow, but it fails with a 'state_dict' error. When I load the pretrained models, they only contain the weights, not the architecture. Can you guide me to where I can find the model architecture, and how to convert it to TensorFlow?

    opened by AtaUllahB 4
  • The use of mean_shape_norm.npy & S.npy & S_3d.npy?

    Hello, I learned about ATVGNet at the CVPR 2019 site, and it is very interesting work! But some places confused me when I read the code afterwards.

    1. What is the use of mean_shape_norm.npy, S.npy, and S_3d.npy? I know they are used to normalize landmarks, but what is the individual function of each file? Knowing this might give me a better understanding of the code.

    2. How do you get mean_shape_norm.npy, S.npy, and S_3d.npy?

    Looking forward to your reply.

    opened by Songluchuan 3
  • Questions about AT-net labels

    Hello, and thank you for your excellent work. I have two questions.

    1. I noticed that in the AT-net dataset processing, the MFCC feature vectors are stacked 16 times and concatenated. What is the rationale for stacking 16 times? (We noticed a similar practice in DeepSpeech.) The step "t_mfcc =mfcc[(r + ind - 3)*4: (r + ind + 4)*4, 1:]" already selects the feature vectors of the 3 frames before and after the current frame, 280 ms in total, so why concatenate them 16 times?

    2. In "landmark =lmark[r+1 : r + 17,:]", the label would normally be centered on the current frame; why are the 16 frames after the current frame used as the label instead?

    Looking forward to your answer. Many thanks!

    opened by 821029883 2
  • Comparing on GRID dataset

    Dear Authors, thanks for the awesome release of the paper and code.

    I was trying to compare our results with yours on the GRID dataset for the LMD metric. Can you please tell me, regarding the paper:

    1. Which subject IDs of GRID did you use for testing?

    2. How many keypoints did you use for each subject? I usually use a dlib detector, which gives me 68 keypoints.

    3. Do you perform any normalization of the keypoints (after getting raw pixel coordinates from a dlib detector) to remove scale effects before calculating the difference between real and synthetic faces?

    4. Lastly, when you report SSIM and PSNR, do you calculate those metrics on the entire frame or just the cropped-out face regions? I just want to make sure we compare fairly with you, so I am keenly looking forward to your kind reply.

    Thanks, Avisek Lahiri

    opened by avisekiit 2
  • State_dict

    I'm trying to run this script:

        import torchvision
        import torch
        from somefile import modelarchitecture

        model = modelarchitecture()
        model.load_state_dict(torch.load(???))
        model.eval()

    Please guide me on which values to set for "model" and for the path inside torch.load.

    opened by AtaUllahB 2
  • Question about PCA preprocessing

    In demo.py, you multiply example_landmark by 5 before applying PCA:

        example_landmark = example_landmark * 5.0
        example_landmark = example_landmark - mean.expand_as(example_landmark)
        example_landmark = torch.mm(example_landmark, pca)

    And for fake_lmark, you multiply by 2 times 1.1~1.5 before applying the inverse PCA:

        fake_lmark[:, 1:6] *= 2 * torch.FloatTensor(np.array([1.1, 1.2, 1.3, 1.4, 1.5])).cuda()
        fake_lmark = torch.mm(fake_lmark, pca.t())
        fake_lmark = fake_lmark + mean.expand_as(fake_lmark)

    So (1) do you apply different scaling parameters to example_landmark and fake_lmark? And (2) how were those scaling parameters (5 vs. 2.2~3.0) selected?

    opened by pcgreat 2
  • Is it possible to generate the landmark from the VG output?

    Hi, I'm trying to get the landmarks from the output facial images. Since the VG output images are cropped, the landmarks from dlib are not very stable. Is it possible to generate the landmarks directly from the VG net, since we already have a landmark input for it? Thanks.

    opened by snowzhangy 2
  • error running demo.py cv2

    Traceback (most recent call last):
      File "demo.py", line 486, in <module>
        test()
      File "demo.py", line 465, in test
        fake_store = restore_image(orgImage, rect, fake_store, indx)
      File "demo.py", line 196, in restore_image
        cv2.normalize(img, img, 0, 255, cv2.NORM_MINMAX)
    cv2.error: OpenCV(4.5.5) :-1: error: (-5:Bad argument) in function 'normalize'

    Overload resolution failed:

    • Layout of the output array dst is incompatible with cv::Mat
    • Expected Ptr<cv::UMat> for argument 'dst'
    opened by Nyrize 0
  • Error when installing opencv

    https://github.com/lelechen63/ATVGnet/blob/2d4d1b03df1c706c6575b942fd2a1585347c4aab/requirement.txt#L9

    The above line in requirement.txt throws a pretty long error that is not worth reproducing in its entirety:

    Getting requirements to build wheel ... error
      ERROR: Command errored out with exit status 1:
       command: /home/arta/anaconda3/envs/py2/bin/python /home/arta/anaconda3/envs/py2/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmp88lNX5
           cwd: /tmp/pip-install-hhbrzT/opencv-python
    

    I found a website that, when translated to English, describes a workaround whereby you specify a version of opencv-python that is compatible with Python 2.7:

    python -m pip install opencv-python==4.2.0.32

    I hope this helps.

    opened by aseyedia 1
  • Is it possible to get the weights of the discriminator that was used during the training of VGNet?

    Hello @lelechen63,

    I found your work very interesting. I want to use it for fine-tuning on my own data; however, I could not find a pretrained discriminator for the model, and when I try to fine-tune with only the generator, the model breaks. Could you please advise me on where it can be found? In the link provided on GitHub, only a generator is available.
    Thank you in advance

    opened by kail-ai 0
  • training vgnet.py

    While training vgnet.py, the model can't find the file "new_img_full_gt_train.pkl". What should this file contain, and how do I create it? Could anyone who has worked on this help?

    Thank you in advance.

    opened by Mora-max 3