Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss （ATVGnet）

Lele Chen

Last update: Dec 27, 2022

By Lele Chen, Ross K Maddox, Zhiyao Duan, Chenliang Xu.

University of Rochester.

Introduction
Citation
Running
Model
Results
Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. The code can be applied directly to LRW and GRID. The model outputs are visualized here: the first is the landmark synthesized by ATnet; the rest are the attention map, motion map, and final result from VGnet.

Citation

If you use any code, models, or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

This code is tested under Python 2.7. The model we provide is trained on LRW, but it also works on GRID, VoxCeleb, and other datasets. You can compare this model directly against your own model on other datasets; we consider this a fair comparison.
PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch).
Install the requirements (pip install -r requirement.txt).
Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder.
Run the demo code: python demo.py
- -device_ids: gpu id
- -cuda: using cuda or not
- -vg_model: pretrained VGnet weight
- -at_model: pretrained ATnet weight
- -lstm: use lstm or not
- -p: input example image
- -i: input audio file
- -sample_dir: folder to save the outputs
- ...
Download and unzip the training data from LRW
Preprocess the data (extract the landmarks and crop the images with dlib).
Train the ATnet model: python atnet.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_dir: folder to save weights
- -lstm: use lstm or not
- -sample_dir: folder to save visualized images during training
- ...
Test the model: python atnet_test.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_name: pretrained weights
- -sample_dir: folder to save the outputs
- -lstm: use lstm or not
- ...
Train the VGnet: python vgnet.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_dir: folder to save weights
- -sample_dir: folder to save visualized images during training
- ...
Test the VGnet: python vgnet_test.py
- -device_ids: gpu id
- -batch_size: batch size
- -model_name: pretrained weights
- -sample_dir: folder to save the outputs
- ...
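
The demo step above can be scripted. Below is a minimal sketch that assembles the demo.py command line from the options listed for the demo. The helper name build_demo_command and the example file names are hypothetical, and the flag spellings are copied from the list as written, so check them against the argument parser in demo.py before relying on them.

```python
def build_demo_command(options, script="demo.py"):
    """Return an argv list for the demo script.

    `options` is a list of (flag, value) pairs, with flags spelled
    as in the option list above.
    """
    argv = ["python", script]
    for flag, value in options:
        argv.extend([flag, str(value)])
    return argv


# Hypothetical input and output names; replace them with your own files.
command = build_demo_command([
    ("-p", "my_face.jpg"),      # input example image
    ("-i", "my_speech.wav"),    # input audio file
    ("-sample_dir", "results"), # folder to save the outputs
])
print(" ".join(command))

# To actually launch the demo (needs the repo, weights and dependencies):
# import subprocess
# subprocess.check_call(command)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.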

Model

Overall ATVGnet
Regression-based discriminator network

Results

Result visualization on different datasets:
Results compared with other SOTA methods:
Study of image robustness with respect to landmark accuracy:
Quantitative results:

Disclaimer and known issues

The code is implemented in PyTorch.
In the paper, we train on LRW and GRID separately.
The model is sensitive to input images. Please use the correct preprocessing code.
The data processing code is not finished yet and will be released soon. In the meantime, you can try the model with your own image.
If you want to train these models with this version of PyTorch without modifications, please note that:
- You need at least 12 GB of GPU memory.
- There might be other untested issues.
There is another interesting and useful work on audio-to-landmark generation. Please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.

Todos

Release training data

License

MIT

Comments

Segmentation fault when running program

Hi @lelechen63. When I ran the demo, I hit a segmentation fault. After debugging, torch and dlib appear to be the cause. Could you share the dlib version you used?

opened by HallidayReadyOne 11
is there some processed example data for reference?

I am trying to create data for this project. Are there any images or landmark*.npy files that can be downloaded, for example one processed landmark .npy file and the matching image region? Thank you.

opened by lianDaniel 10
Wrong images generated

Hi, I cloned the code and ran demo.py without errors, but the generated video and images are completely wrong, like this:

img

motion

attention

By the way, I ran in CPU mode. Thanks in advance.

opened by taylorlu 8
Using Pre-trained Model on GRID dataset
Dear Authors, thanks for sharing the code. I wanted to know whether the pre-trained model you released can be used on the GRID dataset. I am only interested in running your demo.py. For example, I want to:

- give a frame from the GRID dataset as the target frame
- provide a .wav audio file from GRID

Do you think that should work, or do I need to take any special care for a GRID demo?
opened by avisekiit 4
Where is the network architecture?

We are trying to deploy this project in an Android application. To do so, we need to convert the pretrained PyTorch models (atnet_lstm_18.pth and generator_23.pth) to TensorFlow, but the conversion fails with a 'state_dict' error. When I load the pretrained models, I get only the weights, not the architecture. Can you point me to where the model architecture is defined, and explain how to convert it to TensorFlow?

opened by AtaUllahB 4
The use of mean_shape_norm.npy & S.npy & S_3d.npy?

Hello, I learned about ATVGnet at CVPR 2019, and it is very interesting work! But when I read the code afterwards, some parts confused me.

1. What is the use of mean_shape_norm.npy, S.npy, and S_3d.npy? I know they are used to normalize the landmarks, but what is the individual role of each file? Knowing this would help me understand the code better.

2. How do I obtain mean_shape_norm.npy, S.npy, and S_3d.npy?

Looking forward to your reply.

opened by Songluchuan 3
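Question about AT-net labels

Hello, and thank you for your excellent work. I have two questions. 1. I noticed that in the AT-net dataset processing, the MFCC feature vectors are stacked into groups of 16 and concatenated. What is the reason for stacking 16? (We noticed a similar practice in DeepSpeech.) The step "t_mfcc =mfcc[(r + ind - 3)*4: (r + ind + 4)*4, 1:]" already selects the feature vectors of the 3 frames before and after, 280 ms in total, so why concatenate them 16 times? 2. "landmark =lmark[r+1 : r + 17,:]": normally the label would be centered on the current frame, so why are the landmarks of the 16 frames after the current frame chosen as the label here?

Looking forward to your answer. Many thanks!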

opened by 821029883 2
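
The index arithmetic in the two quoted lines can be checked by hand. The sketch below only reproduces the slice bounds; it is not the repository's dataset code. It assumes that ind runs from 1 to 16, that there are 4 MFCC rows per video frame (read off the factor 4 in the slice), and that the video is 25 fps, so that 7 frames span the 280 ms mentioned in the question. The helper name mfcc_window and the start frame r = 3 are hypothetical.

```python
def mfcc_window(r, ind, rows_per_frame=4, context=3):
    """Row range [start, stop) of the MFCC slice centred on video frame r + ind."""
    centre = r + ind
    start = (centre - context) * rows_per_frame
    stop = (centre + context + 1) * rows_per_frame
    return start, stop


r = 3                 # hypothetical start frame, chosen so no index is negative
steps = range(1, 17)  # assumption: ind takes the values 1..16

windows = [mfcc_window(r, ind) for ind in steps]
label_rows = list(range(r + 1, r + 17))  # rows selected by lmark[r+1 : r+17]

for start, stop in windows:
    # 3 frames before + centre frame + 3 frames after = 7 frames x 4 rows
    assert stop - start == 28

# Under these assumptions, each window is centred on the frame of one label row.
centres = [r + ind for ind in steps]
print(windows[0], windows[-1], centres == label_rows)
```

Read this way, the 16 stacked windows form a sequence of 16 time steps, each paired with the landmark of its own centre frame. Whether that matches the authors' intent is not confirmed by the source.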
Comparing on GRID dataset
Dear Authors, Thanks for the awesome release of the paper and code.

I was trying to compare our results with yours on the GRID dataset for the LMD metric. Could you please tell me, for the paper:

- Which subject IDs of GRID did you use for testing?
- How many keypoints did you use for each subject? I usually use a dlib detector, which gives me 68 keypoints.
- Did you normalize the keypoints (after getting raw pixel coordinates from a dlib detector) to remove scale effects before calculating the difference between real and synthetic faces?

Lastly, when you report SSIM and PSNR, do you calculate those metrics on the entire frame or only on the cropped face regions? I just want to make sure that we compare fairly with you. Looking forward to your reply.

Thanks, Avisek Lahiri
opened by avisekiit 2
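
The questions above turn on how a landmark distance is computed. As a point of reference only, here is a minimal sketch of one common reading of such a metric: the mean Euclidean distance between corresponding landmarks over all frames. It is not taken from the paper or the repository; the function name, the absence of any normalization, and the toy coordinates are all assumptions.

```python
import math


def landmark_distance(real_frames, fake_frames):
    """Mean Euclidean distance between corresponding 2-D landmarks.

    Each argument is a list of frames; each frame is a list of (x, y) points.
    """
    total = 0.0
    count = 0
    for real, fake in zip(real_frames, fake_frames):
        for (rx, ry), (fx, fy) in zip(real, fake):
            total += math.hypot(rx - fx, ry - fy)
            count += 1
    if count == 0:
        raise ValueError("no landmarks to compare")
    return total / count


# Toy data: one frame with two landmarks.
real = [[(0.0, 0.0), (1.0, 1.0)]]
fake = [[(3.0, 4.0), (1.0, 1.0)]]
score = landmark_distance(real, fake)
print(score)
```

Because the raw distance scales with face size in pixels, two evaluations are only comparable if they share the same crop, resolution, and normalization, which is exactly what the questions ask the authors to specify.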
State_dict

I'm trying to run this script:

import torchvision
import torch
from somefile import modelarchitecture
model = modelarchitecture()
model.load_state_dict(torch.load(???))
model.eval()

Please guide me on what to use for "model" and for the path inside torch.load.

opened by AtaUllahB 2
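
As general PyTorch background for the two state_dict questions: a state_dict holds only parameter tensors keyed by name, so the model class must be defined in code and instantiated before the weights are loaded. The sketch below shows that round trip with a hypothetical stand-in class, TinyNet; it is not one of the repository's networks, and an in-memory buffer replaces the .pth file.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a model class defined in the repository code."""

    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


# Save only the weights (no architecture) to an in-memory buffer.
source = TinyNet()
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# Loading requires building the same architecture first.
restored = TinyNet()
restored.load_state_dict(torch.load(buffer))
restored.eval()

same_weights = all(
    torch.equal(a, b)
    for a, b in zip(source.state_dict().values(), restored.state_dict().values())
)
print(sorted(restored.state_dict().keys()), same_weights)
```

For the pretrained files, the same pattern would apply with the generator classes from the repository's model code in place of TinyNet and the .pth path in place of the buffer.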
Question about PCA preprocessing
In demo.py, you multiply the example_landmark by 5 before applying PCA:

example_landmark = example_landmark * 5.0
example_landmark = example_landmark - mean.expand_as(example_landmark)
example_landmark = torch.mm(example_landmark, pca)

And for fake_landmarks, you multiply by 2 times 1.1~1.5 before applying PCA:

fake_lmark[:, 1:6] *= 2 * torch.FloatTensor(np.array([1.1, 1.2, 1.3, 1.4, 1.5])).cuda()
fake_lmark = torch.mm(fake_lmark, pca.t())
fake_lmark = fake_lmark + mean.expand_as(fake_lmark)

So (1) do you apply different scaling parameters to example_landmark and fake_landmarks? (2) How were those scaling parameters (5 vs. 2.2~3.0) selected?
opened by pcgreat 2
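
The two quoted snippets are a PCA projection (subtract the mean, multiply by the basis) and its inverse (multiply by the transposed basis, add the mean). The NumPy sketch below reproduces that pattern on toy data to show that the two operations undo each other when the basis has orthonormal columns. The sizes, random values, and function names are hypothetical, and it does not explain how the scaling factors in demo.py were chosen.

```python
import numpy as np


def project(landmarks, mean, basis):
    """Map row-vector landmarks (N, D) to PCA coefficients (N, K)."""
    return (landmarks - mean) @ basis


def reconstruct(coeffs, mean, basis):
    """Map PCA coefficients (N, K) back to landmark space (N, D)."""
    return coeffs @ basis.T + mean


# Hypothetical toy sizes: D = 8 landmark coordinates, K = 6 components.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(8, 6)))  # orthonormal columns
mean = rng.normal(size=(1, 8))

coeffs = rng.normal(size=(2, 6))
landmarks = reconstruct(coeffs, mean, basis)

# With orthonormal columns, projection inverts reconstruction.
roundtrip_ok = np.allclose(project(landmarks, mean, basis), coeffs)

# Scale coefficients 1..5 as in the quoted snippet; coefficient 0 is untouched.
scaled = coeffs.copy()
scaled[:, 1:6] *= 2 * np.array([1.1, 1.2, 1.3, 1.4, 1.5])
exaggerated = reconstruct(scaled, mean, basis)
print(roundtrip_ok, exaggerated.shape)
```

Scaling coefficients before reconstruction amplifies the landmark variation along the corresponding components, which is one plausible reading of why the generated coefficients are multiplied before the inverse step.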
Is it possible to generate the landmark from the VG output?

Hi, I'm trying to get landmarks from the output face images. Since the VG output images are cropped, the landmarks detected by dlib are not very stable. Is it possible to generate the landmarks directly from the VG net, since we already have a landmark input for it? Thanks.

opened by snowzhangy 2
error running demo.py cv2
Traceback (most recent call last):
  File "demo.py", line 486, in <module>
    test()
  File "demo.py", line 465, in test
    fake_store = restore_image(orgImage,rect,fake_store,indx)
  File "demo.py", line 196, in restore_image
    cv2.normalize(img, img, 0, 255, cv2.NORM_MINMAX)
cv2.error: OpenCV(4.5.5) :-1: error: (-5:Bad argument) in function 'normalize'
Overload resolution failed:
- Layout of the output array dst is incompatible with cv::Mat
- Expected Ptr<cv::UMat> for argument 'dst'
opened by Nyrize 0
Error when installing opencv
https://github.com/lelechen63/ATVGnet/blob/2d4d1b03df1c706c6575b942fd2a1585347c4aab/requirement.txt#L9

The above line in requirements.txt throws a pretty long error that is not worth reproducing in its entirety:

Getting requirements to build wheel ... error
ERROR: Command errored out with exit status 1:
command: /home/arta/anaconda3/envs/py2/bin/python /home/arta/anaconda3/envs/py2/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmp88lNX5
cwd: /tmp/pip-install-hhbrzT/opencv-python

I found this website that, when translated to English, describes a workaround in which you specify a version of opencv that is compatible with Python 2.7:

python -m pip install opencv-python==4.2.0.32

I hope this helps.
opened by aseyedia 1
Is it possible to get the weights of the discriminator that was used during the training of VGnet?

Hello @lelechen63,

I found your work very interesting. I want to fine-tune it on my own data, but I could not find a pretrained discriminator for the model, and when I try to fine-tune with only the generator, the model breaks. Could you please advise me on where it can be found? The link provided on GitHub contains only the generator part.
Thank you in advance.

opened by kail-ai 0
training vgnet.py

While training with vgnet.py, the script could not find the file "new_img_full_gt_train.pkl". What should this file contain, and how do I create it? Could anyone who has worked on this help?

thank you in advance.

opened by Mora-max 3

Owner

Lele Chen

I am a Ph.D. candidate at the University of Rochester, supervised by Prof. Chenliang Xu.

