AudioDVP: Photorealistic Audio-driven Video Portraits

Last update: Jan 3, 2023

Overview

AudioDVP

This is the official implementation of Photorealistic Audio-driven Video Portraits.

Major Requirements

Ubuntu >= 18.04
PyTorch >= 1.2
GCC >= 7.5
NVCC >= 10.1
FFmpeg (with H.264 support)

FYI, the detailed environment setup is in enviroment.yml. (You don't have to install everything listed there; install packages as needed when you encounter an import error.)
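
As a quick sanity check before starting, the command-line tools from the requirements list can be looked up on PATH. The following is a minimal sketch, not part of the repository: the helper name find_missing_tools and the executable names gcc, nvcc and ffmpeg are assumptions for illustration, and the check does not verify version numbers or H.264 support.

```python
import shutil

# Command-line tools named in the requirements list (executable names are assumed).
REQUIRED_TOOLS = ("gcc", "nvcc", "ffmpeg")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the function can be tested without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

A tool reported as missing only means it is not on PATH in the current shell; it may still be installed elsewhere (for example under a CUDA toolkit directory).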

Major implementation differences from the original paper

The geometry and texture parameters of the 3DMM are now initialized from zero and shared among all samples during fitting, since this is more reasonable.
OpenCV is used rather than PIL for image editing operations.
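
For readers unfamiliar with 3DMM fitting, the first point can be written out under the standard linear morphable-model formulation. The notation below is an assumption for illustration and is not taken from this README: alpha denotes geometry, beta texture, delta expression, and i indexes video frames.

```latex
% Standard linear 3DMM (assumed notation, for illustration only).
% Geometry (alpha) and texture (beta) carry no frame index: they are shared by all frames.
% Expression (delta_i) is estimated separately for each frame i.
S_i = \bar{S} + B_{\mathrm{geo}}\,\alpha + B_{\mathrm{exp}}\,\delta_i,
\qquad
T = \bar{T} + B_{\mathrm{tex}}\,\beta,
\qquad
\alpha^{(0)} = \mathbf{0},\quad \beta^{(0)} = \mathbf{0}
```

With zero initialization, fitting starts from the mean face shape and mean texture, and the shared parameters are refined jointly over all frames of the same person.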

Usage

1. Download face model data

Download Basel Face Model 2009. (Register and get 01_MorphableModel.mat.)
Download the expression basis from 3DFace. (Exp_Pca.bin is in CoarseData.)
Download auxiliary files from Deep3DFaceReconstruction.

Put the data in renderer/data, following the structure below.

renderer/data
├── 01_MorphableModel.mat
├── Exp_Pca.bin
├── BFM_front_idx.mat
├── BFM_exp_idx.mat
├── facemodel_info.mat
├── select_vertex_id.mat
├── std_exp.txt
└── data.mat (generated by step 2 below)
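
Before running step 2, it can help to confirm that every downloaded file is in place. The following is a minimal sketch, assuming it is run from the repository root; the helper name missing_face_model_files is hypothetical and not part of the repository. data.mat is left out of the check on purpose, because it is produced by step 2.

```python
from pathlib import Path

# Files that must be downloaded manually, as listed in the directory tree above.
# data.mat is excluded because it is generated by build_data.py in step 2.
REQUIRED_FILES = (
    "01_MorphableModel.mat",
    "Exp_Pca.bin",
    "BFM_front_idx.mat",
    "BFM_exp_idx.mat",
    "facemodel_info.mat",
    "select_vertex_id.mat",
    "std_exp.txt",
)


def missing_face_model_files(data_dir="renderer/data"):
    """Return the required file names that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_face_model_files()
    if missing:
        print("Missing files in renderer/data:", ", ".join(missing))
    else:
        print("All required face model files are present.")
```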

2. Build data

cd renderer/
python build_data.py

3. Download pretrained model of ATnet

The link is here.
Put atnet_lstm_18.pth in vendor/ATVGnet/model.

4. Download pretrained ResNet on VGGFace2

The link is here.
Put resnet50_ft_weight.pkl in weights.

5. Download Trump speech video

The link is here. (Video courtesy of The White House.)
Put it in data/video.

6. Compile CUDA rasterizer kernel

cd renderer/kernels
python setup.py build_ext --inplace

7. Running demo script

# Explanation of every step is provided.
./scripts/demo.sh

Since we provide both training and inference code, we do not provide a pretrained model at present. The expected result is in data/sample_result.mp4, generated using the synthesized audio in data/test_audio.

Acknowledgment

This work is built upon a great deal of open-source code and data.

Many implementation details are learned from Deep3DFaceReconstruction.
ATVGnet in the vendor directory is directly borrowed from ATVGnet under MIT License.
neural-face-renderer in the vendor directory is heavily borrowed from CycleGAN and pix2pix in PyTorch under BSD License.
The pre-trained ResNet model on VGGFace2 dataset is from VGGFace2-pytorch under MIT License.
Basel2009 3D face dataset is from here.
The expression basis of 3DMM is from 3DFace under GPL License.
Our renderer is heavily borrowed from tf_mesh_renderer and inspired by pytorch_mesh_renderer.

Notification

Our method is built upon Deep Video Portraits.
Our method adopts a person-specific Audio2Expression module, which is less robust than a universal one trained on a large dataset such as Lip Reading Sentences in the Wild. A universal one is encouraged! Fortunately, our method works quite well on WaveNet-synthesized audio such as that provided in data/test_audio.
The code has NOT been fully tested on another clean machine.
There is a known bug in the rasterizer: in some corner cases, several pixels of the rendered face are black (not assigned any color) due to floating-point error, which I have not been able to fix.
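
One possible post-processing workaround for the black-pixel issue, which is not part of this repository and does not fix the rasterizer itself, is to fill uncovered pixels from their covered neighbors. The sketch below is a pure-Python illustration only: the function name fill_black_pixels, the nested-list image layout, and the face mask input are all assumptions, and a real pipeline would more likely operate on image arrays.

```python
def fill_black_pixels(image, mask):
    """Fill pure-black pixels inside the face mask with the mean of their neighbors.

    image: list of rows, each row a list of (r, g, b) integer tuples.
    mask:  list of rows of booleans, True where the rendered face should cover the pixel.
    Returns a new image; the input is left unchanged.
    """
    height, width = len(image), len(image[0])
    black = (0, 0, 0)
    result = [list(row) for row in image]
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or image[y][x] != black:
                continue
            # Collect the 4-connected neighbors that are inside the mask and not black.
            neighbors = [
                image[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < width
                and mask[ny][nx] and image[ny][nx] != black
            ]
            if neighbors:
                # Average each color channel over the valid neighbors.
                result[y][x] = tuple(
                    sum(channel) // len(neighbors) for channel in zip(*neighbors)
                )
    return result
```

Note that this heuristic cannot tell an unassigned pixel from a pixel that is legitimately pure black, so it is only suitable when holes are isolated and rare.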

Disclaimer

We made this code publicly available to benefit the graphics and vision community. Please DO NOT abuse the code for malicious purposes.

Citation

@article{wen2020audiodvp,
    author={Xin Wen and Miao Wang and Christian Richardt and Ze-Yin Chen and Shi-Min Hu},
    journal={IEEE Transactions on Visualization and Computer Graphics}, 
    title={Photorealistic Audio-driven Video Portraits}, 
    year={2020},
    volume={26},
    number={12},
    pages={3457-3466},
    doi={10.1109/TVCG.2020.3023573}
}

License

BSD

Comments

Step: Crop and resize video frames too slow

# crop and resize video frames
python utils/crop_portrait.py \
    --data_dir $video_dir \
    --crop_level 2.0 \
    --vertical_adjust 0.2

Do you have the same problem? Or does it have to be slow?

opened by House-Leo 6
Can't find expression basis

Hi,

Thank you for your amazing work! I want to run your code, but I can't find the expression basis at https://github.com/Juyong/3DFace. TAT

Did I miss important information?

opened by qianyunw 6
RuntimeError: triangles.is_contiguous() INTERNAL ASSERT FAILED at "rasterize_triangles.cpp"

This error occurs when I run the train.py script. I don't know how to fix it.

python train.py --data_dir $video_dir --num_epoch 20 --serial_batches False --display_freq 400 --print_freq 400 --batch_size 5

opened by iamchenxin-coder 4
When I get to the "dataset [singledataset] was created" step, something goes wrong

Error in rasterize_triangles_cuda_forward: invalid device function
Error in rasterize_triangles_cuda_forward: invalid device function
(epoch: 0, iters: 400, data: 0.293, comp: 0.205) Photometric: 0.00000 Landmark: 1836.84644 Alpha: 0.05272 Beta: 0.00000 Delta: 63.04750

opened by supersarr 3
triangles must be contiguous

RuntimeError: triangles.is_contiguous() INTERNAL ASSERT FAILED at rasterize_triangles.cpp:47, please report a bug to PyTorch. triangles must be contiguous

I ran into this problem. Do you know how to solve it? Thanks.
duplicate

opened by ghost 3
RuntimeError: triangles.is_contiguous() INTERNAL ASSERT FAILED ,can u give some ad

When I run:

# 3D face reconstruction
python train.py \
    --data_dir $video_dir \
    --num_epoch 20 \
    --serial_batches False \
    --display_freq 400 \
    --print_freq 400 \
    --batch_size 5

opened by anthonyyuan 3
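Request for help with the Audio2Expression model

Hello, and thank you for open-sourcing the code. We noticed that your Audio2Expression model uses only 3 convolutional layers plus one fully connected layer. In TensorBoard we observed the loss drop to about 3e-3, but when we used the model for inference we found the following problems: 1. The lip movement amplitude is small. 2. The person's lips jitter (similar to shivering). We tried modifying Audio2Expression (adding more fully connected layers and increasing batch_size), which reduced the loss to 1.5e-4, but the results were even worse (the trembling became more severe, with repeated jitter). Do you have any suggestions or advice? Thank you!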

opened by 821029883 2
./scripts/demo.sh: line 104: 30820 Killed (out of memory)

./scripts/demo.sh: line 104: 30820 Killed. The process runs out of memory. At which step should I reduce memory usage?

./scripts/demo.sh: line 104: 30820 Killed /home/chenbl/anaconda3/envs/cbl/bin/python3 vendor/neural-face-renderer/train.py --dataroot $video_dir/nfr/AB --name fr --model nfr --checkpoints_dir $video_dir/checkpoints --netG unet_256 --direction BtoA --lambda_L1 100 --dataset_mode temporal --norm batch --pool_size 0 --use_refine --nput_nc 21 --Nw 7 --batch_size 8 --preprocess none --num_threads 4 --n_epochs 6 --n_epochs_decay 0 --load_size 256

opened by ghost 2
pretrained model

Hello, can you provide a pretrained model? I would like to see results on other specific examples. Do you have a Baidu Cloud link?

opened by chensheng-code 2
Could not find the rasterize_triangles_cpp

ERROR: Could not find a version that satisfies the requirement rasterize_triangles_cpp==0.0.0 (from versions: none)
ERROR: No matching distribution found for rasterize_triangles_cpp==0.0.0

What I tried:
1) Switching to various package sources does not work.
2) Deleting "==0.0.0" does not work.

opened by doctorimage 2

fix issues #9 and #23

Fix the error that occurred when using PyTorch 1.8 and CUDA 11.1.

RuntimeError: triangles.is_contiguous() INTERNAL ASSERT FAILED at "rasterize_triangles.cpp":47, please report a bug to PyTorch. triangles must be contiguous

opened by quqixun 1

Owner

GitHub https://github.com/xinwen-cs/AudioDVP
