Overview

SpeechDrivesTemplates

The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

[arxiv / video]

Our paper and this repo focus on upper-body pose generation from audio. To synthesize images from poses, please refer to this Pose2Img repo.

  • Code
  • Model
  • Data preparation

Package Hierarchy

|-- config
|     |-- default.py
|     |-- voice2pose_s2g_speech2gesture.yaml        # baseline: speech2gesture
|     |-- voice2pose_sdt_vae_speech2gesture.yaml    # ours (VAE)
|     |-- pose2pose_speech2gesture.yaml             # gesture reconstruction
|     `-- voice2pose_sdt_bp_speech2gesture.yaml     # ours (Backprop)
|
|-- core
|     |-- datasets
|     |-- networks
|     |-- pipelines
|     `-- utils
|
|-- dataset
|     `-- speech2gesture  # create a soft link here
|
|-- output
|     `-- <date-config-tag>  # a directory for each experiment
|
`-- main.py

Setup the Dataset

Datasets should be placed in the dataset directory. Simply create a soft link like this:

ln -s <path-to-SPEECH2GESTURE-dataset> ./dataset/speech2gesture

For your own dataset, you need to implement a subclass of torch.utils.data.Dataset in core/datasets/custom_dataset.py.
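
As a starting point, here is a minimal sketch of such a subclass; the .npz clip layout and the 'audio'/'poses' keys are illustrative assumptions, not the repo's actual format:

# core/datasets/custom_dataset.py -- a minimal sketch, not the repo's actual interface.
import numpy as np
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, clip_paths):
        # clip_paths: list of .npz files, one audio/pose clip each (hypothetical layout)
        self.clip_paths = clip_paths

    def __len__(self):
        return len(self.clip_paths)

    def __getitem__(self, idx):
        clip = np.load(self.clip_paths[idx])
        audio = torch.from_numpy(clip['audio']).float()  # waveform or audio features
        poses = torch.from_numpy(clip['poses']).float()  # e.g. (T, num_keypoints, 2)
        return {'audio': audio, 'poses': poses}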

Train

Train a Model from Scratch

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --tag DEV \
    SYS.NUM_WORKERS 32
  • --tag sets the name of the experiment, which will appear in the output directory name.
  • You can override any parameter defined in the default config (default.py) by simply appending KEY VALUE pairs at the end of the command. The example above sets SYS.NUM_WORKERS to 32 temporarily; see the sketch below.
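
These trailing KEY VALUE overrides are typical of yacs-style configs. A minimal sketch of the pattern, assuming a yacs CfgNode and a hypothetical get_cfg_defaults() accessor in default.py:

# Sketch of command-line config overriding with a yacs-style CfgNode.
# get_cfg_defaults() is a hypothetical accessor, not necessarily the repo's API.
from config.default import get_cfg_defaults

cfg = get_cfg_defaults()
cfg.merge_from_file('configs/voice2pose_sdt_bp_speech2gesture.yaml')  # --config_file
cfg.merge_from_list(['SYS.NUM_WORKERS', '32'])  # trailing KEY VALUE pairs
cfg.freeze()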

Resume Training from an Interrupted Experiment

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --resume_from <checkpoint-to-continue-from>
  • This command loads the state_dict of both the model and the optimizer from the checkpoint, and writes results to the original directory in which the checkpoint lies.

Train from a Pretrained Model

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --pretrain_from <checkpoint-to-continue-from> \
    --tag DEV
  • This command will only load the state_dict for the model, and write results to a new base directory.
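
In other words, the two flags differ in how much of the checkpoint they restore. A rough sketch, where model and optimizer are your instantiated objects (the 'model_state_dict' key appears in the repo's tracebacks; the optimizer key name is an assumption):

# Sketch: what --pretrain_from vs --resume_from roughly do with a checkpoint.
import torch

ckpt = torch.load('<path-to-checkpoint>', map_location='cpu')

# --pretrain_from: restore the network weights only
model.load_state_dict(ckpt['model_state_dict'])

# --resume_from: additionally restore the optimizer state (key name assumed)
optimizer.load_state_dict(ckpt['optimizer_state_dict'])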

Test

To test the model, run this command:

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --tag DEV \
    --test-only \
    --checkpoint <path-to-checkpoint>

Demo

python main.py --config_file configs/voice2pose_sdt_bp_speech2gesture.yaml \
    --tag <DEV> \
    --demo_input <audio.wav> \
    --checkpoint <path-to-checkpoint> \
    DATASET.SPEAKER oliver \
    SYS.VIDEO_FORMAT "['mp4']"

Important Details

Dataset caching

We turn on dataset caching (DATASET.CACHING) by default to speed up training.

If you encounter errors in the dataloader like RuntimeError: received 0 items of ancdata, please increase ulimit by running the command ulimit -n 262144. (refer to this issue)
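
If raising the limit is not possible (e.g. inside a container), a common PyTorch-side workaround for this particular error is to change the worker sharing strategy; this is a general suggestion rather than something the repo configures:

# General PyTorch workaround for "received 0 items of ancdata":
# share tensors via the file system instead of file descriptors.
import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy('file_system')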

DataParallel and DistributedDataParallel

We use a single GPU (wrapped in DataParallel) by default, since it is fast enough with dataset caching. For multi-GPU training, we recommend DistributedDataParallel (DDP) because it provides SyncBN across GPU cards. To enable DDP, set SYS.DISTRIBUTED to True and set SYS.WORLD_SIZE to the number of GPUs.

When using DDP, make sure that the batch_size is exactly divisible by SYS.WORLD_SIZE.
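
A quick sanity check makes this requirement explicit; cfg.TRAIN.BATCH_SIZE is a hypothetical config key used for illustration:

# Sketch: fail fast if the global batch size does not split evenly across GPUs.
assert cfg.TRAIN.BATCH_SIZE % cfg.SYS.WORLD_SIZE == 0, \
    'batch_size must be divisible by SYS.WORLD_SIZE'
batch_size_per_gpu = cfg.TRAIN.BATCH_SIZE // cfg.SYS.WORLD_SIZE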

Misc

  • To run any module other than the main files in the root directory, for example core/datasets/speech2gesture.py, you should run python -m core.datasets.speech2gesture rather than python core/datasets/speech2gesture.py. This is a consequence of Python's relative importing and is worth understanding in depth.
  • We save a checkpoint and conduct validation after each epoch. You can change the interval in the config file.
  • During training we generate and save 2 videos per epoch; during validation we sample 8 videos per epoch. These videos are saved to TensorBoard (without sound) and as mp4 files (with sound). You can change the SYS.VIDEO_FORMAT parameter to select one or both formats.
  • We usually set NUM_WORKERS to 32 for best performance. If you encounter memory-related errors, try lowering NUM_WORKERS.

Citation

@inproceedings{qian2021speech,
  title={Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates},
  author={Qian, Shenhan and Tu, Zhi and Zhi, YiHao and Liu, Wen and Gao, Shenghua},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2021}
}
Comments
  • Dataset processing

    Dataset processing

    I have a few questions regarding the dataset processing pipeline:

    • In the generate_clips script, why is the start index 80?
    • Why are there 13 records labeled "idle" in each clip in the train/test split file?
    • Are there any parameters I would need to adjust when creating my own dataset?

    By the way, there is an error in 3_2_split_train_val_test.py: the validation samples are named "val" while the model searches for records labeled "dev".

    opened by maherr13 4
  • No audio files in datasets.

    No audio files in datasets.

    "To ease later research, we pack our processed data including 2d human pose sequences and corresponding audio clips." Hello, I download the dataset from the link you provide,but I found there is no audio files ,just have npz files. Should I generate audio files by myself ? I want to use Luo's data to train model .

    opened by liuweie 2
  • RuntimeError: External code not provide

    RuntimeError: External code not provide

    I ran the following command to test the VAE demo: python main.py --config_file configs/voice2pose_sdt_vae.yaml --tag luo --demo_input audio1.wav --checkpoint checkpoints/voice2pose_sdt_vae-luo-ep100.pth DATASET.SPEAKER luo

    Then the following error occurred:

    Traceback (most recent call last):
      File "main.py", line 73, in <module>
        main()
      File "main.py", line 69, in main
        run(args, cfg)
      File "main.py", line 45, in run
        pipeline.demo(cfg, exp_tag, args.checkpoint, args.demo_input)
      File "D:\SpeechDrivesTemplates\core\pipelines\trainer.py", line 462, in demo
        self.base_path = self.setup_experiment(False, exp_tag, checkpoint=checkpoint, demo_input=demo_input)
      File "D:\SpeechDrivesTemplates\core\pipelines\trainer.py", line 221, in setup_experiment
        self.setup_model(self.cfg, state_dict=checkpoint['model_state_dict'])
      File "D:\SpeechDrivesTemplates\core\pipelines\voice2pose.py", line 221, in setup_model
        self.model = Voice2PoseModel(cfg, state_dict, self.num_train_samples, self.get_rank()).cuda()
      File "D:\SpeechDrivesTemplates\core\pipelines\voice2pose.py", line 48, in __init__
        raise RuntimeError('External code not provide.')
    RuntimeError: External code not provide.

    opened by liuweie 2
  • Checkpoint

    Checkpoint

    Your model is fascinating and I would like to test it. Could you please provide the checkpoint file? Hello! We are very interested in your method and would like to see how it performs, but we could not find where to download the weights. Could you share them? Many thanks!

    opened by Violetaye 2
  • language dependent

    language dependent

    Hi, how much do you think the model is language dependent? or do you think it is more dependent on the sound of the audio? Thank you for the checkpoints, I managed to make it work :)

    opened by ireneb612 2
  • Fix for the ffmpeg.concat error in save_video_in_mp4 on Windows

    Fix for the ffmpeg.concat error in save_video_in_mp4 on Windows

    First of all, thank the author for replying to me by email and providing me with some solutions. Now that the problem has been solved, provide an issue for reference.

    When I reproduced the code to try the demo, I ran the following command: python main.py --config_file configs/voice2pose_sdt_bp.yaml --tag luo --demo_input audio1.wav --checkpoint voice2pose_sdt_bp-luo-ep100.pth DATASET.SPEAKER luo. The following error occurred: FileNotFoundError: [WinError 2] The system cannot find the file specified.

    I tried many methods but still could not get the ffmpeg.concat function to run correctly. In the end, my solution was to give up on ffmpeg.concat and use another method instead.

    opened by liuweie 1
  • Keypoints format

    Keypoints format

    Thanks for your awesome work! When generating keypoints with OpenPose, the output format is JSON files, while your code expects npy files. Since 2_1_gen_kpts.py is not finished yet, are there instructions for reshaping the keypoints as required by your scripts?

    opened by Ibrahimatef 1
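
    Until that script is finished, one possible bridge is to stack OpenPose's per-frame JSON into a single array. This sketch assumes OpenPose's standard pose_keypoints_2d output and a guessed (T, K, 3) target layout, not the repo's documented format:

    # Sketch: stack OpenPose per-frame JSON keypoints into one .npy array.
    # Assumes one detected person per frame and OpenPose's default JSON schema.
    import glob
    import json
    import numpy as np

    frames = []
    for path in sorted(glob.glob('openpose_out/*_keypoints.json')):
        with open(path) as f:
            people = json.load(f)['people']
        # pose_keypoints_2d is a flat [x0, y0, c0, x1, y1, c1, ...] list
        kpts = np.array(people[0]['pose_keypoints_2d']).reshape(-1, 3)
        frames.append(kpts)

    np.save('keypoints.npy', np.stack(frames))  # shape (T, K, 3): x, y, confidence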
  • The generated video shows keypoints rather than a real person. Why?

    The generated video shows keypoints rather than a real person. Why?

    python main.py --config_file configs/voice2pose_sdt_bp.yaml \
        --tag oliver \
        --demo_input demo_audio.wav \
        --checkpoint \
        DATASET.SPEAKER oliver

    I generated the video with this script and it ran successfully, but the result is a keypoint video, not a real-person video. The audio matches, but no real-person video was synthesized. Could you explain why?

    opened by heiheiwangergou 2
  • How to generate gestures corresponding to specific semantics?

    How to generate gestures corresponding to specific semantics?

    Hello, I recently read your great paper. I don't know much about the co-speech gesture generation task, so I have some questions and would like to ask your advice:

    1. Neither your paper nor other similar SOTA papers seem to incorporate many semantic features, so is this model unable to generate gestures with specific semantics? For example, when I say "here", can it generate a corresponding pointing gesture?
    2. At present there are two main types of methods for the co-speech generation task: rule-based and data-driven. If I want to generate gestures for specific semantics, should I combine the model with a rule-based method?
    opened by liuweie 1
  • SYS.DISTRIBUTED

    SYS.DISTRIBUTED

    I'm trying to train on the xing processed_data from scratch using DDP, with SYS.DISTRIBUTED True SYS.WORLD_SIZE 4 (4 GPUs), and I get:

    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

    opened by marson666 7
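
    The first remedy suggested by the error message looks like this when wrapping the model, where model is the instantiated network; a general PyTorch snippet (assuming a torchrun launch), not a patch from this repo:

    # General PyTorch DDP wrapping with unused-parameter detection enabled.
    import os
    import torch

    torch.distributed.init_process_group(backend='nccl')  # assumes launch via torchrun
    local_rank = int(os.environ['LOCAL_RANK'])
    model = torch.nn.parallel.DistributedDataParallel(
        model.cuda(local_rank),
        device_ids=[local_rank],
        find_unused_parameters=True,  # tolerate parameters that get no gradient
    )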