Code for CVPR 2022 paper "Bailando: 3D dance generation via Actor-Critic GPT with Choreographic Memory"

Overview

Bailando

Code for CVPR 2022 (oral) paper "Bailando: 3D dance generation via Actor-Critic GPT with Choreographic Memory"

[Paper] | [Project Page] | [Video Demo]

Do not hesitate to give a star!

Driving 3D characters to dance following a piece of music is highly challenging due to the spatial constraints applied to poses by choreography norms. In addition, the generated dance sequence also needs to maintain temporal coherency with different music genres. To tackle these challenges, we propose a novel music-to-dance framework, Bailando, with two powerful components: 1) a choreographic memory that learns to summarize meaningful dancing units from 3D pose sequence to a quantized codebook, 2) an actor-critic Generative Pre-trained Transformer (GPT) that composes these units to a fluent dance coherent to the music. With the learned choreographic memory, dance generation is realized on the quantized units that meet high choreography standards, such that the generated dancing sequences are confined within the spatial constraints. To achieve synchronized alignment between diverse motion tempos and music beats, we introduce an actor-critic-based reinforcement learning scheme to the GPT with a newly-designed beat-align reward function. Extensive experiments on the standard benchmark demonstrate that our proposed framework achieves state-of-the-art performance both qualitatively and quantitatively. Notably, the learned choreographic memory is shown to discover human-interpretable dancing-style poses in an unsupervised manner.

Code

Environment

PyTorch == 1.6.0

Data preparation

In our experiments, we use AIST++ for both training and evaluation. Please visit here to download the AIST++ annotations and unzip them into the './aist_plusplus_final/' folder, then visit here to download all original music pieces (wav) into './aist_plusplus_final/all_musics'. Next, set up the AIST++ API from here and download the required SMPL models from here. Create a folder './smpl', copy the downloaded 'male' SMPL model (with '_m' in its name) to 'smpl/SMPL_MALE.pkl', and finally run

./prepare_aistpp_data.sh

to produce the features for training and testing. Alternatively, if you don't wish to process the data yourself, download our preprocessed features from here and place them in the ./data folder.
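If you process the data yourself, it can help to confirm that the inputs listed above are in place before running ./prepare_aistpp_data.sh. A minimal sketch, assuming the paths are exactly those given in this section; `missing_inputs` is a hypothetical helper and not part of the repository:

```python
from pathlib import Path

# Paths taken from the data-preparation instructions above.
EXPECTED_PATHS = (
    "aist_plusplus_final",
    "aist_plusplus_final/all_musics",
    "smpl/SMPL_MALE.pkl",
)


def missing_inputs(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing before running prepare_aistpp_data.sh:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected inputs found.")
```

Run it from the repository root; an empty result only means the paths exist, not that their contents are complete.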

Training

Training Bailando comprises 4 steps, run in the following order. If you are using the Slurm workload manager, you can run the corresponding shell scripts directly. Otherwise, please remove the 'srun' parts. All our models are trained on a single NVIDIA V100 GPU. * A kind reminder: the quantization code does not support multi-GPU training.
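Without Slurm, one way to keep a run on a single GPU is to restrict the visible devices before launching. A sketch, assuming the non-Slurm entry point for Step 1 is `python -u main.py --config <file> --train` (the form quoted in the issues further down) and that GPU index 0 is the one you want; `single_gpu_launch` is a hypothetical helper:

```python
import os
import shlex


def single_gpu_launch(config, gpu_index=0, base_env=None):
    """Build the command and environment for a one-GPU training run.

    Nothing is executed here; pass the result to subprocess.run yourself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA only exposes the listed device(s) to the launched process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    cmd = ["python", "-u", "main.py", "--config", config, "--train"]
    return cmd, env


cmd, env = single_gpu_launch("configs/sep_vqvae.yaml")
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(cmd))
```

With only one device visible, the process cannot spread a batch across several GPUs, which is the situation the reminder above warns about.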

Step 1: Train pose VQ-VAE (without global velocity)

sh srun.sh configs/sep_vqvae.yaml train [your node name] 1

Step 2: Train global velocity branch of pose VQ-VAE

sh srun.sh configs/sep_vqvae_root.yaml train [your node name] 1

Step 3: Train motion GPT

sh srun_gpt_all.sh configs/cc_motion_gpt.yaml train [your node name] 1

Step 4: Actor-Critic finetuning on target music

sh srun_actor_critic.sh configs/actor_critic.yaml train [your node name] 1

Evaluation

To test with our pretrained models, please download the weights from here (Google Drive), or download the four weights separately from [weight 1]|[weight 2]|[weight 3]|[weight4] (Jianguoyun, 坚果云), into the ./experiments folder.

1. Generate dancing results

To test the VQ-VAE (with or without global shift, as set in the config):

sh srun.sh configs/sep_vqvae.yaml eval [your node name] 1

To test GPT:

sh srun_gpt_all.sh configs/cc_motion_gpt.yaml eval [your node name] 1

To test the final results:

sh srun_actor_critic.sh configs/actor_critic.yaml eval [your node name] 1

2. Dance quality evaluations

After generating the dances in the step above, run the following commands.

Step 1: Extract the (kinetic & manual) features of all AIST++ motions (this only needs to be done once):

python extract_aist_features.py

Step 2: Compute the evaluation metrics:

python utils/metrics_new.py

It will show exactly the same values reported in the paper. To speed up the computation, comment out Line 184 of utils/metrics_new.py once the ground-truth features have been computed. To test another folder, change Line 182 to your destination, or kindly modify this code to a "non-hard-coded" version :)
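One way to avoid editing the hard-coded folder on Line 182 is to read it from the command line. A minimal sketch; the flag names `--pred-root` and `--skip-gt-features` are made up for illustration and are not options of the actual utils/metrics_new.py:

```python
import argparse


def parse_metric_args(argv=None):
    """Parse hypothetical options for a configurable metrics script."""
    parser = argparse.ArgumentParser(description="Dance quality metrics (sketch)")
    parser.add_argument(
        "--pred-root",
        required=True,
        help="folder holding the generated dances to evaluate",
    )
    parser.add_argument(
        "--skip-gt-features",
        action="store_true",
        help="reuse ground-truth features computed in an earlier run",
    )
    return parser.parse_args(argv)


args = parse_metric_args(["--pred-root", "my_results", "--skip-gt-features"])
print(args.pred_root, args.skip_gt_features)  # my_results True
```

The parsed values would then replace the fixed path and the line that is otherwise commented out by hand.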

Choreography for music in the wild

TODO

Citation

@inproceedings{siyao2022bailando,
    title={Bailando: 3D dance generation via Actor-Critic GPT with Choreographic Memory},
    author={Siyao, Li and Yu, Weijiang and Gu, Tianpei and Lin, Chunze and Wang, Quan and Qian, Chen and Loy, Chen Change and Liu, Ziwei},
    booktitle={CVPR},
    year={2022}
}

License

Our code is released under the MIT License.

Comments
  • Some file missing in data.zip

    Hi, thanks for sharing your code! I downloaded the data.zip that you released, but I got an error. I checked the data and found that aistpp_music_feat_7.5fps/mJB4.json is empty. Is it a mistake?
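A quick way to find such files is to scan a feature folder for JSON files that are empty or fail to parse. A sketch using only the standard library; `find_bad_json` is a hypothetical helper, not part of the repository:

```python
import json
from pathlib import Path


def find_bad_json(folder):
    """Return sorted names of *.json files that are empty or not valid JSON."""
    bad = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with path.open("r", encoding="utf-8") as handle:
                json.load(handle)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            bad.append(path.name)
    return bad
```

For example, `find_bad_json("data/aistpp_music_feat_7.5fps")` would list the empty file reported above, assuming the archive was unpacked into ./data.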

    opened by xljh0520 3
  • The generated person sometimes disappears in the video?

    Thank you for your work! I followed the steps of "Choreographic for music in the wild" and got the output videos of the model. Problem: the person in the videos sometimes disappears, and there is no sound in the video. Q1: Is there a way to solve the problem? Q2: Will this kind of disappearance affect the location of the 3D keypoints?

    opened by aleeyang 2
  • ValueError: only one element tensors can be converted to Python scalars

    Hey, I face this problem using the first step command

    python -u main.py --config configs/sep_vqvae.yaml --train

    The output is:

    using SepVQVAE
    We use bottleneck!
    No motion regularization!
    We use bottleneck!
    No motion regularization!
    train with AIST++ dataset!
    test with AIST++ dataset!
    {'structure': {'name': 'SepVQVAE', 'up_half': {'levels': 1, 'downs_t': [3], 'strides_t': [2], 'emb_width': 512, 'l_bins': 512, 'l_mu': 0.99, 'commit': 0.02, 'hvqvae_multipliers': [1], 'width': 512, 'depth': 3, 'm_conv': 1.0, 'dilation_growth_rate': 3, 'sample_length': 240, 'use_bottleneck': True, 'joint_channel': 3, 'vel': 1, 'acc': 1, 'vqvae_reverse_decoder_dilation': True, 'dilation_cycle': None}, 'down_half': {'levels': 1, 'downs_t': [3], 'strides_t': [2], 'emb_width': 512, 'l_bins': 512, 'l_mu': 0.99, 'commit': 0.02, 'hvqvae_multipliers': [1], 'width': 512, 'depth': 3, 'm_conv': 1.0, 'dilation_growth_rate': 3, 'sample_length': 240, 'use_bottleneck': True, 'joint_channel': 3, 'vel': 1, 'acc': 1, 'vqvae_reverse_decoder_dilation': True, 'dilation_cycle': None}, 'use_bottleneck': True, 'joint_channel': 3, 'l_bins': 512}, 'loss_weight': {'mse_weight': 1}, 'optimizer': {'type': 'Adam', 'kwargs': {'lr': 3e-05, 'betas': [0.5, 0.999], 'weight_decay': 0}, 'schedular_kwargs': {'milestones': [100, 200], 'gamma': 0.1}}, 'data': {'name': 'aist', 'train_dir': 'data/aistpp_train_wav', 'test_dir': 'data/aistpp_test_full_wav', 'seq_len': 240, 'data_type': 'None'}, 'testing': {'height': 540, 'width': 960, 'ckpt_epoch': 500}, 'expname': 'sep_vqvae', 'epoch': 500, 'batch_size': 32, 'save_per_epochs': 20, 'test_freq': 20, 'log_per_updates': 1, 'seed': 42, 'rotmat': False, 'cuda': True, 'global_vel': False, 'ds_rate': 8, 'move_train': 40, 'sample_code_length': 150, 'sample_code_rate': 16, 'analysis_sequence': [[126, 81]], 'config': 'configs/sep_vqvae.yaml', 'train': True, 'eval': False, 'visgt': False, 'anl': False, 'sample': False}
    07/03/2022 03:06:44 Epoch: 1
    Traceback (most recent call last):
      File "main.py", line 56, in <module>
        main()
      File "main.py", line 40, in main
        agent.train()
      File "/home/fuyang/project/Bailando/motion_vqvae.py", line 107, in train
        'loss': loss.item(),
    ValueError: only one element tensors can be converted to Python scalars

    My environment is PyTorch 1.11.0+cu102 with 8 GPUs (NVIDIA TITAN Xp, 12196 MiB).

    opened by aleeyang 2
  • training problem

    I hit an error at Step 1 when running python -u main.py --config configs/sep_vqvae.yaml --train

    Traceback (most recent call last):
      File "main.py", line 56, in <module>
        main()
      File "main.py", line 40, in main
        agent.train()
      File "/share/yanzhen/Bailando/motion_vqvae.py", line 94, in train
        loss.backward()
      File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/autograd/__init__.py", line 166, in backward
        grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
      File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/autograd/__init__.py", line 67, in _make_grads
        raise RuntimeError("grad can be implicitly created only for scalar outputs")
    RuntimeError: grad can be implicitly created only for scalar outputs
    

    After printing the loss, it looks like tensor([0.2667, 0.2735, 0.2687, 0.2584, 0.2701, 0.2697, 0.2571, 0.2658], device='cuda:0', grad_fn=<GatherBackward>), so do I need to apply a mean or sum?

    However, even if I take the mean, the training still seems problematic. The loss decreases normally, but in the eval stage the output quants are all zero. Any suggestions?

    The training log is attached for reference.

    log.txt

    @lisiyao21

    opened by haofanwang 2
  • Why extract audio features twice with different sampling rates?

    Hi, Siyao~ Thanks for releasing and cleaning the code!!

    May I ask why in the pre-processing part, the audio (music) features are extracted twice and with different sampling rates?

    Precisely, in _prepro_aistpp.py, the audio features are extracted with the sampling rate 15360*2

    While in _prepro_aistpp_music.py, the audio features are extracted with the sampling rate 15360*2/8
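For reference, the two rates differ by exactly a factor of 8. The sketch below works out the resulting feature frame rates, assuming a hop length of 512 samples per feature frame (an assumption, not confirmed in this thread); under that assumption the reduced rate gives 7.5 frames per second, which matches the `aistpp_music_feat_7.5fps` folder name mentioned in another issue:

```python
HOP_LENGTH = 512  # assumed samples per feature frame, not confirmed by the source


def feature_fps(sample_rate, hop_length=HOP_LENGTH):
    """Feature frames per second for a given audio sampling rate."""
    return sample_rate / hop_length


full_rate = 15360 * 2         # rate quoted for _prepro_aistpp.py
reduced_rate = 15360 * 2 / 8  # rate quoted for _prepro_aistpp_music.py

print(feature_fps(full_rate))     # 60.0
print(feature_fps(reduced_rate))  # 7.5
```

This is only the arithmetic behind the two numbers; why the pipeline needs both rates is for the author to confirm.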

    opened by XiSHEN0220 2
  • Processed data

    Hi, there is no link to the processed data, although the README says: "Otherwise, directly download our preprocessed feature from here as ./data folder if you don't wish to process the data."

    Can you add the data link? Thanks!

    opened by LaLaLailalai 2
  • where is the definition of "from .utils.torch_utils import assert_shape"

    Hi @lisiyao21, thank you for releasing the code! Where is the definition of "from .utils.torch_utils import assert_shape"? https://github.com/lisiyao21/Bailando/blob/27fe2b63896a2e31928b22944bac10455413263e/models/encdec.py#L4

    opened by zhangsanfeng86 2
  • Visualize gt error.

    I want to compare the generated results with the ground truth. It seems that the code also supports visualizing the ground truth by passing the visgt parameter. However, when I call the script with the visgt parameter, the code does not seem to be fully implemented: the last two parameters of the visualizeAndWrite function are not set correctly. How should I set these two parameters (especially the last one, quants) to make the function execute correctly?

    opened by miibotree 1
  • No such file or directory: '/mnt/lustre/share/lisiyao1/original_videos/aistpp-audio/'

    Bailando/utils/functional.py", line 130, in img2video
        music_names = sorted(os.listdir(audio_dir))
    FileNotFoundError: [Errno 2] No such file or directory: '/mnt/lustre/share/lisiyao1/original_videos/aistpp-audio/'

    opened by donghaoye 1
  • about the data process

    Hello Siyao! I'm reading your code and I'm confused about the 'align' function in '_prepro_aistpp.py'. To make the length (time) of the music equal to that of the dance, you throw the extra features away. Is that reasonable? Why not use uniform sampling? Sorry for bothering you.

    opened by pengc02 0
  • about the run command

    I am a beginner in DL. In the command sh srun.sh configs/sep_vqvae.yaml train [your node name] 1, what does '[your node name]' mean? Can you give me a more specific command? Thanks a lot!

    opened by aleeyang 0
  • Warning when evaluating

    Hi, Siyao! I ran the command python extract_aist_features.py to extract the (kinetic & manual) features of all AIST++ motions. However, I got a warning:

    WARNING: You are using a SMPL model, with only 10 shape coefficients.
    

    Do you know the reason?

    opened by LinghaoChan 1
  • About "cc_motion_gpt.yaml"

    Hi Siyao, thanks for your great work. I have a question. When I train in Step 3 (Train motion GPT), an error occurs: "AttributeError: 'MCTall' object has no attribute 'training_data'". I checked "cc_motion_gpt.yaml" and found "need_not_train_data: True", which prevents "def _build_train_loader(self):" from running. Is that correct? Or should I change "need_not_train_data" to "false"?

    opened by im1eon 1
  • Is there a way to change the ‘Starting pose codes’?

    Hi, thank you for your work again! It really inspires me and has sparked my interest in deep learning. Amazing job! Problem: the generated dance videos are all in the same style, which may not fit my music ('青花瓷-jay_chou'). I suppose this may be caused by the starting pose code, but I cannot find how to choose it or where to set it.

    Q1: Is there a way to change the 'Starting pose codes' mentioned in your paper? Q2: How do I choose the starting pose codes? Is there a table that explicitly maps the starting pose codes to dance styles?

    Thank you again! Aleeyanger

    opened by aleeyang 0
  • The meanings of FID_k and FID_g of groundtruth?

    Hi authors,

    Thank you for your fantastic work! I have a small question: in Table 1, FID_k and FID_g of the groundtruth are reported. I am a little confused by this. Are they computed between two identical sets of groundtruth data? In other words, why are FID_k and FID_g of the groundtruth not 0?

    Thank you, Best

    opened by by2101 1
  • Out of Memory

    Hi there, when I ran the second step following your instructions, I hit an "out of memory" problem. I debugged it and found that music_data is float64, so memory is consumed rapidly when the list music_data is converted to music_np (in File "utils/functional.py"). Have you ever met the same problem? Is it possible to use float32 for the training data (music_np) without decreasing the performance of the final model?

    BTW: There are 120G of memory in my computer.

    opened by lucaskingjade 1
Owner
Li Siyao
an interesting PhD student