Code for CVPR 2022 paper "Bailando: 3D dance generation via Actor-Critic GPT with Choreographic Memory"

Li Siyao

Last update: Dec 29, 2022

Overview

Bailando

Code for CVPR 2022 (oral) paper "Bailando: 3D dance generation via Actor-Critic GPT with Choreographic Memory"

[Paper] | [Project Page] | [Video Demo]

✨ Do not hesitate to give a star! ✨

Driving 3D characters to dance following a piece of music is highly challenging due to the spatial constraints applied to poses by choreography norms. In addition, the generated dance sequence also needs to maintain temporal coherency with different music genres. To tackle these challenges, we propose a novel music-to-dance framework, Bailando, with two powerful components: 1) a choreographic memory that learns to summarize meaningful dancing units from 3D pose sequence to a quantized codebook, 2) an actor-critic Generative Pre-trained Transformer (GPT) that composes these units to a fluent dance coherent to the music. With the learned choreographic memory, dance generation is realized on the quantized units that meet high choreography standards, such that the generated dancing sequences are confined within the spatial constraints. To achieve synchronized alignment between diverse motion tempos and music beats, we introduce an actor-critic-based reinforcement learning scheme to the GPT with a newly-designed beat-align reward function. Extensive experiments on the standard benchmark demonstrate that our proposed framework achieves state-of-the-art performance both qualitatively and quantitatively. Notably, the learned choreographic memory is shown to discover human-interpretable dancing-style poses in an unsupervised manner.
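To make the idea of a quantized codebook concrete, here is a minimal sketch of the nearest-neighbour lookup that vector quantization relies on. It is an illustration only, not the repository's implementation: the function names `nearest_code` and `quantize_sequence`, the toy 2-D codebook, and the choice of squared Euclidean distance are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature`."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(feature, codebook[i]),
    )


def quantize_sequence(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map every feature to its nearest code (hypothetical helper).

    Returns the integer code indices and the quantized vectors.
    """
    indices = [nearest_code(f, codebook) for f in features]
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy codebook with three entries; real codebooks are learned and much larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

indices, quantized = quantize_sequence(features, codebook)
print(indices)  # [0, 1, 2]
```

A sequence model can then work on the integer indices instead of raw poses, which is the role the abstract assigns to the GPT.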

Code

Environment

PyTorch == 1.6.0

Data preparation

In our experiments, we use AIST++ for both training and evaluation. Please visit here to download the AIST++ annotations and unzip them into the './aist_plusplus_final/' folder, then visit here to download all original music pieces (wav) into './aist_plusplus_final/all_musics'. Next, set up the AIST++ API from here and download the required SMPL models from here. Create a folder './smpl', copy the downloaded 'male' SMPL model (the one with '_m' in its name) to 'smpl/SMPL_MALE.pkl', and finally run

./prepare_aistpp_data.sh

to produce the features for training and testing. Alternatively, if you do not wish to process the data yourself, download our preprocessed features from here and place them in the ./data folder.
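Before running the preprocessing script, it can help to confirm that the inputs sit where the instructions expect them. The following is a small sketch, not part of the repository: the helper name `missing_inputs` is hypothetical, and the path list only repeats the locations named above.

```python
from pathlib import Path
from typing import List

# Paths named in the data-preparation instructions, relative to the repo root.
REQUIRED_PATHS = [
    "aist_plusplus_final",
    "aist_plusplus_final/all_musics",
    "smpl/SMPL_MALE.pkl",
]


def missing_inputs(root: str = ".") -> List[str]:
    """Return the required paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in REQUIRED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_inputs(".")
    if missing:
        print("Missing before running ./prepare_aistpp_data.sh:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected inputs found.")
```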

Training

The training of Bailando comprises four steps, run in the following order. If you are using the Slurm workload manager, you can run the corresponding shell scripts directly. Otherwise, please remove the 'srun' parts. All our models are trained on a single NVIDIA V100 GPU. * A kind reminder: the quantization code does not support multi-GPU training.

Step 1: Train pose VQ-VAE (without global velocity)

sh srun.sh configs/sep_vqvae.yaml train [your node name] 1

Step 2: Train global velocity branch of pose VQ-VAE

sh srun.sh configs/sep_vqvae_root.yaml train [your node name] 1

Step 3: Train motion GPT

sh srun_gpt_all.sh configs/cc_motion_gpt.yaml train [your node name] 1

Step 4: Actor-Critic finetuning on target music

sh srun_actor_critic.sh configs/actor_critic.yaml train [your node name] 1
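For readers without Slurm, the sketch below builds a direct launch for the VQ-VAE steps and pins the process to one GPU, in line with the reminder that the quantization code is not meant for multi-GPU training. The `python -u main.py --config ... --train` form is taken from the issue reports further down this page; the `--eval` flag is inferred from the argument dump in one of those reports and should be treated as an assumption. The helper name `build_launch` is hypothetical. Steps 3 and 4 are left out because the scripts behind `srun_gpt_all.sh` and `srun_actor_critic.sh` are not shown here.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def build_launch(
    config: str, mode: str = "train", gpu: str = "0"
) -> Tuple[List[str], Dict[str, str]]:
    """Build the command and environment for a single-GPU, non-Slurm run."""
    if mode not in ("train", "eval"):
        raise ValueError("mode must be 'train' or 'eval'")
    cmd = ["python", "-u", "main.py", "--config", config, "--" + mode]
    env = dict(os.environ)
    # Expose exactly one GPU so the job cannot be spread over several devices.
    env["CUDA_VISIBLE_DEVICES"] = gpu
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_launch("configs/sep_vqvae.yaml", "train")
    print(" ".join(cmd))
    # Uncomment to actually start training from the repository root:
    # subprocess.run(cmd, env=env, check=True)
```

Restricting the visible devices is one simple way to avoid the per-GPU loss vector that several of the issue reports run into on multi-GPU machines.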

Evaluation

To test with our pretrained models, please download the weights from here (Google Drive), or download the four weights separately from [weight 1]|[weight 2]|[weight 3]|[weight4] (Nutstore, 坚果云), into the ./experiments folder.

1. Generate dancing results

To test the VQ-VAE (with or without global shift, as specified in the config):

sh srun.sh configs/sep_vqvae.yaml eval [your node name] 1

To test GPT:

sh srun_gpt_all.sh configs/cc_motion_gpt.yaml eval [your node name] 1

To test final results:

sh srun_actor_critic.sh configs/actor_critic.yaml eval [your node name] 1

2. Dance quality evaluations

After generating the dances in the step above, run the following commands.

Step 1: Extract the (kinetic & manual) features of all AIST++ motions (this only needs to be done once):

python extract_aist_features.py

Step 2: Compute the evaluation metrics:

python utils/metrics_new.py

It will show exactly the same values reported in the paper. To speed up the computation, comment out Line 184 of utils/metrics_new.py once the ground-truth features have been computed. To test another folder, change Line 182 to your destination, or kindly modify this code to a "non hard-coded" version :)
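One way to avoid editing hard-coded lines is to read the folder from the command line. The sketch below is not taken from utils/metrics_new.py: the flag names `--pred_root` and `--skip_gt_features`, the function `parse_metric_args`, and the example path are all hypothetical, and the actual metric calls would still need to be wired in by hand.

```python
import argparse
from typing import List, Optional


def parse_metric_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the evaluation options instead of editing hard-coded lines."""
    parser = argparse.ArgumentParser(description="Dance quality evaluation")
    parser.add_argument(
        "--pred_root",
        required=True,
        help="folder containing the generated dances to evaluate",
    )
    parser.add_argument(
        "--skip_gt_features",
        action="store_true",
        help="reuse ground-truth features computed by an earlier run",
    )
    return parser.parse_args(argv)


# Example call with a made-up folder name (hypothetical path).
args = parse_metric_args(["--pred_root", "experiments/my_run/eval", "--skip_gt_features"])
print(args.pred_root, args.skip_gt_features)
```

With something like this in place, the ground-truth feature step can be guarded by `if not args.skip_gt_features:` rather than by commenting a line in and out.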

Choreographic for music in the wild

TODO

Citation

@inproceedings{siyao2022bailando,
    title={Bailando: 3D dance generation via Actor-Critic GPT with Choreographic Memory},
    author={Siyao, Li and Yu, Weijiang and Gu, Tianpei and Lin, Chunze and Wang, Quan and Qian, Chen and Loy, Chen Change and Liu, Ziwei},
    booktitle={CVPR},
    year={2022}
}

License

Our code is released under the MIT License.

Comments

Some file missing in data.zip

Hi, thanks for sharing your code! I downloaded the data.zip that you released. However, I got an error here. I checked the data and found that aistpp_music_feat_7.5fps/mJB4.json is empty. Is it a mistake?

opened by xljh0520 3

Why extract audio features twice with different sampling rates?

Hi, Siyao~ Thanks for releasing and cleaning the code!!

May I ask why in the pre-processing part, the audio (music) features are extracted twice and with different sampling rates?

Precisely, in _prepro_aistpp.py, the audio features are extracted with the sampling rate 15360*2

While in _prepro_aistpp_music.py, the audio features are extracted with the sampling rate 15360*2/8
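The two rates are consistent with emitting one feature frame per hop of 512 audio samples: 15360*2 then gives 60 feature frames per second, and 15360*2/8 gives 7.5, which matches the aistpp_music_feat_7.5fps folder name mentioned in another issue on this page. The hop length of 512 is an assumption here (it is librosa's default) and is not stated on this page. A quick check of the arithmetic, with a hypothetical helper `feature_fps`:

```python
def feature_fps(sampling_rate: float, hop_length: int = 512) -> float:
    """Feature frames per second when one frame is emitted every `hop_length` samples."""
    return sampling_rate / hop_length


full_rate = 15360 * 2          # rate quoted for _prepro_aistpp.py
reduced_rate = 15360 * 2 / 8   # rate quoted for _prepro_aistpp_music.py

print(feature_fps(full_rate))     # 60.0
print(feature_fps(reduced_rate))  # 7.5
```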

opened by XiSHEN0220 3

The generated person sometimes disappears in the video?

Thank you for your work! I followed the steps of Choreographic for music in the wild and got the output videos of the model. However, I find that the person in the videos sometimes disappears, and there is no sound in the video. Q1: Is there a way to solve the problem? Q2: Will this kind of disappearance affect the location of the 3D keypoints?

opened by aleeyang 2

ValueError: only one element tensors can be converted to Python scalars

hey, I face this problem using the first step command

python -u main.py --config configs/sep_vqvae.yaml --train

THE OUTPUT IS

using SepVQVAE
We use bottleneck!
No motion regularization!
We use bottleneck!
No motion regularization!
train with AIST++ dataset!
test with AIST++ dataset!
{'structure': {'name': 'SepVQVAE', 'up_half': {'levels': 1, 'downs_t': [3], 'strides_t': [2], 'emb_width': 512, 'l_bins': 512, 'l_mu': 0.99, 'commit': 0.02, 'hvqvae_multipliers': [1], 'width': 512, 'depth': 3, 'm_conv': 1.0, 'dilation_growth_rate': 3, 'sample_length': 240, 'use_bottleneck': True, 'joint_channel': 3, 'vel': 1, 'acc': 1, 'vqvae_reverse_decoder_dilation': True, 'dilation_cycle': None}, 'down_half': {'levels': 1, 'downs_t': [3], 'strides_t': [2], 'emb_width': 512, 'l_bins': 512, 'l_mu': 0.99, 'commit': 0.02, 'hvqvae_multipliers': [1], 'width': 512, 'depth': 3, 'm_conv': 1.0, 'dilation_growth_rate': 3, 'sample_length': 240, 'use_bottleneck': True, 'joint_channel': 3, 'vel': 1, 'acc': 1, 'vqvae_reverse_decoder_dilation': True, 'dilation_cycle': None}, 'use_bottleneck': True, 'joint_channel': 3, 'l_bins': 512}, 'loss_weight': {'mse_weight': 1}, 'optimizer': {'type': 'Adam', 'kwargs': {'lr': 3e-05, 'betas': [0.5, 0.999], 'weight_decay': 0}, 'schedular_kwargs': {'milestones': [100, 200], 'gamma': 0.1}}, 'data': {'name': 'aist', 'train_dir': 'data/aistpp_train_wav', 'test_dir': 'data/aistpp_test_full_wav', 'seq_len': 240, 'data_type': 'None'}, 'testing': {'height': 540, 'width': 960, 'ckpt_epoch': 500}, 'expname': 'sep_vqvae', 'epoch': 500, 'batch_size': 32, 'save_per_epochs': 20, 'test_freq': 20, 'log_per_updates': 1, 'seed': 42, 'rotmat': False, 'cuda': True, 'global_vel': False, 'ds_rate': 8, 'move_train': 40, 'sample_code_length': 150, 'sample_code_rate': 16, 'analysis_sequence': [[126, 81]], 'config': 'configs/sep_vqvae.yaml', 'train': True, 'eval': False, 'visgt': False, 'anl': False, 'sample': False}
07/03/2022 03:06:44 Epoch: 1
Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 40, in main
    agent.train()
  File "/home/fuyang/project/Bailando/motion_vqvae.py", line 107, in train
    'loss': loss.item(),
ValueError: only one element tensors can be converted to Python scalars

My environment is pytorch 1.11.0+cu102 8xGPU NVIDIA TITAN Xp(12196MiB)

opened by aleeyang 2

about the run command

I am a beginner in DL. In sh srun.sh configs/sep_vqvae.yaml train [your node name] 1, what does '[your node name]' mean? Can you give me a more specific command? Thanks a lot!

opened by aleeyang 2

training problem

I hit an error at Step 1 when running python -u main.py --config configs/sep_vqvae.yaml --train

Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 40, in main
    agent.train()
  File "/share/yanzhen/Bailando/motion_vqvae.py", line 94, in train
    loss.backward()
  File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/autograd/__init__.py", line 166, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/root/anaconda3/envs/workspace/lib/python3.8/site-packages/torch/autograd/__init__.py", line 67, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs

After printing the loss, it looks like tensor([0.2667, 0.2735, 0.2687, 0.2584, 0.2701, 0.2697, 0.2571, 0.2658], device='cuda:0', grad_fn=<GatherBackward>), so do I need to take a mean or a sum?

However, even if I take the mean, the training still seems problematic. The loss decreases normally, but in the eval stage the output quants are all zero. Any suggestions?

The training log is attached for reference.

log.txt

@lisiyao21

opened by haofanwang 2

Processed data

Hi, there is no link to the processed data, although the README says: "Otherwise, directly download our preprocessed feature from here as ./data folder if you don't wish to process the data."

Can you add the data link? Thanks!

opened by LaLaLailalai 2

where is the definition of "from .utils.torch_utils import assert_shape"

Hi @lisiyao21, thank you for releasing the code! Where is the definition of "from .utils.torch_utils import assert_shape"? https://github.com/lisiyao21/Bailando/blob/27fe2b63896a2e31928b22944bac10455413263e/models/encdec.py#L4

opened by zhangsanfeng86 2

Visualize gt error.

I want to compare the generated results with the ground truth. It seems that the code also supports visualizing the ground truth by passing the visgt parameter. However, when I call the script with the visgt parameter, the code does not seem to be fully implemented: the last two parameters of the visualizeAndWrite function are not set correctly. How should I set these two parameters (especially the last one, quants) to make the function execute correctly?

opened by miibotree 1

No such file or directory: '/mnt/lustre/share/lisiyao1/original_videos/aistpp-audio/'

Bailando/utils/functional.py", line 130, in img2video
    music_names = sorted(os.listdir(audio_dir))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/lustre/share/lisiyao1/original_videos/aistpp-audio/'

opened by donghaoye 1

about the data process

Hello siyao! I'm reading your code and I'm confused about the 'align' function in '_prepro_aistpp.py'. To make the length (time) of the music equal to that of the dance, you throw the extra features away. Is that reasonable? Why not use uniform sampling? Sorry for bothering you.
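For reference, the truncation strategy described in this question can be sketched as follows. This is not the code from _prepro_aistpp.py: the function name `align_by_truncation` and the list-based inputs are assumptions for illustration, and the sketch only shows the "drop the extra frames" behaviour the question refers to.

```python
from typing import List, Sequence, Tuple, TypeVar

M = TypeVar("M")
D = TypeVar("D")


def align_by_truncation(
    music: Sequence[M], dance: Sequence[D]
) -> Tuple[List[M], List[D]]:
    """Cut both sequences to the shorter length so frames pair up one-to-one."""
    length = min(len(music), len(dance))
    return list(music[:length]), list(dance[:length])


music_feat = [0.1, 0.2, 0.3, 0.4, 0.5]  # five music feature frames (toy data)
dance_pose = ["p0", "p1", "p2"]         # three pose frames (toy data)

m, d = align_by_truncation(music_feat, dance_pose)
print(len(m), len(d))  # 3 3
```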

opened by pengc02 0

Doubts about the Bailando model

You have built a very good model!

I also achieved very good results when working with your model, but some points are still not clear to me. Did you experience gradient explosion when implementing the Actor-Critic Learning module? My model converged in the first epoch and showed some improvement over GPT. However, during the subsequent iterations, L_AC increased significantly and the model could not continue to converge. The visualization results also became very strange.

Looking forward to your reply!

opened by WJ-Fifth 0

Problem with the node name.

Thank you for your perfect work. I tried to use the Choreographic for music in the wild, but when I ran the command, I didn't know how to set the node name. Can anyone help me?

Thank you very much.

opened by Xianjin111 0

Issues with data preprocess

Thanks for sharing your wonderful work! I wonder how you get the specific number, 15360 * 2, as the sampling rate for 60 FPS. Can you elaborate on how this specific rate is obtained through calculation?

Another concern is with beat extraction using librosa: in _prepro_aistpp_music.py, I found that onset_env, onset_beat, and tempogram are all zeros. Is this correct?

opened by KevinGoodman 0

Warning when evaluating

Hi, siyao! I ran the command python extract_aist_features.py to extract the (kinetic & manual) features of all AIST++ motions. However, I got a warning:

WARNING: You are using a SMPL model, with only 10 shape coefficients.

Do you know the reason?

opened by LinghaoChan 1

About "cc_motion_gpt.yaml"

Hi siyao, thanks for your great work. I have a question. When I train in Step 3 (Train motion GPT), an error occurs: "AttributeError: 'MCTall' object has no attribute 'training_data'". I checked "cc_motion_gpt.yaml" and found "need_not_train_data: True", which prevents "def _build_train_loader(self):" from running. Is that correct? Or should I change "need_not_train_data" to "false"?

opened by im1eon 1

