TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting
Project Page | YouTube | Paper
This is the official PyTorch implementation of the CVPR 2020 paper "TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting".
Environment
conda install pytorch torchvision cudatoolkit=<your cuda version>
conda install pyyaml scikit-image scikit-learn opencv
pip install -r requirements.txt
Data
Mixamo
Mixamo is a synthesized 3D character animation dataset.
- Download mixamo data here.
- Extract under data/mixamo
For directions on downloading the 3D Mixamo data, please refer to this link.
SoloDance
SoloDance is a collection of dancing videos from YouTube. We use DensePose to extract skeleton sequences from these videos for training.
- Download the extracted skeleton sequences here.
- Extract under data/solo_dance
The original videos can be downloaded here.
Preprocessing
To preprocess the two datasets above, run
sh scripts/preprocess.sh
Pretrained model
Download the pretrained models here.
Inference
- For Skeleton Extraction, please consider using a pose estimation library such as Detectron2. We require the input skeleton sequences to be stored as a NumPy .npy file:
  - The file should contain an array with shape 15 x 2 x length.
  - The first dimension (15) corresponds to the 15 body joints defined here.
  - The second dimension (2) corresponds to the x and y coordinates.
  - The third dimension (length) is the temporal dimension.
- For Motion Retargeting Network, we provide a sample command for inference:
# Replace the --checkpoint, --source, and --target values with your actual paths.
python infer_pair.py \
    --config configs/transmomo.yaml \
    --checkpoint transmomo_mixamo_36_800_24/checkpoints/autoencoder_00200000.pt \
    --source a.npy \
    --target b.npy \
    --source_width 1280 --source_height 720 \
    --target_height 1920 --target_width 1080
- For Skeleton-to-Video Rendering, please refer to Everybody Dance Now.
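The skeleton input format described above (a 15 x 2 x length array) can be checked before running inference. The following is a minimal sketch, not part of this repository: the helper names `check_skeleton_sequence` and `from_frame_major` are hypothetical, and the frame-major layout (length x 15 x 2) is only an assumption about what a pose estimator might output. It does not reorder joints, so the joints must already follow the 15-joint definition referenced above.

```python
import numpy as np

NUM_JOINTS = 15  # body joints per frame, as required by the input format
NUM_COORDS = 2   # x and y coordinates


def check_skeleton_sequence(seq):
    """Return seq as an array, raising ValueError unless its shape is 15 x 2 x length."""
    seq = np.asarray(seq)
    if seq.ndim != 3 or seq.shape[0] != NUM_JOINTS or seq.shape[1] != NUM_COORDS:
        raise ValueError(
            f"expected shape ({NUM_JOINTS}, {NUM_COORDS}, length), got {seq.shape}"
        )
    if seq.shape[2] == 0:
        raise ValueError("sequence contains no frames")
    if not np.all(np.isfinite(seq)):
        raise ValueError("sequence contains NaN or infinite coordinates")
    return seq


def from_frame_major(frames):
    """Convert an assumed (length, 15, 2) array into the (15, 2, length) layout."""
    frames = np.asarray(frames, dtype=np.float32)
    # Move the temporal axis from the front to the back.
    return check_skeleton_sequence(np.transpose(frames, (1, 2, 0)))
```

A converted sequence can then be written with `np.save("a.npy", from_frame_major(frames))` and passed to `--source` or `--target`.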
Training
To train the Motion Retargeting Network, run
python train.py --config configs/transmomo.yaml
To train on the SoloDance dataset, run
python train.py --config configs/transmomo_solo_dance.yaml
Testing
To evaluate the motion retargeting MSE, first generate the retargeted motions with
# Replace --config with the config used for training, and --checkpoint and --out_dir with your actual paths.
python test.py \
    --config configs/transmomo.yaml \
    --checkpoint transmomo_mixamo_36_800_24/checkpoints/autoencoder_00200000.pt \
    --out_dir transmomo_mixamo_36_800_24_results
Then compute the MSE with
# Replace --in_dir with the output directory from the previous step.
python scripts/compute_mse.py \
    --in_dir transmomo_mixamo_36_800_24_results
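For readers unfamiliar with the metric, the sketch below shows how a mean squared error between a retargeted skeleton sequence and its ground truth could be computed. It illustrates the formula only and is not the logic of scripts/compute_mse.py, whose file layout and any normalization are not described here; the function name `mean_squared_error` is hypothetical.

```python
import numpy as np


def mean_squared_error(pred, target):
    """Average squared difference over all joints, coordinates, and frames."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    return float(np.mean((pred - target) ** 2))
```

Both arguments are assumed to share one shape, for example 15 x 2 x length for 2D sequences.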
Project Structure
transmomo.pytorch
├── configs - configuration files
├── data - place for storing data
├── docs - documentation
├── lib
│ ├── data.py - datasets and data loaders
│ ├── networks - encoders, decoders, discriminators, etc.
│ ├── trainer.py - training pipeline
│ ├── loss.py - loss functions
│ ├── operation.py - operations, e.g. rotation, projection, etc.
│ └── util - utility functions
├── out - place for storing output
├── infer_pair.py - perform motion retargeting
├── render_interpolate.py - perform motion and body interpolation
├── scripts - scripts for data processing and experiments
├── test.py - test MSE
└── train.py - main entry point for training
TODOs
- Detailed documentation
- Add example files
- Release in-the-wild dancing video dataset (unannotated)
- Tool for visualizing Mixamo test error
- Tool for converting keypoint formats
Citation
Z. Yang*, W. Zhu*, W. Wu*, C. Qian, Q. Zhou, B. Zhou, C. C. Loy. "TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. (* indicates equal contribution.)
BibTeX:
@inproceedings{transmomo2020,
title={TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting},
author={Yang, Zhuoqian and Zhu, Wentao and Wu, Wayne and Qian, Chen and Zhou, Qiang and Zhou, Bolei and Loy, Chen Change},
booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2020}
}
Acknowledgement
This repository is partly based on Rundi Wu's Learning Character-Agnostic Motion for Motion Retargeting in 2D and Xun Huang's MUNIT: Multimodal UNsupervised Image-to-image Translation. The skeleton-to-rendering part is based on Everybody Dance Now. We sincerely thank them for their inspiration and contribution to the community.