This is the official PyTorch implementation of the CVPR 2020 paper "TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting".

Zhuoqian Yang

Last update: Dec 11, 2022

Overview

TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting

Project Page | YouTube | Paper

Environment

conda install pytorch torchvision cudatoolkit=<your cuda version>
conda install pyyaml scikit-image scikit-learn opencv
pip install -r requirements.txt

Data

Mixamo

Mixamo is a synthesized 3D character animation dataset.

Download the Mixamo data here.
Extract under data/mixamo

For directions on downloading 3D Mixamo data, please refer to this link.

SoloDance

SoloDance is a collection of dancing videos from YouTube. We use DensePose to extract skeleton sequences from these videos for training.

Download the extracted skeleton sequences here.
Extract under data/solo_dance

The original videos can be downloaded here.

Preprocessing

Run sh scripts/preprocess.sh to preprocess the two datasets above.

Pretrained model

Download the pretrained models here.

Inference

For Skeleton Extraction, please consider using a pose estimation library such as Detectron2. We require the input skeleton sequences to be in the format of a numpy .npy file:
- The file should contain an array with shape 15 x 2 x length.
- The first dimension (15) corresponds to the 15 body joints defined here.
- The second dimension (2) corresponds to x and y coordinates.
- The third dimension (length) is the temporal dimension.
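
The format above can be checked before running inference. The following is a minimal sketch, assuming per-frame keypoints are already available as 15 x 2 arrays; the helper names validate_skeleton_sequence and frames_to_sequence are hypothetical and not part of this repository.

```python
import numpy as np

NUM_JOINTS = 15  # first dimension required by the README
NUM_COORDS = 2   # second dimension: x and y


def validate_skeleton_sequence(seq):
    """Return seq as float32, or raise ValueError if it is not (15, 2, length)."""
    seq = np.asarray(seq, dtype=np.float32)
    if seq.ndim != 3 or seq.shape[:2] != (NUM_JOINTS, NUM_COORDS):
        raise ValueError(f"expected shape (15, 2, length), got {seq.shape}")
    if seq.shape[2] == 0:
        raise ValueError("sequence must contain at least one frame")
    if not np.isfinite(seq).all():
        raise ValueError("sequence contains NaN or infinite values")
    return seq


def frames_to_sequence(frames):
    """Stack per-frame (15, 2) keypoint arrays along a new last (temporal) axis."""
    stacked = np.stack([np.asarray(f, dtype=np.float32) for f in frames], axis=-1)
    return validate_skeleton_sequence(stacked)


# Placeholder data: four frames of zeros stand in for real keypoints.
sequence = frames_to_sequence([np.zeros((15, 2)) for _ in range(4)])
print(sequence.shape)  # (15, 2, 4)
# np.save("a.npy", sequence)  # uncomment to write the input file
```
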

For the Motion Retargeting Network, we provide a sample command for inference:

# Replace the checkpoint, source and target paths with actual paths.
python infer_pair.py \
    --config configs/transmomo.yaml \
    --checkpoint transmomo_mixamo_36_800_24/checkpoints/autoencoder_00200000.pt \
    --source a.npy \
    --target b.npy \
    --source_width 1280 --source_height 720 \
    --target_height 1920 --target_width 1080
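
When many source/target pairs need to be processed, the same command can be assembled programmatically. This is a small sketch that only builds the argument list from the flags shown above; build_infer_command is a hypothetical helper, and passing the list to subprocess.run is left to the caller.

```python
import shlex


def build_infer_command(config, checkpoint, source, target, source_size, target_size):
    """Build the infer_pair.py argument list; sizes are (width, height) in pixels."""
    source_width, source_height = source_size
    target_width, target_height = target_size
    return [
        "python", "infer_pair.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--source", source,
        "--target", target,
        "--source_width", str(source_width),
        "--source_height", str(source_height),
        "--target_width", str(target_width),
        "--target_height", str(target_height),
    ]


command = build_infer_command(
    config="configs/transmomo.yaml",
    checkpoint="transmomo_mixamo_36_800_24/checkpoints/autoencoder_00200000.pt",
    source="a.npy",            # replace with actual path
    target="b.npy",            # replace with actual path
    source_size=(1280, 720),
    target_size=(1080, 1920),
)
print(shlex.join(command))     # shell-quoted form, suitable for logging
```
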

For Skeleton-to-Video Rendering, please refer to Everybody Dance Now.

Training

To train the Motion Retargeting Network, run

python train.py --config configs/transmomo.yaml

To train on the SoloDance dataset, run

python train.py --config configs/transmomo_solo_dance.yaml

Testing

To test motion retargeting MSE, first generate the retargeted motions with

# Replace the config, checkpoint and output directory with the actual paths used for training.
python test.py \
    --config configs/transmomo.yaml \
    --checkpoint transmomo_mixamo_36_800_24/checkpoints/autoencoder_00200000.pt \
    --out_dir transmomo_mixamo_36_800_24_results

And then compute MSE by

# Replace --in_dir with the output directory from the previous step.
python scripts/compute_mse.py \
    --in_dir transmomo_mixamo_36_800_24_results
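
For reference, the quantity being reported is a mean squared error between retargeted and ground-truth motions. The sketch below shows the plain definition on two arrays of equal shape; it is an illustration only, and scripts/compute_mse.py may apply normalization or averaging that this example does not. The function name motion_mse is hypothetical.

```python
import numpy as np


def motion_mse(predicted, ground_truth):
    """Mean squared error averaged over all joints, coordinates and frames."""
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    if predicted.shape != ground_truth.shape:
        raise ValueError(
            f"shape mismatch: {predicted.shape} vs {ground_truth.shape}"
        )
    return float(np.mean((predicted - ground_truth) ** 2))


# Placeholder data: every coordinate is off by exactly 1, so the MSE is 1.
error = motion_mse(np.zeros((15, 2, 8)), np.ones((15, 2, 8)))
print(error)  # 1.0
```
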

Project Structure

transmomo.pytorch
├── configs - configuration files
├── data - place for storing data
├── docs - documentation
├── lib
│   ├── data.py - datasets and data loaders
│   ├── networks - encoders, decoders, discriminators, etc.
│   ├── trainer.py - training pipeline
│   ├── loss.py - loss functions
│   ├── operation.py - operations, e.g. rotation, projection, etc.
│   └── util - utility functions
├── out - place for storing output
├── infer_pair.py - perform motion retargeting
├── render_interpolate.py - perform motion and body interpolation
├── scripts - scripts for data processing and experiments
├── test.py - test MSE
└── train.py - main entry point for training

TODOs

Detailed documentation
Add example files
Release in-the-wild dancing video dataset (unannotated)
Tool for visualizing Mixamo test error
Tool for converting keypoint formats

Citation

Z. Yang*, W. Zhu*, W. Wu*, C. Qian, Q. Zhou, B. Zhou, C. C. Loy. "TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. (* indicates equal contribution.)

BibTeX:

@inproceedings{transmomo2020,
  title={TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting},
  author={Yang, Zhuoqian and Zhu, Wentao and Wu, Wayne and Qian, Chen and Zhou, Qiang and Zhou, Bolei and Loy, Chen Change},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020}
}

Acknowledgement

This repository is partly based on Rundi Wu's Learning Character-Agnostic Motion for Motion Retargeting in 2D and Xun Huang's MUNIT: Multimodal UNsupervised Image-to-image Translation. The skeleton-to-rendering part is based on Everybody Dance Now. We sincerely thank them for their inspiration and contribution to the community.

Comments

About Triplet Margin Loss

Hi, thank you for releasing the code!

I have a few questions about the triplet loss function described in Section 3.2.2. It seems the triplet loss evaluates how well a generated pose sequence compares to a 'real' sequence. The anchor and positive samples in a triplet are two consecutive results produced by the body encoder with a 'real' sequence as input, while the negative sample is produced by the body encoder from 'fake' or 'generated' results. According to Eq. 4 and Eq. 5, the sequences produced by feeding both the 'original' input and the 'limb-scaled' input can be treated as 'real' output. Can the anchor and positive samples therefore be selected from both streams of your proposed model? Is it correct that the triplet loss guides the body encoder to generate the 'same' body structure when the scale of the input pose changes?

BTW, is there a bug in the calculation of the temporal pairwise cosine similarity of seqs_b in Line 43 of loss.py? Shouldn't the temporal similarity also be calculated within seqs_b to obtain the similarity between an 'anchor' and a 'positive' sample?

https://github.com/yzhq97/transmomo.pytorch/blob/5d766fb16f511b77f446abda3697b626595a2b2d/lib/loss.py#L42-L44

Thanks!

opened by AndrewChiyz 1

Questions about latent space interpolation results

Hi, first of all, thank you for releasing such great code! My question about the latent space interpolation is how the background of the interpolated pose changes. For example, in Figure 8 of the paper, I can see that body structures are successfully interpolated, but what about the synthesized frames conditioned on them? It would be helpful if you could provide or describe the synthesized video frames. (Is it just a mixture of two frames?)

opened by kangyeolk 1

Error when running infer_pair.py

Hello, I am interested in running your code. The test.py script runs fine; however, when I try to run infer_pair.py with the following command

python infer_pair.py --config configs/transmomo.yaml --checkpoint transmomo_mixamo_36_800_24/checkpoints/autoencoder_00200000.pt --source data/mixamo/36_800_24/test_random_rotate/SPORTY_GRANY/Pulling_A_Rope/Pulling_A_Rope.npy --target "data/mixamo/36_800_24/test_random_rotate/TY/Golf_Tee_Up_(1)/Golf_Tee_Up_(1).npy" --source_width 1280 --source_height 720 --target_height 1080 --target_width 1920

I get the following error

Traceback (most recent call last):
  File "infer_pair.py", line 87, in <module>
    main(config, args)
  File "infer_pair.py", line 77, in main
    x_cross = postprocess(x_cross, mean_pose, std_pose, unit=1.0, start=x_src_start)
  File "/home/rk/projects/transmomo.pytorch/lib/util/motion.py", line 27, in postprocess
    motion = globalize_motion(motion, start=start)
  File "/home/rk/projects/transmomo.pytorch/lib/util/motion.py", line 100, in globalize_motion
    centers += start.reshape([2, 1])
ValueError: cannot reshape array of size 3 into shape (2,1)

The same error occurs with other pairs of motions.

Is there something wrong? Thanks.

opened by RusticKey 1
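
One reading of the traceback above, offered as an assumption rather than a confirmed diagnosis: the start vector is taken as motion[8, :, 0], so an input array with three values per joint (for example 15 x 3 x length) yields a start of size 3, which cannot be reshaped to (2, 1). The sketch below reproduces that mismatch on placeholder data and shows that keeping only the first two coordinates gives the 15 x 2 x length layout described under Inference. The helper keep_xy is hypothetical, and dropping the third value is only appropriate if it is a depth or confidence channel that the 2D pipeline does not need.

```python
import numpy as np


def keep_xy(motion):
    """Return the first two coordinates of a (joints, C, length) array, C >= 2."""
    motion = np.asarray(motion)
    if motion.ndim != 3 or motion.shape[1] < 2:
        raise ValueError(f"expected (joints, C >= 2, length), got {motion.shape}")
    return motion[:, :2, :]


motion_3d = np.zeros((15, 3, 64))        # placeholder: three values per joint
start_3d = motion_3d[8, :, 0]            # size 3, so reshape([2, 1]) would fail
start_2d = keep_xy(motion_3d)[8, :, 0]   # size 2, so reshape([2, 1]) works
print(start_3d.shape, start_2d.reshape([2, 1]).shape)  # (3,) (2, 1)
```
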

GPU memory

Thanks for your excellent work. I would like to know how much GPU memory is needed during the training and testing phases, respectively, and how long it takes to train this network.

opened by swrdZWJ 1

What's the 3d keyjoints format?

Hello, thank you for sharing the code.

I am trying to get 3D key joints by calling reconstruct3d() in network.py. It returns an array of shape (45, length).

What's the order of these 45 data points?

Thanks

opened by FredericaLee 1

Which point is the center for limb scaling?

Q1: The results are amazing regardless of whether the person is small or large. However, I can't find where the code chooses the center point for scaling the structure; if all limbs shift together, that does not seem reasonable. Also, if each target skeleton is scaled independently frame by frame, will the generated result jitter?

Q2: For all target videos, do we need just one common model to generate results?

opened by ailias 1

Body_25 format from Detectron

I have fetched the keypoints of an image/video using Detectron2, as suggested in the README, but the format is COCO. This project expects the input to be in Body_25 format. How do I convert COCO to Body_25?

Also, how do I visualize the output of infer.py? I know that I'm supposed to use motion2video utility, but it doesn't seem to work directly on the output motion of infer.py. I am not sure what format it's expecting the data in.

opened by viggyr 0
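
A possible conversion for the question above, sketched under two assumptions: the 15 joints expected by this project are the first 15 joints of OpenPose's BODY_25 layout, and the COCO keypoints arrive as an array of shape (length, 17, 2) in the standard COCO order. COCO has no neck or mid-hip joint, so both are computed as midpoints, matching the midpoint lines in the preprocess_test snippet quoted further down. The function name coco_to_body15 is hypothetical.

```python
import numpy as np

# Target joint index (BODY_25, first 15) -> COCO-17 joint index.
# Joints 1 (neck) and 8 (mid-hip) are absent in COCO and are computed below.
BODY15_FROM_COCO = {
    0: 0,                      # nose
    2: 6, 3: 8, 4: 10,         # right shoulder, elbow, wrist
    5: 5, 6: 7, 7: 9,          # left shoulder, elbow, wrist
    9: 12, 10: 14, 11: 16,     # right hip, knee, ankle
    12: 11, 13: 13, 14: 15,    # left hip, knee, ankle
}


def coco_to_body15(coco_seq):
    """Convert (length, 17, 2) COCO keypoints to a (15, 2, length) array."""
    coco_seq = np.asarray(coco_seq, dtype=np.float32)
    if coco_seq.ndim != 3 or coco_seq.shape[1:] != (17, 2):
        raise ValueError(f"expected (length, 17, 2), got {coco_seq.shape}")
    out = np.zeros((15, 2, coco_seq.shape[0]), dtype=np.float32)
    for dst, src in BODY15_FROM_COCO.items():
        out[dst] = coco_seq[:, src, :].T   # (length, 2) -> (2, length)
    out[1] = (out[2] + out[5]) / 2         # neck: midpoint of the shoulders
    out[8] = (out[9] + out[12]) / 2        # mid-hip: midpoint of the hips
    return out
```

If the pose estimator also returns a confidence score per keypoint, slice it away first (for example keypoints[..., :2]) before calling the function.
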
I want to convert the video like this and save it as Npy:

I want to convert the video like this and save it as Npy:

The file should contain an array with shape 15 x 2 x length. The first dimension (15) corresponds to the 15 body joints defined here. The second dimension (2) corresponds to x and y coordinates. The third dimension (length) is the temporal dimension.

opened by LMR2018 0

How to generate 3D pose

Hi! I noticed in the paper that the 2D pose can be generated by projecting 3D into 2D, and that the retargeting inference this repo provides generates 2D poses. I wonder if there are functions that can directly generate a 3D pose (meaning every point has x, y, z coordinates instead of x, y). Thanks a lot!

opened by Alouette98 0

limb_norm() related issue

Hi @yzhq97, I am facing an issue with limb normalisation. I have been trying to normalise each of the limb lengths to a scale between 0 and 1, but it does not seem straightforward. So far I have understood that limb normalisation requires multiple dependent joints to be normalised. However, limb_norm(x_a, x_b) takes two different motions and scales x_a according to the skeleton structure of the target x_b. How can I scale limbs to between 0 and 1, given that I am not trying to adjust the structure to a target skeleton with limb_norm()?

opened by RahhulDd 0

some parameter's meaning in the code

I want to know what unit means in this function. I hope for your reply.

def preprocess_test(motion, meanpose, stdpose, unit=128):
    motion = motion * unit

    motion[1, :, :] = (motion[2, :, :] + motion[5, :, :]) / 2
    motion[8, :, :] = (motion[9, :, :] + motion[12, :, :]) / 2

    start = motion[8, :, 0]
    motion = localize_motion(motion)
    motion = normalize_motion(motion, meanpose, stdpose)

    return motion, start

opened by zq1335030905 5
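
Reading only the quoted snippet, unit appears to be a scalar that multiplies every coordinate before the motion is localized and normalized, so it sets the coordinate scale the rest of the pipeline sees. The sketch below reproduces just the first steps of the quoted function on placeholder data to show that effect; localize_motion and normalize_motion are left out because their definitions are not shown here, and scale_and_fix_joints is a hypothetical name.

```python
import numpy as np


def scale_and_fix_joints(motion, unit=128):
    """Reproduce only the scaling and midpoint steps of the quoted function."""
    motion = np.asarray(motion, dtype=np.float64) * unit  # new array, input untouched
    motion[1] = (motion[2] + motion[5]) / 2    # joint 1 := midpoint of joints 2 and 5
    motion[8] = (motion[9] + motion[12]) / 2   # joint 8 := midpoint of joints 9 and 12
    start = motion[8, :, 0].copy()             # joint 8 position in the first frame
    return motion, start


raw = np.ones((15, 2, 3))                      # placeholder motion
scaled_1, start_1 = scale_and_fix_joints(raw, unit=1.0)
scaled_128, start_128 = scale_and_fix_joints(raw, unit=128)
print(start_1, start_128)                      # start scales with unit as well
```

Because the returned start is scaled by the same factor, whichever unit is used for preprocessing presumably has to be consistent with the unit passed to the matching postprocess call.
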
