Overview

3D Human Pose Estimation with Spatial and Temporal Transformers

This repo is the official implementation for 3D Human Pose Estimation with Spatial and Temporal Transformers.

Video Demonstration
(demo video)

PoseFormer Architecture
(architecture figure)

Video Demo
  • 3D HPE on Human3.6M
  • 3D HPE on videos in-the-wild using PoseFormer

Our code is built on top of VideoPose3D.

Environment

The code is developed and tested under the following environment:

  • Python 3.8.2
  • PyTorch 1.7.1
  • CUDA 11.0

You can create the environment with:

conda env create -f poseformer.yml
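
Once the environment is created, a quick way to confirm that the expected PyTorch and CUDA versions are active (a minimal check added here for convenience, not part of the original repo):

import torch

print(torch.__version__)          # expected: 1.7.1
print(torch.version.cuda)         # expected: 11.0
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine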

Dataset

Our code is compatible with the dataset setup introduced by Martinez et al. and Pavllo et al. Please refer to VideoPose3D to set up the Human3.6M dataset (in the ./data directory).
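
After the data setup, the prepared files can be inspected as a rough sanity check; the file and key names below follow the standard VideoPose3D convention and are an assumption here, so adjust them if your setup differs.

import numpy as np

# Inspect the prepared Human3.6M files in ./data (VideoPose3D naming convention assumed).
poses_3d = np.load('data/data_3d_h36m.npz', allow_pickle=True)['positions_3d'].item()
poses_2d = np.load('data/data_2d_h36m_cpn_ft_h36m_dbb.npz', allow_pickle=True)['positions_2d'].item()

print(sorted(poses_3d.keys()))          # subjects, e.g. 'S1', 'S5', ...
action = next(iter(poses_2d['S1']))     # pick an arbitrary action of subject S1
print(poses_2d['S1'][action][0].shape)  # (num_frames, num_joints, 2) keypoints for the first camera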

Evaluating pre-trained models

We provide the pre-trained 81-frame model (CPN-detected 2D poses as input) here. To evaluate it, put it into the ./checkpoint directory and run:

python run_poseformer.py -k cpn_ft_h36m_dbb -f 81 -c checkpoint --evaluate detected81f.bin

We also provide the pre-trained 81-frame model (ground-truth 2D poses as input) here. To evaluate it, put it into the ./checkpoint directory and run:

python run_poseformer.py -k gt -f 81 -c checkpoint --evaluate gt81f.bin
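
If evaluation fails to start, it can help to verify that the downloaded checkpoint deserializes at all; the key names inside (e.g. 'model_pos') follow the VideoPose3D checkpoint format and are an assumption here.

import torch

# Sanity-check that the downloaded checkpoint file loads.
ckpt = torch.load('checkpoint/gt81f.bin', map_location='cpu')
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))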

Training new models

  • To train a model from scratch (CPN-detected 2D poses as input), run:
python run_poseformer.py -k cpn_ft_h36m_dbb -f 27 -lr 0.0001 -lrd 0.99

-f controls how many frames are used as input. 27 frames achieves 47.0 mm and 81 frames achieves 44.3 mm (MPJPE). A sketch of the -lr/-lrd learning-rate schedule is given after this list.

  • To train a model from scratch (ground-truth 2D poses as input), run:
python run_poseformer.py -k gt -f 81 -lr 0.0001 -lrd 0.99

81 frames achieves 31.3 mm (MPJPE).
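
For reference, -lr sets the initial learning rate and -lrd the per-epoch exponential decay factor. A minimal sketch of that schedule (a simplification of the VideoPose3D-style training loop, not the repository's exact code):

# Exponential learning-rate schedule implied by -lr / -lrd.
initial_lr = 0.0001  # -lr
lr_decay = 0.99      # -lrd

lr = initial_lr
for epoch in range(1, 11):
    print(f'epoch {epoch}: lr = {lr:.6f}')
    # ... one training epoch at the current learning rate would run here ...
    lr *= lr_decay   # decay applied once per epoch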

Visualization and other functions

We keep our code consistent with VideoPose3D. Please refer to their project page for further information.

Bibtex

If you find our work useful in your research, please consider citing:

@article{zheng20213d,
title={3D Human Pose Estimation with Spatial and Temporal Transformers},
author={Zheng, Ce and Zhu, Sijie and Mendieta, Matias and Yang, Taojiannan and Chen, Chen and Ding, Zhengming},
journal={arXiv preprint arXiv:2103.10455},
year={2021}
}

Acknowledgement

Part of our code is borrowed from VideoPose3D. We thank the authors for releasing their code.

Comments
  • Could you please provide the training log?

    I followed the settings in your paper and trained the model with 27 frames, but I can't reach the performance reported in the paper. This is my training log:

    [1] time 20.00 lr 0.000200 3d_train 79.10 3d_valid 64.31
    [2] time 19.75 lr 0.000196 3d_train 42.98 3d_valid 59.67
    [3] time 19.77 lr 0.000192 3d_train 34.62 3d_valid 58.55
    [4] time 19.88 lr 0.000188 3d_train 29.96 3d_valid 57.28
    [5] time 19.83 lr 0.000184 3d_train 26.81 3d_valid 56.05
    [6] time 19.86 lr 0.000181 3d_train 24.47 3d_valid 55.73
    [7] time 19.82 lr 0.000177 3d_train 22.63 3d_valid 54.94
    [8] time 19.88 lr 0.000174 3d_train 21.14 3d_valid 54.19
    [9] time 19.84 lr 0.000170 3d_train 19.91 3d_valid 54.72
    [10] time 19.86 lr 0.000167 3d_train 18.87 3d_valid 53.88
    [11] time 19.77 lr 0.000163 3d_train 17.99 3d_valid 54.01
    [12] time 19.74 lr 0.000160 3d_train 17.25 3d_valid 54.11
    [13] time 19.68 lr 0.000157 3d_train 16.62 3d_valid 54.26
    [14] time 19.74 lr 0.000154 3d_train 16.08 3d_valid 54.30
    [15] time 19.66 lr 0.000151 3d_train 15.60 3d_valid 54.15
    [16] time 19.80 lr 0.000148 3d_train 15.17 3d_valid 54.57
    [17] time 19.85 lr 0.000145 3d_train 14.79 3d_valid 54.10
    [18] time 19.95 lr 0.000142 3d_train 14.45 3d_valid 53.76
    [19] time 19.80 lr 0.000139 3d_train 14.15 3d_valid 53.93
    [20] time 19.84 lr 0.000136 3d_train 13.86 3d_valid 54.03
    [21] time 19.84 lr 0.000134 3d_train 13.60 3d_valid 54.89
    [22] time 19.86 lr 0.000131 3d_train 13.36 3d_valid 54.32
    [23] time 19.87 lr 0.000128 3d_train 13.14 3d_valid 53.97
    [24] time 19.82 lr 0.000126 3d_train 12.94 3d_valid 54.72
    [25] time 19.78 lr 0.000123 3d_train 12.75 3d_valid 54.38
    [26] time 19.86 lr 0.000121 3d_train 12.58 3d_valid 54.50
    [27] time 19.85 lr 0.000118 3d_train 12.42 3d_valid 54.18
    [28] time 19.82 lr 0.000116 3d_train 12.25 3d_valid 54.38
    [29] time 19.81 lr 0.000114 3d_train 12.13 3d_valid 54.09
    [30] time 19.83 lr 0.000111 3d_train 11.99 3d_valid 55.16
    [31] time 19.82 lr 0.000109 3d_train 11.87 3d_valid 54.22
    [32] time 19.84 lr 0.000107 3d_train 11.75 3d_valid 54.56
    [33] time 19.81 lr 0.000105 3d_train 11.64 3d_valid 54.03
    [34] time 19.84 lr 0.000103 3d_train 11.54 3d_valid 54.39
    [35] time 19.81 lr 0.000101 3d_train 11.43 3d_valid 55.32
    [36] time 19.80 lr 0.000099 3d_train 11.33 3d_valid 54.34
    [37] time 19.81 lr 0.000097 3d_train 11.24 3d_valid 55.05
    [38] time 19.80 lr 0.000095 3d_train 11.15 3d_valid 54.58
    [39] time 19.85 lr 0.000093 3d_train 11.07 3d_valid 54.72
    [40] time 19.79 lr 0.000091 3d_train 10.99 3d_valid 55.01
    [41] time 19.74 lr 0.000089 3d_train 10.91 3d_valid 54.71
    [42] time 19.81 lr 0.000087 3d_train 10.83 3d_valid 54.52
    [43] time 19.83 lr 0.000086 3d_train 10.76 3d_valid 54.47
    [44] time 19.87 lr 0.000084 3d_train 10.69 3d_valid 54.51
    [45] time 19.86 lr 0.000082 3d_train 10.63 3d_valid 54.54
    [46] time 19.80 lr 0.000081 3d_train 10.56 3d_valid 55.12
    [47] time 19.86 lr 0.000079 3d_train 10.50 3d_valid 55.35
    [48] time 19.78 lr 0.000077 3d_train 10.44 3d_valid 55.37
    [49] time 19.68 lr 0.000076 3d_train 10.39 3d_valid 54.62
    [50] time 19.70 lr 0.000074 3d_train 10.32 3d_valid 55.17

    opened by flyyyyer 13
  • Learning Curves

    Hi, could you provide the train and test learning curves for 27 frames, or say at which epoch it should reach ~47 mm? I ran your code with python run_poseformer.py -k cpn_ft_h36m_dbb -f 27 -lr 0.00004 -lrd 0.99; up to epoch 28, the 3d_valid is only 51.88954. Sorry, my GPU is weak, so training takes me a long time.

    Thanks, best regards.

    opened by hankhuynh1011 7
  • ValueError on Custom Video Visualization

    ValueError: non-broadcastable output operand with shape (1621,1,17,3) doesn't match the broadcast shape (1621,1621,17,3)

    $ python run_poseformer.py -k cpn_ft_h36m_dbb -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S1 --viz-action Directions --viz-camera 0 --viz-video "/home/user/Desktop/Test/PoseFormer-main/data/Video/Directions.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60

    Loading dataset...
    Preparing data...
    Loading 2D detections...
    INFO: Receptive field: 81 frames
    INFO: Trainable parameter count: 9602885
    Loading checkpoint checkpoint/pretrained_h36m_cpn.bin
    INFO: Testing on 225676 frames
    Rendering...
    Traceback (most recent call last):
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/runpy.py", line 193, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/user/.vscode/extensions/ms-python.python-2021.9.1246542782/pythonFiles/lib/python/debugpy/main.py", line 45, in
        cli.main()
      File "/home/user/.vscode/extensions/ms-python.python-2021.9.1246542782/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
        run()
      File "/home/user/.vscode/extensions/ms-python.python-2021.9.1246542782/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
        runpy.run_path(target_as_str, run_name=compat.force_str("main"))
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/runpy.py", line 263, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/runpy.py", line 96, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/user/Desktop/Test/PoseFormer-main/run_poseformer.py", line 606, in
        prediction += trajectory
    ValueError: non-broadcastable output operand with shape (1621,1,17,3) doesn't match the broadcast shape (1621,1621,17,3)

    opened by nies14 6
  • Unable to resolve anaconda packages to set up the environment

    When running

    $ conda env create -f poseformer.yml

    I get this output:

    (base) D:\Desktop\PoseFormer-main>conda env create -f poseformer.yml
    Collecting package metadata (repodata.json): done
    Solving environment: failed

    ResolvePackageNotFound:

    • pyyaml==5.3.1=py38h7b6447c_1
    • mkl-service==2.3.0=py38he904b0f_0
    • xorg-xproto==7.0.31=h14c3975_1007
    • cudatoolkit==11.0.221=h6bb024c_0
    • libffi==3.2.1=hf484d3e_1007
    • wrapt==1.12.1=py38h7b6447c_1
    • matplotlib-base==3.3.2=py38h91b0d89_0
    • nss==3.57=he751ad9_0
    • libedit==3.1.20191231=h14c3975_1
    • readline==8.0=h7b6447c_0
    • tensorflow-base==2.2.0=mkl_py38h5059a2d_0
    • pillow==7.2.0=py38hb39fc2d_0
    • libtiff==4.1.0=h2733197_1
    • gst-plugins-base==1.14.5=h0935bb2_2
    • tornado==6.0.4=py38h7b6447c_1
    • expat==2.2.9=he6710b0_2
    • jpeg==9d=h516909a_0
    • lz4-c==1.9.2=he6710b0_1
    • nspr==4.29=he1b5a44_0
    • kiwisolver==1.2.0=py38hbf85e49_0
    • scipy==1.5.2=py38h0b6359f_0
    • fontconfig==2.13.1=h1056068_1002
    • psutil==5.7.2=py38h7b6447c_0
    • libwebp-base==1.1.0=h7b6447c_3
    • mysql-libs==8.0.21=hf3661c5_2
    • gmp==6.2.0=he1b5a44_2
    • python==3.8.2=hcf32534_0
    • cffi==1.14.0=py38h2e261b9_0
    • qt==5.12.9=h1f2b2cb_0
    • libsodium==1.0.18=h7b6447c_0
    • mkl_random==1.1.1=py38h0573a6f_0
    • libevent==2.1.10=hcdb4288_2
    • jasper==1.900.1=hd497a04_4
    • krb5==1.17.1=h173b8e3_0
    • freetype==2.10.2=h5ab3b9f_0
    • libllvm10==10.0.1=hbcb73fb_5
    • openssl==1.1.1i=h27cfd23_0
    • libuuid==2.32.1=h14c3975_1000
    • libpng==1.6.37=hbc83047_0
    • pcre==8.44=he6710b0_0
    • libiconv==1.16=h516909a_0
    • pyzmq==19.0.2=py38he6710b0_1
    • xorg-kbproto==1.0.7=h14c3975_1002
    • tensorflow==2.2.0=mkl_py38h6d3daf0_0
    • brotlipy==0.7.0=py38h7b6447c_1000
    • certifi==2020.12.5=py38h06a4308_0
    • pytorch==1.7.1=py3.8_cuda11.0.221_cudnn8.0.5_0
    • aiohttp==3.7.2=py38h1e0a361_0
    • sqlite==3.33.0=h62c20be_0
    • lame==3.100=h7b6447c_0
    • libgcc-ng==9.1.0=hdf63c60_0
    • ld_impl_linux-64==2.33.1=h53a641e_7
    • libxcb==1.14=h7b6447c_0
    • nettle==3.4.1=hbb512f6_0
    • xorg-libx11==1.6.12=h516909a_0
    • icu==67.1=he1b5a44_0
    • cairo==1.16.0=h3fc0475_1005
    • pixman==0.38.0=h7b6447c_0
    • ca-certificates==2021.1.19=h06a4308_0
    • libxkbcommon==0.10.0=he1b5a44_0
    • ninja==1.10.1=py38hfd86e86_0
    • xz==5.2.5=h7b6447c_0
    • zeromq==4.3.2=he6710b0_3
    • numpy-base==1.19.2=py38hfa32c7d_0
    • gnutls==3.6.13=h79a8f9a_0
    • libprotobuf==3.13.0=hd408876_0
    • dbus==1.13.16=hb2f20db_0
    • glib==2.63.1=h5a9c865_0
    • libuv==1.40.0=h7b6447c_0
    • libpq==12.3=h5513abc_0
    • gstreamer==1.14.5=h36ae1b5_2
    • multidict==4.7.5=py38h1e0a361_2
    • xorg-libice==1.0.10=h516909a_0
    • protobuf==3.13.0=py38he6710b0_1
    • mkl_fft==1.2.0=py38h23d657b_0
    • libstdcxx-ng==9.1.0=hdf63c60_0
    • yarl==1.6.2=py38h1e0a361_0
    • cython==0.29.21=py38he6710b0_0
    • x264==1!152.20180806=h7b6447c_0
    • libxml2==2.9.10=h68273f3_2
    • lcms2==2.11=h396b838_0
    • grpcio==1.33.2=py38heead2fc_0
    • harfbuzz==2.7.2=hee91db6_0
    • zstd==1.4.5=h9ceee32_0
    • h5py==2.10.0=py38hd6299e0_1
    • bzip2==1.0.8=h7b6447c_0
    • pandas==1.1.1=py38he6710b0_0
    • c-ares==1.16.1=h516909a_3
    • tk==8.6.10=hbc83047_0
    • xorg-xextproto==7.3.0=h14c3975_1002
    • yaml==0.2.5=h7b6447c_0
    • hdf5==1.10.6=hb1b8bf9_0
    • libclang==10.0.1=default_hb85057a_2
    • xorg-libsm==1.2.3=h84519dc_1000
    • xorg-libxext==1.3.4=h516909a_0
    • xorg-libxrender==0.9.10=h516909a_1002
    • openh264==2.1.1=h8b12597_0
    • xorg-renderproto==0.11.1=h14c3975_1002
    • libgfortran-ng==7.3.0=hdf63c60_0
    • ncurses==6.2=he6710b0_1
    • cryptography==3.1.1=py38h1ba5d50_0
    • zlib==1.2.11=h7b6447c_3
    • graphite2==1.3.14=h23475e2_0

    Is there a way to fix this? Thanks!

    opened by fortminors 3
  • Question about preprocessing MPII_INF_3DHP?

    Could you give me details about preprocessing the MPII_INF_3DHP dataset? I followed this script https://github.com/mkocabas/VIBE/blob/master/lib/data_utils/mpii3d_utils.py to preprocess the dataset and trained with lr=0.0004, batch_size=512, frame=9, but I got a much lower error (MPJPE: 50 mm) than reported in your paper. Is this correct?

    opened by leechaonan 2
  • question about preprocessing

    I notice you use "norm_coordinate_screnn" to preprocess the data, but I am not sure whether the input points are absolute coordinates on the original image. Since your method processes a single human pose, are the coordinates fed into "norm_coordinate_screnn" absolute image coordinates or already-processed coordinates?

    opened by shoutOutYangJie 1
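
    For context on the question above: in VideoPose3D, which this repo builds on, the 2D keypoints are expected as absolute pixel coordinates and are then mapped to roughly [-1, 1] before entering the network. The following sketch is modeled on VideoPose3D's normalize_screen_coordinates; the exact helper used here may differ.

    import numpy as np

    def normalize_screen_coordinates(X, w, h):
        # Map x from [0, w] to [-1, 1] and scale y by the same factor, preserving aspect ratio.
        # X is expected to hold absolute pixel coordinates, shape (..., 2).
        assert X.shape[-1] == 2
        return X / w * 2 - np.array([1, h / w])

    kp = np.array([[500.0, 400.0]])                          # absolute pixel coordinates (example)
    print(normalize_screen_coordinates(kp, w=1000, h=1000))  # -> [[ 0.  -0.2]]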
  • RuntimeError: CUDA out of memory

    When trying to run the pre-trained model and to train new models, I get the error below:

    RuntimeError: CUDA out of memory. Tried to allocate 366.00 MiB (GPU 0; 7.79 GiB total capacity; 4.50 GiB already allocated; 303.38 MiB free; 4.62 GiB reserved in total by PyTorch)

    Reducing the batch size also doesn't work.

    Processor: Core i9; RAM: 64 GB; GPU: RTX 3070

    The traceback:

    Traceback (most recent call last):
      File "run_poseformer.py", line 309, in
        predicted_3d_pos = model_pos_train(inputs_2d)
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/user/Desktop/Research/PoseFormer-main/common/model_poseformer.py", line 178, in forward
        x = self.Spatial_forward_features(x)
      File "/home/user/Desktop/Research/PoseFormer-main/common/model_poseformer.py", line 154, in Spatial_forward_features
        x = blk(x)
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/user/Desktop/Research/PoseFormer-main/common/model_poseformer.py", line 81, in forward
        x = x + self.drop_path(self.attn(self.norm1(x)))
      File "/home/user/anaconda3/envs/pose2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/user/Desktop/Research/PoseFormer-main/common/model_poseformer.py", line 56, in forward
        attn = (q @ k.transpose(-2, -1)) * self.scale
    RuntimeError: CUDA out of memory. Tried to allocate 366.00 MiB (GPU 0; 7.79 GiB total capacity; 4.50 GiB already allocated; 303.38 MiB free; 4.62 GiB reserved in total by PyTorch)

    opened by nies14 0
  • Question about 'spatial pose embedding'

    Dear author,

    Thanks for your great work! I have a simple question waiting for your reply.

    I saw that you create a learnable parameter initialized with zeros and designate it as the spatial pose embedding:

    self.Spatial_pos_embed = nn.Parameter(torch.zeros(1, num_joints, embed_dim_ratio))

    I wonder how this retains the positional information of the sequence?

    opened by zhywanna 1
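
    For context, a learnable positional embedding of this kind works by being added to the per-joint token embeddings before the transformer blocks; the zeros are only the initial value, and the parameter is updated by gradient descent during training. A minimal sketch (simplified, not the repository's exact forward pass):

    import torch
    import torch.nn as nn

    num_joints, embed_dim_ratio = 17, 32
    spatial_pos_embed = nn.Parameter(torch.zeros(1, num_joints, embed_dim_ratio))

    x = torch.randn(8, num_joints, embed_dim_ratio)  # (batch, joints, channels) joint embeddings
    x = x + spatial_pos_embed                        # each joint index gets its own learned offset
    # spatial_pos_embed receives gradients like any other weight, so it does not stay zero after training.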
  • The test results are of the wrong order of magnitude

    When I run the command "python run_poseformer.py -k cpn_ft_h36m_dbb -f 81 -c checkpoint --evaluate detected81f.bin" to evaluate the pre-trained 81-frame model, it produces the following result:

    (screenshot of the evaluation output)

    Is there a bug in this code please?

    opened by PJ-Hunter 1
  • Visualization of self-attentions

    Hi, I want to ask how the attention weights (weight i, j) in Fig. 5 are obtained. Do they mean Softmax(Query_j · Key_i / dimension^0.5), or Softmax(Query_j · Key_i / dimension^0.5) · Value_i? The latter is a vector, though.

    thx

    opened by erikervalid 1
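
    For reference, the attention weights visualized in such figures are usually the softmax-normalized scores alone, without multiplying by the values; reading Fig. 5 this way is an assumption here, not confirmed by the repository. A minimal sketch:

    import torch
    import torch.nn.functional as F

    # Scaled dot-product attention weights: A[i, j] = softmax_j(q_i . k_j / sqrt(d)).
    q = torch.randn(4, 64)   # 4 query tokens, dimension 64
    k = torch.randn(4, 64)   # 4 key tokens
    attn = F.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    print(attn.shape)        # torch.Size([4, 4]): one scalar weight per (query, key) pair
    print(attn.sum(dim=-1))  # each row sums to 1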