Video Autoencoder: self-supervised disentanglement of 3D structure and motion

Last update: Dec 22, 2022

This repository contains the code (in PyTorch) for the model introduced in the following paper:

Video Autoencoder: self-supervised disentanglement of 3D structure and motion
Zihang Lai, Sifei Liu, Alexei A. Efros, Xiaolong Wang
ICCV, 2021
[Paper] [Project Page] [12-min oral pres. video] [3-min supplemental video]

Citation

@inproceedings{Lai21a,
        title={Video Autoencoder: self-supervised disentanglement of 3D structure and motion},
        author={Lai, Zihang and Liu, Sifei and Efros, Alexei A and Wang, Xiaolong},
        booktitle={ICCV},
        year={2021}
}

Introduction
Data preparation
Training
Evaluation
Pretrained model

Introduction

We present Video Autoencoder for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the Video Autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera poses for each frame. These two representations will then be re-entangled for rendering the input video frames. Video Autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.

Dependencies

The following version requirements are not strict; they are simply the versions that we used.

Python (3.8.5)
PyTorch (1.7.1)
CUDA 11.0
Python packages, install with pip install -r requirements.txt

Data preparation

RealEstate10K:

Download the dataset from RealEstate10K.
Download the videos listed in the RealEstate10K dataset and decode them into frames. You might find the RealEstate10K_Downloader written by cashiwamochi helpful. Organize the data files into the following structure:

RealEstate10K/
    train/
        0000cc6d8b108390.txt
        00028da87cc5a4c4.txt
        ...
    test/
        000c3ab189999a83.txt
        000db54a47bd43fe.txt
        ...
dataset/
    train/
        0000cc6d8b108390/
            52553000.jpg
            52586000.jpg
            ...
        00028da87cc5a4c4/
            ...
    test/
        000c3ab189999a83/
        ...
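
As a quick sanity check on the layout above, the following minimal sketch lists video IDs that have a `.txt` file under `RealEstate10K/<split>/` but no decoded frames under `dataset/<split>/<id>/`. The function name `find_missing_videos` and the assumption that both folders share one parent directory are illustrative; this is not part of the repository.

```python
from pathlib import Path


def find_missing_videos(root, split="train"):
    """Return IDs that have a .txt file but no folder of .jpg frames.

    Assumes (hypothetically) that `root` contains both `RealEstate10K/`
    and `dataset/`, laid out as in the structure shown above.
    """
    root = Path(root)
    txt_dir = root / "RealEstate10K" / split
    frame_dir = root / "dataset" / split
    missing = []
    for txt in sorted(txt_dir.glob("*.txt")):
        video_id = txt.stem
        frames = frame_dir / video_id
        # A video counts as present only if its folder holds at least one frame.
        if not frames.is_dir() or not any(frames.glob("*.jpg")):
            missing.append(video_id)
    return missing


if __name__ == "__main__":
    for video_id in find_missing_videos(".", "train"):
        print("missing frames for", video_id)
```

Running it before training can surface videos that failed to download or decode.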

Subsample the training set to one-third of the original frame rate (so that the motion between frames is sufficiently large). You can use scripts/subsample_dataset.py.
A list of the video IDs that we used (10K for training and 5K for testing) is provided here:
1. Training video IDs and testing video IDs.
2. Note: the availability of videos may change over time.
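
The subsampling step can be illustrated with a minimal sketch that keeps every third frame of a video in timestamp order. It assumes frame files are named by integer timestamp, as in the structure above; the helper `subsample_frames` is hypothetical and is not necessarily how scripts/subsample_dataset.py works.

```python
from pathlib import Path


def subsample_frames(frame_names, step=3):
    """Keep every `step`-th frame after sorting by numeric timestamp.

    Assumes names like "52553000.jpg", where the stem is an integer.
    """
    ordered = sorted(frame_names, key=lambda name: int(Path(name).stem))
    return ordered[::step]


if __name__ == "__main__":
    frames = ["30.jpg", "10.jpg", "20.jpg", "40.jpg"]
    print(subsample_frames(frames))
```

Sorting numerically rather than lexically matters here, because timestamps with different digit counts would otherwise be ordered incorrectly.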

Matterport3D (this could be tricky):

Install habitat-api and habitat-sim. You need to use the following commits of each repo (see this SynSin issue for details):
1. habitat-sim: d383c2011bf1baab2ce7b3cd40aea573ad2ddf71
2. habitat-api: e94e6f3953fcfba4c29ee30f65baa52d6cea716e

Download the models from the Matterport3D dataset and the point nav datasets. You should have a dataset folder with the following data structure:

root_folder/
     mp3d/
         17DRP5sb8fy/
             17DRP5sb8fy.glb  
             17DRP5sb8fy.house  
             17DRP5sb8fy.navmesh  
             17DRP5sb8fy_semantic.ply
         1LXtFkjw3qL/
             ...
         1pXnuDYAj8r/
             ...
         ...
     pointnav/
         mp3d/
             ...

Walk-through videos for pretraining: We use a ShortestPathFollower function provided by the Habitat navigation package to generate episodes of tours of the rooms. See scripts/generate_matterport3d_videos.py for details.
Training and testing view synthesis pairs: we generally follow the same steps as the SynSin data instruction. The main difference is that we precompute all the image pairs. See scripts/generate_matterport3d_train_image_pairs.py and scripts/generate_matterport3d_test_image_pairs.py for details.

Replica:

Testing view synthesis pairs: This procedure is similar to the view synthesis pair generation for Matterport3D (step 4 above), with only the dataset changed. See scripts/generate_replica_test_image_pairs.py for details.

Configurations

Finally, change the data paths in configs/dataset.yaml to your data location.

Pre-trained models

Pre-trained model (RealEstate10K): Link
Pre-trained model (Matterport3D): Link

Training:

Use this script:

CUDA_VISIBLE_DEVICES=0,1 python train.py --savepath log/train --dataset RealEstate10K

Some optional arguments (default values in square brackets):

Select dataset: --dataset [RealEstate10K]
Interval between clip frames: --interval [1]
Change clip length: --clip_length [6]
Increase/decrease lr step: --lr_adj [1.0]
For Matterport3D finetuning, you need to set --clip_length 2, because the data are pairs of images.
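
For scripted runs, the options above can be assembled programmatically. This is a minimal sketch: the helper `build_train_command` is hypothetical, and it uses only the flags and defaults listed above.

```python
import shlex


def build_train_command(savepath, dataset="RealEstate10K", interval=1,
                        clip_length=6, lr_adj=1.0, gpus="0,1"):
    """Return (env, args) for launching train.py with the documented flags."""
    args = [
        "python", "train.py",
        "--savepath", savepath,
        "--dataset", dataset,
        "--interval", str(interval),
        "--clip_length", str(clip_length),
        "--lr_adj", str(lr_adj),
    ]
    env = {"CUDA_VISIBLE_DEVICES": gpus}
    return env, args


if __name__ == "__main__":
    # Matterport3D finetuning uses image pairs, hence clip_length=2.
    env, args = build_train_command("log/train", dataset="Matterport3D",
                                    clip_length=2)
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(args))
```

The returned pair can be passed to `subprocess.run(args, env={**os.environ, **env})` to launch the job.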

Evaluation:

1. Generate test results:

Use this script (for testing RealEstate10K):

CUDA_VISIBLE_DEVICES=0 python test_re10k.py --savepath log/model --resume log/model/checkpoint.tar --dataset RealEstate10K

or this script (for testing Matterport3D/Replica):

CUDA_VISIBLE_DEVICES=0 python test_mp3d.py --savepath log/model --resume log/model/checkpoint.tar --dataset Matterport3D

Some optional commands:

Select dataset: --dataset [RealEstate10K]
Max number of frames: --frame_limit [30]
Max number of sequences: --video_limit [100]
Use training set to evaluate: --train_set

Running this will generate an output folder where the results (videos and poses) are saved. To visualize the poses, use an odometry evaluation package such as evo. To evaluate the results quantitatively, see 2.1 and 2.2.

2.1 Quantitative Evaluation of synthesis results:

Use this script:

python eval_syn_re10k.py [OUTPUT_DIR] (for RealEstate10K)
python eval_syn_mp3d.py [OUTPUT_DIR] (for Matterport3D)

Optional commands:

Evaluate LPIPS: --lpips

2.2 Quantitative Evaluation of pose prediction results:

Use this script:

python eval_pose.py [POSE_DIR]

Contact

For any questions about the code or the paper, you can contact zihang.lai at gmail.com.
