PyTorch implementation of our method for high-resolution (e.g., 2048x1024) photorealistic video-to-video translation.

NVIDIA Corporation

Last update: Jan 1, 2023

Overview

vid2vid

Project | YouTube(short) | YouTube(full) | arXiv | Paper(full)

PyTorch implementation for high-resolution (e.g., 2048x1024) photorealistic video-to-video translation. It can be used for turning semantic label maps into photo-realistic videos, synthesizing people talking from edge maps, or generating human motions from poses. The core of video-to-video translation is image-to-image translation. Some of our work in that space can be found in pix2pixHD and SPADE.

Video-to-Video Synthesis
Ting-Chun Wang¹, Ming-Yu Liu¹, Jun-Yan Zhu², Guilin Liu¹, Andrew Tao¹, Jan Kautz¹, Bryan Catanzaro¹
¹NVIDIA Corporation, ²MIT CSAIL
In Neural Information Processing Systems (NeurIPS) 2018

Video-to-Video Translation

Label-to-Streetview Results

Edge-to-Face Results

Pose-to-Body Results

Frame Prediction Results

Prerequisites

Linux or macOS
Python 3
NVIDIA GPU + CUDA cuDNN
PyTorch 0.4

Getting Started

Installation

Install the Python libraries dominate and requests.

pip install dominate requests

If you plan to train with face datasets, please install dlib.

pip install dlib

If you plan to train with pose datasets, please install DensePose and/or OpenPose.
Clone this repo:

git clone https://github.com/NVIDIA/vid2vid
cd vid2vid

Docker image: if you have difficulty building the repo, a Docker image can be found in the docker folder.
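
As a quick sanity check before running the test scripts, the sketch below reports which of the Python packages named above cannot be imported. The helper name find_missing is ours, for illustration only, and is not part of the repo. It checks importability only, not package versions or the CUDA/cuDNN setup.

```python
import importlib.util
import sys

# Packages named in the prerequisites and installation steps above.
REQUIRED = ["torch", "dominate", "requests"]
# Only needed when training with face datasets.
OPTIONAL = ["dlib"]


def find_missing(names):
    """Return the names that cannot be imported in this environment.

    Hypothetical helper for illustration; it does not import the packages,
    it only asks the import system whether each one can be located.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    print("Missing required packages:", find_missing(REQUIRED) or "none")
    print("Missing optional packages:", find_missing(OPTIONAL) or "none")
```

Any package listed as missing can be installed with the pip commands shown earlier in this section.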

Testing

Please first download the example dataset by running python scripts/download_datasets.py.
Next, compile a snapshot of FlowNet2 by running python scripts/download_flownet2.py.
Cityscapes
- Please download the pre-trained Cityscapes model by:
```
python scripts/street/download_models.py
```
- To test the model (bash ./scripts/street/test_2048.sh):
```
#!./scripts/street/test_2048.sh
python test.py --name label2city_2048 --label_nc 35 --loadSize 2048 --n_scales_spatial 3 --use_instance --fg --use_single_G
```
  The test results will be saved in: ./results/label2city_2048/test_latest/.
- We also provide a smaller model trained with a single GPU, which produces slightly worse results at 1024 x 512 resolution.
  - Please download the model by
```
python scripts/street/download_models_g1.py
```
  - To test the model (bash ./scripts/street/test_g1_1024.sh):
```
#!./scripts/street/test_g1_1024.sh
python test.py --name label2city_1024_g1 --label_nc 35 --loadSize 1024 --n_scales_spatial 3 --use_instance --fg --n_downsample_G 2 --use_single_G
```
- You can find more example scripts in the scripts/street/ directory.

Faces

Please download the pre-trained model by:
```
python scripts/face/download_models.py
```

To test the model (bash ./scripts/face/test_512.sh):

#!./scripts/face/test_512.sh
python test.py --name edge2face_512 --dataroot datasets/face/ --dataset_mode face --input_nc 15 --loadSize 512 --use_single_G

The test results will be saved in: ./results/edge2face_512/test_latest/.

Dataset

Cityscapes
- We use the Cityscapes dataset as an example. To train a model on the full dataset, please download it from the official website (registration required).
- We apply a pre-trained segmentation algorithm to get the corresponding semantic maps (train_A) and instance maps (train_inst).
- Please add the obtained images to the datasets folder in the same way the example images are provided.
Face
- We use the FaceForensics dataset. We then use landmark detection to estimate the face keypoints, and interpolate them to get face edges.
Pose
- We use random dancing videos found on YouTube. We then apply DensePose / OpenPose to estimate the poses for each frame.

Training with Cityscapes dataset

First, download the FlowNet2 checkpoint file by running python scripts/download_models_flownet2.py.

Training with 8 GPUs:

We adopt a coarse-to-fine approach, sequentially increasing the resolution from 512 x 256, 1024 x 512, to 2048 x 1024.
Train a model at 512 x 256 resolution (bash ./scripts/street/train_512.sh)

#!./scripts/street/train_512.sh
python train.py --name label2city_512 --label_nc 35 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 6 --n_frames_total 6 --use_instance --fg

Train a model at 1024 x 512 resolution (must train 512 x 256 first) (bash ./scripts/street/train_1024.sh):

#!./scripts/street/train_1024.sh
python train.py --name label2city_1024 --label_nc 35 --loadSize 1024 --n_scales_spatial 2 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 4 --use_instance --fg --niter_step 2 --niter_fix_global 10 --load_pretrain checkpoints/label2city_512

If you have TensorFlow installed, you can see TensorBoard logs in ./checkpoints/label2city_1024/logs by adding --tf_log to the training scripts.

Training with a single GPU:
- We trained our models using multiple GPUs. For convenience, we provide some sample training scripts (train_g1_XXX.sh) for single GPU users, up to 1024 x 512 resolution. Again a coarse-to-fine approach is adopted (256 x 128, 512 x 256, 1024 x 512). Performance is not guaranteed using these scripts.
- For example, to train a 256 x 128 video with a single GPU (bash ./scripts/street/train_g1_256.sh)
```
#!./scripts/street/train_g1_256.sh
python train.py --name label2city_256_g1 --label_nc 35 --loadSize 256 --use_instance --fg --n_downsample_G 2 --num_D 1 --max_frames_per_gpu 6 --n_frames_total 6
```
Training at full (2k x 1k) resolution
- Training at full resolution (2048 x 1024) requires 8 GPUs with at least 24G memory (bash ./scripts/street/train_2048.sh). If only GPUs with 12G/16G memory are available, please use the script ./scripts/street/train_2048_crop.sh, which will crop the images during training. Performance is not guaranteed with this script.

Training with face datasets

If you haven't already, please first download the example dataset by running python scripts/download_datasets.py.
Run the following command to compute face landmarks for the training dataset:
```
python data/face_landmark_detection.py train
```

Run the example script (bash ./scripts/face/train_512.sh)

python train.py --name edge2face_512 --dataroot datasets/face/ --dataset_mode face --input_nc 15 --loadSize 512 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 6 --n_frames_total 12

For single GPU users, example scripts are in train_g1_XXX.sh. These scripts are not fully tested, so please use them at your own discretion. If you still hit out-of-memory errors, try reducing max_frames_per_gpu.
More example scripts can be found in scripts/face/.
Please refer to More Training/Test Details for more explanations about training flags.

Training with pose datasets

If you haven't already, please first download the example dataset by running python scripts/download_datasets.py.
Example DensePose and OpenPose results are included. If you plan to use your own dataset, please generate these results and organize them in the same way as the example dataset.

Run the example script (bash ./scripts/pose/train_256p.sh)

python train.py --name pose2body_256p --dataroot datasets/pose --dataset_mode pose --input_nc 6 --num_D 2 --resize_or_crop ScaleHeight_and_scaledCrop --loadSize 384 --fineSize 256 --gpu_ids 0,1,2,3,4,5,6,7 --batchSize 8 --max_frames_per_gpu 3 --no_first_img --n_frames_total 12 --max_t_step 4

Again, for single GPU users, example scripts are in train_g1_XXX.sh. These scripts are not fully tested, so please use them at your own discretion. If you still hit out-of-memory errors, try reducing max_frames_per_gpu.
More example scripts can be found in scripts/pose/.
Please refer to More Training/Test Details for more explanations about training flags.

Training with your own dataset

If your input is a label map, please generate label maps which are one-channel whose pixel values correspond to the object labels (i.e. 0,1,...,N-1, where N is the number of labels). This is because we need to generate one-hot vectors from the label maps. Please use --label_nc N during both training and testing.
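
To illustrate why the label values must run from 0 to N-1, here is a minimal pure-Python sketch of turning a one-channel label map into N one-hot planes. The function name label_map_to_one_hot is hypothetical and this is not the repo's own implementation, which operates on tensors; it only shows the idea.

```python
def label_map_to_one_hot(label_map, num_labels):
    """Convert an H x W label map (nested lists of ints) into num_labels binary planes.

    Illustrative helper, not part of the repo. Plane k is 1 where the
    label equals k and 0 elsewhere, so every pixel is 1 in exactly one plane.
    """
    for row in label_map:
        for value in row:
            if not 0 <= value < num_labels:
                raise ValueError(f"label {value} is outside 0..{num_labels - 1}")
    return [
        [[1 if value == label else 0 for value in row] for row in label_map]
        for label in range(num_labels)
    ]


labels = [[0, 1],
          [2, 1]]
planes = label_map_to_one_hot(labels, num_labels=3)
print(planes[1])  # [[0, 1], [0, 1]]
```

A label value of N or larger has no plane to land in, which is why the sketch rejects it.
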
If your input is not a label map, please specify --input_nc N where N is the number of input channels (The default is 3 for RGB images).
The default setting for preprocessing is scaleWidth, which will scale the width of all training images to opt.loadSize (1024) while keeping the aspect ratio. If you want a different setting, please change it by using the --resize_or_crop option. For example, scaleWidth_and_crop first resizes the image to have width opt.loadSize and then does random cropping of size (opt.fineSize, opt.fineSize). crop skips the resizing step and only performs random cropping. scaledCrop crops the image while retaining the original aspect ratio. randomScaleHeight will randomly scale the image height to be between opt.loadSize and opt.fineSize. If you don't want any preprocessing, please specify none, which will do nothing other than making sure the image is divisible by 32.
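
The size arithmetic behind scaleWidth and none can be sketched as follows. Both function names are hypothetical, and the rounding rule (nearest multiple of 32) is an assumption made for illustration; the repo's dataset code may round differently.

```python
def scale_width(width, height, load_size):
    """Return (new_width, new_height) with width == load_size and the aspect ratio kept."""
    new_height = round(height * load_size / width)
    return load_size, new_height


def round_to_multiple(value, base=32):
    """Round value to the nearest multiple of base (ASSUMED rule), never below base."""
    return max(base, int(round(value / base)) * base)


# A 2048 x 1024 Cityscapes frame scaled to loadSize 1024.
print(scale_width(2048, 1024, 1024))  # (1024, 512)

# Making an arbitrary size divisible by 32.
print(round_to_multiple(1000), round_to_multiple(500))  # 992 512
```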

More Training/Test Details

We generate frames in the video sequentially, where the generation of the current frame depends on previous frames. To generate the first frame for the model, there are 3 different ways:
- 1. Using another generator which was trained on generating single images (e.g., pix2pixHD) by specifying --use_single_G. This is the option we use in the test scripts.
- 2. Using the first frame in the real sequence by specifying --use_real_img.
- 3. Forcing the model to also synthesize the first frame by specifying --no_first_img. This must be trained separately before inference.
The model is trained as follows. Suppose we have 8 GPUs, 4 for generators and 4 for discriminators, and we want to train on 28 frames. Also assume each GPU can generate only one frame. The first GPU generates the first frame and passes it to the next GPU, and so on. After the 4 frames are generated, they are passed to the 4 discriminator GPUs to compute the losses. The last generated frame then becomes the input to the next batch, and the next 4 frames in the training sequence are loaded onto the GPUs. This is repeated 7 times (4 x 7 = 28) to train on all 28 frames.
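
The batching in this example can be sketched in a few lines of Python. This is only a bookkeeping illustration of the 28-frame, 4-generator-GPU case described above, with a hypothetical helper name; it does not run any model or touch GPUs.

```python
def frame_batches(n_frames_total, frames_per_batch):
    """Yield (start, end) frame index ranges, one per training step (end exclusive)."""
    if n_frames_total % frames_per_batch != 0:
        raise ValueError("n_frames_total must be a multiple of frames_per_batch")
    for start in range(0, n_frames_total, frames_per_batch):
        yield start, start + frames_per_batch


# 4 generator GPUs, one frame each, 28 frames in the sequence.
batches = list(frame_batches(n_frames_total=28, frames_per_batch=4))
print(len(batches))  # 7

for step, (start, end) in enumerate(batches):
    # The last frame generated in the previous step seeds the current one.
    carried_over = None if step == 0 else start - 1
    print(f"step {step}: generate frames {start}..{end - 1}, previous frame: {carried_over}")
```
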
Some important flags:
- n_gpus_gen: the number of GPUs to use for generators (while the others are used for discriminators). We separate generators and discriminators into different GPUs since when dealing with high resolutions, even one frame cannot fit in a GPU. If the number is set to -1, there is no separation and all GPUs are used for both generators and discriminators (only works for low-res images).
- n_frames_G: the number of input frames to feed into the generator network; i.e., n_frames_G - 1 is the number of frames we look into the past. The default is 3 (conditioned on the previous two frames).
- n_frames_D: the number of frames to feed into the temporal discriminator. The default is 3.
- n_scales_spatial: the number of scales in the spatial domain. We train from the coarsest scale and all the way to the finest scale. The default is 3.
- n_scales_temporal: the number of scales for the temporal discriminator. The finest scale takes in the sequence in the original frame rate. The coarser scales subsample the frames by a factor of n_frames_D before feeding the frames into the discriminator. For example, if n_frames_D = 3 and n_scales_temporal = 3, the discriminator effectively sees 27 frames. The default is 3.
- max_frames_per_gpu: the number of frames in one GPU during training. If you run into an out-of-memory error, please first try to reduce this number. If your GPU memory can fit more frames, try increasing this number to make training faster. The default is 1.
- max_frames_backpropagate: the number of frames that loss backpropagates to previous frames. For example, if this number is 4, the loss on frame n will backpropagate to frame n-3. Increasing this number will slightly improve the performance, but also cause training to be less stable. The default is 1.
- n_frames_total: the total number of frames in a sequence we want to train with. We gradually increase this number during training.
- niter_step: the number of epochs after which we double n_frames_total. The default is 5.
- niter_fix_global: if this number is not 0, only train the finest spatial scale for this number of epochs before starting to fine-tune all scales.
- batchSize: the number of sequences to train at a time. We normally set batchSize to 1 since often, one sequence is enough to occupy all GPUs. If you want to do batchSize > 1, currently only batchSize == n_gpus_gen is supported.
- no_first_img: if not specified, the model will assume the first frame is given and synthesize the successive frames. If specified, the model will also try to synthesize the first frame instead.
- fg: if specified, use the foreground-background separation model as stated in the paper. The foreground labels must be specified by --fg_labels.
- no_flow: if specified, do not use flow warping and directly synthesize frames. We found this usually still works reasonably well when the background is static, while saving memory and training time.
- sparse_D: if specified, only apply temporal discriminator on sparse frames in the sequence. This helps save memory while having little effect on performance.
For other flags, please see options/train_options.py and options/base_options.py for all the training flags; see options/test_options.py and options/base_options.py for all the test flags.
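
Two of the flags above involve simple arithmetic that can be sketched in Python. The function names are hypothetical. The first two functions reproduce the n_scales_temporal example from the list (3 frames at 3 scales gives 27). The exact doubling rule in frames_at_epoch, including the cap max_frames, is an assumption made for illustration; see train.py for what the code actually does.

```python
def temporal_scale_offsets(n_frames_D=3, n_scales_temporal=3):
    """For each temporal scale, list the frame offsets that one discriminator input samples."""
    scales = []
    for s in range(n_scales_temporal):
        stride = n_frames_D ** s  # each coarser scale subsamples by another factor of n_frames_D
        scales.append([i * stride for i in range(n_frames_D)])
    return scales


def effective_frames(n_frames_D=3, n_scales_temporal=3):
    """Number of frames the temporal discriminator effectively sees, as quoted above."""
    return n_frames_D ** n_scales_temporal


def frames_at_epoch(epoch, n_frames_start, niter_step, max_frames):
    """ASSUMED schedule: double the sequence length every niter_step epochs, up to max_frames."""
    doublings = (epoch - 1) // niter_step
    return min(max_frames, n_frames_start * 2 ** doublings)


print(temporal_scale_offsets())  # [[0, 1, 2], [0, 3, 6], [0, 9, 18]]
print(effective_frames())        # 27
print([frames_at_epoch(e, n_frames_start=6, niter_step=5, max_frames=30)
       for e in (1, 6, 11, 16)])  # [6, 12, 24, 30]
```
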
Additional flags for edge2face examples:
- no_canny_edge: do not use canny edges for background as input.
- no_dist_map: by default, we use a distance transform on the face edge map as input. This flag will make it directly use edge maps.
Additional flags for pose2body examples:
- densepose_only: use only densepose results as input. Please also remember to change input_nc to be 3.
- openpose_only: use only openpose results as input. Please also remember to change input_nc to be 3.
- add_face_disc: add an additional discriminator that only works on the face region.
- remove_face_labels: remove densepose results for face, and add noise to openpose face results, so the network can get more robust to different face shapes. This is important if you plan to do inference on half-body videos (if not, usually this flag is unnecessary).
- random_drop_prob: the probability to randomly drop each pose segment during training, so the network can get more robust to missing poses at inference time. Default is 0.05.
- basic_point_only: if specified, only use basic joint keypoints for OpenPose output, without using any hand or face keypoints.

Citation

If you find this useful for your research, please cite the following paper.

@inproceedings{wang2018vid2vid,
   author    = {Ting-Chun Wang and Ming-Yu Liu and Jun-Yan Zhu and Guilin Liu
                and Andrew Tao and Jan Kautz and Bryan Catanzaro},
   title     = {Video-to-Video Synthesis},
   booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},   
   year      = {2018},
}

Acknowledgments

We thank Karan Sapra, Fitsum Reda, and Matthieu Le for generating the segmentation maps for us. We also thank Lisa Rhee for allowing us to use her dance videos for training. We thank William S. Peebles for proofreading the paper.
This code borrows heavily from pytorch-CycleGAN-and-pix2pix and pix2pixHD.

Comments

TypeError: __init__() got an unexpected keyword argument 'track_running_stats'

I have installed this repo in an NVIDIA Docker environment with CUDA 8.0, cuDNN 6.0, Miniconda with a Python 3.6 virtualenv, and PyTorch 0.2.0.

When I run the ./scripts/test_2048.sh script, I get the error below:

------------ Options -------------
aspect_ratio: 1.0
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/
dataset_mode: temporal
debug: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
gpu_ids: [0]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 2048
load_features: False
load_pretrain:
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 3
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_2048
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_first_img: False
no_flip: False
norm: batch
ntest: inf
output_nc: 3
phase: test
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
---------- Networks initialized -------------

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    model = create_model(opt)
  File "/vid2vid/models/models.py", line 19, in create_model
    modelG.initialize(opt)
  File "/vid2vid/models/vid2vid_model_G.py", line 51, in initialize
    self.netG_i = self.load_single_G() if self.use_single_G else None
  File "/vid2vid/models/vid2vid_model_G.py", line 270, in load_single_G
    netG = networks.define_G(input_nc, opt.output_nc, 0, 32, 'local', 4, 'instance', 0, self.gpu_ids, opt)
  File "/vid2vid/models/networks.py", line 39, in define_G
    netG = LocalEnhancer(input_nc, output_nc, ngf, n_downsampling, opt.n_blocks, opt.n_local_enhancers, opt.n_blocks_local, norm_layer)
  File "/vid2vid/models/networks.py", line 320, in __init__
    model_global = GlobalGenerator(input_nc, output_nc, ngf_global, n_downsample_global, n_blocks_global, norm_layer).model
  File "/vid2vid/models/networks.py", line 286, in __init__
    model = [nn.ReflectionPad2d(3), nn.Conv2d(input_nc, ngf, kernel_size=7, padding=0), norm_layer(ngf), activation]
TypeError: __init__() got an unexpected keyword argument 'track_running_stats'
(py3) root@0d93b7e85c1e:/vid2vid#

can anyone tell me how to solve it.

thks!

opened by zzzkk2009 15
ModuleNotFoundError: No module named 'resample2d_cuda'

Hi,

I ran into the error ModuleNotFoundError: No module named 'resample2d_cuda'. Do you know how to solve this?

The 'resample2d_package' folder (D:\download\vid2vid\models\flownet2_pytorch\networks\resample2d_package) contains the following:
__pycache__ __init__.py resample2d.py resample2d_cuda.cc resample2d_kernel.cu resample2d_kernel.cuh setup.py

Following is the cmd command.

D:\download\vid2vid>python test.py --name label2city_2048 --dataroot datasets/Cityscapes/test_A --loadSize 2048 --n_scales_spatial 3 --use_instance --fg --use_single_G
------------ Options -------------
aspect_ratio: 1.0
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/test_A
dataset_mode: temporal
debug: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
gpu_ids: [0]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 2048
load_features: False
load_pretrain:
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 3
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_2048
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_first_img: False
no_flip: False
norm: batch
ntest: inf
output_nc: 3
phase: test
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    model = create_model(opt)
  File "D:\download\vid2vid\models\models.py", line 7, in create_model
    from .vid2vid_model_G import Vid2VidModelG
  File "D:\download\vid2vid\models\vid2vid_model_G.py", line 13, in <module>
    from . import networks
  File "D:\download\vid2vid\models\networks.py", line 12, in <module>
    from .flownet2_pytorch.networks.resample2d_package.resample2d import Resample2d
  File "D:\download\vid2vid\models\flownet2_pytorch\networks\resample2d_package\resample2d.py", line 3, in <module>
    import resample2d_cuda
ModuleNotFoundError: No module named 'resample2d_cuda'

opened by TatsuyaOsugi 13
Download scripts no longer working

ModuleNotFoundError: No module named 'scripts.download_gdrive'

They need to be updated as the scripts are now in specific subfolders. Also, I believe an __init__.py file is required in the scripts folder.

opened by fniroui 11
Understanding the flow for training faces

Please correct me if I am wrong. ( I am focusing just on faces)

As I understand, vid2vid lets you provide a video from which each frame is like labeled data for training. So once one has a trained model, then given any input data of just edge-maps, then vid2vid will try to create a face (based on the trained data) from the edge maps.

I am not clear though how to do this with train.py. Do I need to generate edge-maps myself for each frame of my video?

Ideally I want to just provide vid2vid a single say .avi or video file and vid2vid generate edge-maps itself for each frame, outputs a trained model.

Thank you @tcwang0509 @junyanz

When answering, please include CLI commands that I can copy paste/directions that I can immediately do/changes to Python code that might be needed.

opened by fxfactorial 9
Using model on own data set.

Hi, this is an extremely interesting project. I have another data set that I would like to use. The data are images taken in 15-minute increments, and I want to transform them into other images. We have used pix2pix for this problem, but we also wanted to see if vid2vid will yield better results. How do the images need to be formatted or organized before training on them? Any help would be much appreciated.

opened by yuanzhou15 8

ModuleNotFoundError: No module named 'models.flownet2_pytorch.networks.resample2d_package._ext'

Failed to test vid2vid...

➜  vid2vid git:(master) python test.py --name label2city_2048 --label_nc 35 --loadSize 2048 --n_scales_spatial 3 --use_instance --fg --use_single_G
------------ Options -------------
add_face_disc: False
aspect_ratio: 1.0
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/
dataset_mode: temporal
debug: False
densepose_only: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
gpu_ids: [0]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 2048
load_features: False
load_pretrain: 
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 3
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_2048
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_canny_edge: False
no_dist_map: False
no_first_img: False
no_flip: False
no_flow: False
norm: batch
ntest: inf
openpose_only: False
output_nc: 3
phase: test
random_drop_prob: 0.2
remove_face_labels: False
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
start_frame: 0
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    model = create_model(opt)
  File "....../vid2vid/models/models.py", line 7, in create_model
    from .vid2vid_model_G import Vid2VidModelG
  File "....../vid2vid/models/vid2vid_model_G.py", line 13, in <module>
    from . import networks
  File "....../vid2vid/models/networks.py", line 12, in <module>
    from .flownet2_pytorch.networks.resample2d_package.resample2d import Resample2d
ModuleNotFoundError: No module named 'models.flownet2_pytorch'

opened by jiapei100 6

spatio-temporal growing

Hey, in your paper you mention in the experiment section: "Implementation details. We train our network in a spatio-temporally progressive manner. In particular, we start with generating low-resolution videos with few frames, and all the way up to generating full resolution videos with 30 (or more) frames."

How exactly did you do the scaling? I looked through your code, but couldn't find anything related to it. In particular, I would like to know whether you increase both spatial/temporal size at the same time, or one after another, and whether you adjusted other hparams when using it.

What I mean by adjusting hparams is that this progressive growing is mainly used to reduce train time I guess, so if the model would usually take N epochs to train on high res vids from scratch, you probably trained it on M<N epochs per progressive growing stage right? Then what is M? N/#stages? And did you use larger batch size for smaller resolutions, or make any other notable hparam changes?

opened by fa9r 5
Training a face model with train_g1_256.sh
Has anyone successfully trained a model with train_g1_256.sh? The readme says that the single GPU scripts were not well tested. For some reason the training finishes after only a few hours, and when I try to test I get:

ISSUE: Pretrained network G0 has fewer layers; The following are not initialized: ['model_down_img', 'model_down_seg', 'model_final_flow', 'model_final_img', 'model_final_w', 'model_res_flow', 'model_res_img', 'model_up_flow', 'model_up_img']
opened by petergerten 5

'Segmentation fault (core dumped)' still exists.

I downloaded the latest version (Aug 24th), and followed the steps in 'Read me' Paragraph 'Test', but I still get such error. My CUDA version is 9.0, PyTorch version is 4.1.

Can someone help me?

------------ Options -------------
aspect_ratio: 1.0
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/test_A
dataset_mode: temporal
debug: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
gpu_ids: [1, 2]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 2048
load_features: False
load_pretrain: 
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 3
n_frames_G: 3
n_gpus_gen: 2
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_2048
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_first_img: False
no_flip: False
norm: batch
ntest: inf
output_nc: 3
phase: test
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
---------- Networks initialized -------------
-----------------------------------------------
Doing 28 frames
Segmentation fault (core dumped)

opened by joe1chief 5

Add Dockerfile and launch script

I hope this docker image is useful to others! Would have saved me a couple days of trial and error work fighting competing dependencies

I've tested this Dockerfile so far as it can run bash ./scripts/test_1024_g1.sh. I have a single GPU setup so I can't test any multi-GPU features.

If I encounter further issues as I use vid2vid, I'll update this pull request.

opened by dustinfreeman 4

Sometimes ran into RuntimeError: Given groups=1, weight of size [64, 18, 7, 7]... when training.

After I train the model with the following parameters:

python train.py --name pose \
--dataroot datasets/pose --dataset_mode pose \
--input_nc 6 --ngf 64 --num_D 2 \
--resize_or_crop scaleHeight_and_scaledCrop --loadSize 288 --fineSize 256 \
--niter 5 --niter_decay 5 \
--n_frames_total 20 --max_t_step 4 \
--max_frames_per_gpu 8

Logs

(epoch: 8, iters: 18006, time: 4.986) D_T_fake0: 0.064 D_T_fake1: 0.228 D_T_real0: 0.167 D_T_real1: 0.120 D_fake: 0.161 D_real: 0.498 G_GAN: 2.842 G_GAN_Feat: 5.241 G_T_GAN>
(epoch: 8, iters: 18106, time: 5.215) D_T_fake0: 0.038 D_T_fake1: 0.124 D_T_real0: 0.072 D_T_real1: 0.155 D_fake: 0.494 D_real: 0.542 G_GAN: 2.327 G_GAN_Feat: 5.192 G_T_GAN>
(epoch: 8, iters: 18206, time: 5.250) D_T_fake0: 0.050 D_T_fake1: 0.053 D_T_real0: 0.139 D_T_real1: 0.108 D_fake: 0.361 D_real: 0.640 G_GAN: 2.527 G_GAN_Feat: 4.626 G_T_GAN>
(epoch: 8, iters: 18306, time: 5.295) D_T_fake0: 0.019 D_T_fake1: 0.229 D_T_real0: 0.278 D_T_real1: 0.049 D_fake: 0.573 D_real: 0.695 G_GAN: 2.337 G_GAN_Feat: 4.622 G_T_GAN>

Traceback

Traceback (most recent call last):
  File "train.py", line 148, in <module>
    train()
  File "train.py", line 55, in train
    fake_B, fake_B_raw, flow, weight, real_A, real_Bp, fake_B_last = modelG(input_A, input_B, inst_A, fake_B_prev_last)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/models.py", line 37, in forward
    outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/vid2vid/models/vid2vid_model_G.py", line 133, in forward
    fake_B, fake_B_raw, flow, weight = self.generate_frame_train(netG, real_A_all, fake_B_prev, start_gpu, is_first_frame)
  File "/vid2vid/models/vid2vid_model_G.py", line 178, in generate_frame_train
    fake_B_feat, flow_feat, fake_B_fg_feat, use_raw_only)
  File "/vid2vid/models/networks.py", line 204, in forward
    downsample = self.model_down_seg(input) + self.model_down_img(img_prev)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [64, 18, 7, 7], expected input[1, 12, 262, 198] to have 18 channels, but got 12 channels instead

I tried to continue training this model, but the error still happened after 10000~100000 iterations.

opened by sheiun 3

RuntimeError: DataLoader worker (pid(s) 22100) exited unexpectedly

Dear and Near,

I tried all the possibilities. But no luck. Could anyone help me to figure out the issue? Your help is very much appreciated!

I am using a Windows 11 PC with an NVIDIA GeForce RTX 3090 GPU.

------------ Options -------------
add_face_disc: False
aspect_ratio: 1.0
basic_point_only: False
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/
dataset_mode: temporal
debug: False
densepose_only: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
fp16: False
gpu_ids: [0]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 1024
load_features: False
load_pretrain: 
local_rank: 0
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 2
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_1024_g1
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_canny_edge: False
no_dist_map: False
no_first_img: False
no_flip: False
no_flow: False
norm: batch
ntest: inf
openpose_only: False
output_nc: 3
phase: test
random_drop_prob: 0.05
random_scale_points: False
remove_face_labels: False
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
start_frame: 0
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
---------- Networks initialized -------------
-----------------------------------------------
Doing 28 frames
[]
Num GPUs Available:  0
2022-09-07 18:43:37.812230: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-07 18:43:37.818042: I tensorflow/compiler/xla/service/service.cc:170] XLA service 0x2d8390eac00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-09-07 18:43:37.818151: I tensorflow/compiler/xla/service/service.cc:178]   StreamExecutor device (0): Host, Default Version
[0]
Device ID (unmasked): 0
Device ID (masked): 0
a+b=42
2022-09-07 18:43:37.823294: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
------------ Options -------------
add_face_disc: False
aspect_ratio: 1.0
basic_point_only: False
batchSize: 1
checkpoints_dir: ./checkpoints
dataroot: datasets/Cityscapes/
dataset_mode: temporal
debug: False
densepose_only: False
display_id: 0
display_winsize: 512
feat_num: 3
fg: True
fg_labels: [26]
fineSize: 512
fp16: False
gpu_ids: [0]
how_many: 300
input_nc: 3
isTrain: False
label_feat: False
label_nc: 35
loadSize: 1024
load_features: False
load_pretrain: 
local_rank: 0
max_dataset_size: inf
model: vid2vid
nThreads: 2
n_blocks: 9
n_blocks_local: 3
n_downsample_E: 3
n_downsample_G: 2
n_frames_G: 3
n_gpus_gen: 1
n_local_enhancers: 1
n_scales_spatial: 3
name: label2city_1024_g1
ndf: 64
nef: 32
netE: simple
netG: composite
ngf: 128
no_canny_edge: False
no_dist_map: False
no_first_img: False
no_flip: False
no_flow: False
norm: batch
ntest: inf
openpose_only: False
output_nc: 3
phase: test
random_drop_prob: 0.05
random_scale_points: False
remove_face_labels: False
resize_or_crop: scaleWidth
results_dir: ./results/
serial_batches: False
start_frame: 0
tf_log: False
use_instance: True
use_real_img: False
use_single_G: True
which_epoch: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [TestDataset] was created
vid2vid
---------- Networks initialized -------------
-----------------------------------------------
Doing 28 frames
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\Python_Code\venv\vid2vid-master\test.py", line 75, in <module>
    for i, data in enumerate(dataset):
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 368, in __iter__
    return self._get_iterator()
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 927, in __init__
    w.start()
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\multiprocessing\queues.py", line 108, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:/Python_Code/venv/vid2vid-master/test.py", line 75, in <module>
    for i, data in enumerate(dataset):
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
    data = self._next_data()
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "C:\ProgramData\Miniconda3\envs\tf2.4\lib\site-packages\torch\utils\data\dataloader.py", line 1024, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 22100) exited unexpectedly
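
The first traceback above points at its own likely cause. On Windows, worker processes are started with "spawn", which re-imports the main script, so a DataLoader loop written at module level in test.py is reached again inside each worker. That would also explain why the options are printed twice. Below is a minimal sketch of the two usual workarounds, not the repository's code: wrap the script body in a `__main__` guard, or fall back to zero workers so loading stays in the main process. `pick_num_workers` and `main` are hypothetical names, and it is an assumption that `--nThreads` is what the repo passes to the DataLoader as `num_workers`.

```python
import multiprocessing as mp


def pick_num_workers(requested, start_method=None):
    """Return a worker count that is safe even without a __main__ guard.

    Only the "fork" start method copies the running process. "spawn" and
    "forkserver" re-import the main script, so without a guard the safe
    choice is 0 workers (data is then loaded in the main process).
    """
    if start_method is None:
        start_method = mp.get_start_method()
    return requested if start_method == "fork" else 0


def main():
    # Hypothetical stand-in for the body of test.py: everything that builds
    # the DataLoader and iterates over it would live inside this function.
    workers = pick_num_workers(requested=2)
    print(f"start method: {mp.get_start_method()}, workers: {workers}")
    return workers


if __name__ == "__main__":
    # With this guard, a re-imported child process does not run main() again.
    main()
```

With the guard in place the requested worker count can usually be kept. Without it, running with `--nThreads 0` is the quicker experiment, at the cost of slower data loading.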

opened by am-official 0

Errors when running on CPU without CUDA

When I set --gpu_id=-1, the models start raising errors during initialization, such as:

Traceback (most recent call last):
  File "train.py", line 148, in <module>
    train()
  File "train.py", line 28, in train
    models = create_model(opt)
  File "C:\Users\moish\Desktop\vid2vid\models\models.py", line 76, in create_model
    modelG.initialize(opt)
  File "C:\Users\moish\Desktop\vid2vid\models\vid2vid_model_G.py", line 59, in initialize
    self.n_frames_per_gpu = min(self.opt.max_frames_per_gpu, self.opt.n_frames_total // self.n_gpus) # number of frames in each GPU
ZeroDivisionError: integer division or modulo by zero

When I then set --n_gpus_gen 1, it gets past that error, but then this comes up:

Traceback (most recent call last):
  File "train.py", line 148, in <module>
    train()
  File "train.py", line 28, in train
    models = create_model(opt)
  File "C:\Users\moish\Desktop\vid2vid\models\models.py", line 78, in create_model
    modelD.initialize(opt)
  File "C:\Users\moish\Desktop\vid2vid\models\vid2vid_model_D.py", line 22, in initialize
    self.gpu_ids = ([opt.gpu_ids[0]] + opt.gpu_ids[gpu_split_id:]) if opt.n_gpus_gen != len(opt.gpu_ids) else opt.gpu_ids
IndexError: list index out of range

Forcibly setting it to -1 gets past that error, but then gives me:

Traceback (most recent call last):
  File "train.py", line 148, in <module>
    train()
  File "train.py", line 28, in train
    models = create_model(opt)
  File "C:\Users\moish\Desktop\vid2vid\models\models.py", line 78, in create_model
    modelD.initialize(opt)
  File "C:\Users\moish\Desktop\vid2vid\models\vid2vid_model_D.py", line 37, in initialize
    opt.num_D, not opt.no_ganFeat, gpu_ids=self.gpu_ids)
  File "C:\Users\moish\Desktop\vid2vid\models\networks.py", line 66, in define_D
    netD.cuda(gpu_ids[0])
  File "C:\Users\moish\.conda\envs\vid2vid\lib\site-packages\torch\nn\modules\module.py", line 304, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "C:\Users\moish\.conda\envs\vid2vid\lib\site-packages\torch\nn\modules\module.py", line 201, in _apply
    module._apply(fn)
  File "C:\Users\moish\.conda\envs\vid2vid\lib\site-packages\torch\nn\modules\module.py", line 201, in _apply
    module._apply(fn)
  File "C:\Users\moish\.conda\envs\vid2vid\lib\site-packages\torch\nn\modules\module.py", line 223, in _apply
    param_applied = fn(param)
  File "C:\Users\moish\.conda\envs\vid2vid\lib\site-packages\torch\nn\modules\module.py", line 304, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: Device index must not be negative

I'm stuck y'all.
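
The three tracebacks above look like one problem seen from three sides: with no usable GPU id, the id list is either empty (division by zero, then `opt.gpu_ids[0]` out of range) or contains -1 (a negative CUDA index). Below is a minimal sketch of the guards that would avoid those three failures, assuming the id list is parsed from a comma-separated string. All three function names are hypothetical and not part of vid2vid, and this alone would not make the code run on CPU, since other parts such as the compiled CUDA ops may still require a GPU.

```python
def parse_gpu_ids(spec):
    """Turn a string such as "0,1" or "-1" into a list of non-negative ids."""
    ids = [int(tok) for tok in spec.split(",") if tok.strip()]
    return [i for i in ids if i >= 0]


def frames_per_device(n_frames_total, max_frames_per_gpu, gpu_ids):
    """Split frames across devices, counting "no GPU" as a single CPU device."""
    n_devices = max(len(gpu_ids), 1)  # never divide by zero
    return min(max_frames_per_gpu, n_frames_total // n_devices)


def device_name(gpu_ids):
    """Device string for moving a module; never a negative CUDA index."""
    return f"cuda:{gpu_ids[0]}" if gpu_ids else "cpu"


gpu_ids = parse_gpu_ids("-1")
print(gpu_ids, frames_per_device(6, 2, gpu_ids), device_name(gpu_ids))
```

In PyTorch, a string such as the one returned by `device_name` can be passed to `module.to(...)`, which avoids calling `.cuda()` with an index at all.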

opened by PinPointPing 0

RuntimeError: Legacy autograd function with non-static forward method is deprecated.
When running

python train.py --name edge2face_512 --dataroot datasets/face/ --dataset_mode face --input_nc 15 --loadSize 256 --num_D 1 --max_frames_per_gpu 2 --n_frames_total 6

I get the error in the title. I am using PyTorch 1.9.0 and torchvision 0.10.0.

Note: I tried compiling with PyTorch 0.4.1 but was unable to compile resample2d_cuda. Even after running install.sh, I get an import error:

import resample2d_cuda
ImportError: libc10.so: cannot open shared object file: No such file or directory
opened by avani17101 2
RuntimeError: CUDA error: throwing an instance of 'c10::Error'
Hi all,

I'm getting this error while training vid2vid:

RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'

The error appears when the script starts the validation phase. Here is the entire log: log.txt

I'm training vid2vid on the KITTI dataset using a server with 8 x 12 GB Quadro M6000 GPUs. I'm also attaching the config file: ampO1.txt

Any suggestions?

Thanks :)
opened by alamayreh 1
How to test "pose-to-body"?

Thanks for sharing! Once I finish training pose-to-body, how do I run test.py? What is the Python command? And where will the weights be saved after training?
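
On where the weights end up: the option dumps elsewhere on this page show `checkpoints_dir: ./checkpoints`, a `name`, and `which_epoch: latest`, which suggests checkpoints live under `<checkpoints_dir>/<name>/`. Below is a small sketch of that path logic. The file-name pattern and the `G0` network label are assumptions based on the pix2pixHD convention, and `pose2body_example` is a hypothetical experiment name, so check the result against the files a training run actually writes.

```python
import os


def checkpoint_path(checkpoints_dir, name, which_epoch="latest", net="G0"):
    """Build the path where a saved network is expected to be.

    Assumption: files are named "<epoch>_net_<network>.pth" inside
    "<checkpoints_dir>/<name>/"; this is not confirmed by the source.
    """
    return os.path.join(checkpoints_dir, name, f"{which_epoch}_net_{net}.pth")


# "pose2body_example" stands in for whatever was passed as --name in training.
print(checkpoint_path("./checkpoints", "pose2body_example"))
```

If that holds, testing would reuse the training `--name` (and the same model options) so that test.py looks in the same folder.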

opened by guofengming11 0

The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging

Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

Official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation

Pytorch implementation of "ARM: Any-Time Super-Resolution Method"

PyTorch implementation of the Deep SLDA method from our CVPRW-2020 paper "Lifelong Machine Learning with Deep Streaming Linear Discriminant Analysis"

PyTorch implementation of our method for adversarial attacks and defenses in hyperspectral image classification.

[CVPR 2022] Official PyTorch Implementation for "Reference-based Video Super-Resolution Using Multi-Camera Video Triplets"

Implementation of 'lightweight' GAN, proposed in ICLR 2021, in Pytorch. High resolution image generations that can be trained within a day or two

This is an official pytorch implementation of Lite-HRNet: A Lightweight High-Resolution Network.

Unofficial PyTorch Implementation of UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Official PyTorch implementation of "VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization" (CVPR 2021)

Unofficial pytorch implementation of the paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution"

PyTorch implementation of our Adam-NSCL algorithm from our CVPR2021 (oral) paper "Training Networks in Null Space for Continual Learning"

PyTorch code for our paper "Attention in Attention Network for Image Super-Resolution"

PyTorch code for our paper "Image Super-Resolution with Non-Local Sparse Attention" (CVPR2021).

PyTorch code for our ECCV 2020 paper "Single Image Super-Resolution via a Holistic Attention Network"