Implicit Internal Video Inpainting
Implementation of our ICCV 2021 paper: Internal Video Inpainting by Implicit Long-range Propagation
paper | project website | 4K data | demo video
Introduction
Want to remove objects from a video without days of training and thousands of training videos? Try our simple but effective internal video inpainting method. The inpainting process is zero-shot and implicit, requiring neither pretraining on large datasets nor optical-flow estimation. We further extend the proposed method to two more challenging tasks: video object removal with limited mask annotations, and inpainting of ultra high-resolution videos (e.g., 4K videos).
TO DO
- Release code for 4K video inpainting
Setup
Installation
git clone https://github.com/Tengfei-Wang/Implicit-Internal-Video-Inpainting.git
cd Implicit-Internal-Video-Inpainting
Environment
This code is based on TensorFlow 2.x (tested with TensorFlow 2.2 and 2.4).
The environment can be set up easily with Anaconda:
conda create -n IIVI python=3.7
conda activate IIVI
conda install tensorflow-gpu tensorboard
pip install pyaml
pip install opencv-python
pip install tensorflow-addons
Alternatively, you can set up the environment from the provided environment.yml:
conda env create -f environment.yml
conda activate IIVI
Usage
Quick Start
We provide an example sequence 'bmx-trees' in ./inputs/. To try our method:
python train.py
The default number of iterations is set to 50,000 in config/train.yml, and the internal learning takes ~4 hours on a single GPU. During training, you can monitor the inpainting results with TensorBoard:
tensorboard --logdir ./exp/logs
After training, the final results can be saved to ./exp/results/ by running:
python test.py
You can also modify 'model_restore' in config/test.yml to save results from different checkpoints.
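The exact key layout depends on the released config, but the entry is a single checkpoint path. A hypothetical example:
model_restore: ./exp/logs/checkpoint_40000   # hypothetical path: point this to any saved checkpoint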
Try Your Own Data
Data preprocess
Before training, we advise dilating the object masks to exclude edge pixels; otherwise, imperfectly annotated masks can lead to artifacts in the object removal task.
You can generate and preprocess the masks by this script:
python scripts/preprocess_mask.py --annotation_path inputs/annotations/bmx-trees
Basic training
Modify config/train.yml, which specifies the video path, log path, training iterations, etc. The required number of training iterations depends on the video length: 100-frame videos typically take 30,000 ~ 80,000 iterations to converge. By default, we use only the reconstruction loss for training, which works well in most cases.
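As a rough sketch of these options (the key names below are illustrative; check the released config/train.yml for the exact ones):
video_path: ./inputs/frames/bmx-trees   # hypothetical key: directory of input frames
mask_path: ./inputs/masks/bmx-trees     # hypothetical key: directory of object masks
log_dir: ./exp/logs                     # hypothetical key: TensorBoard logs and checkpoints
max_iters: 50000                        # hypothetical key: 30,000 ~ 80,000 for ~100-frame videos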
python train.py
Improve the sharpness and consistency
For some hard videos, the basic training may not produce pleasing results. In that case, you can fine-tune the trained model with additional losses. To this end, set 'model_restore' in config/train.yml to the checkpoint path from basic training, set ambiguity_loss or stabilization_loss to True, and then fine-tune the basic checkpoint for another 20,000 ~ 40,000 iterations.
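A sketch of a fine-tuning config (the path and iteration key are hypothetical; ambiguity_loss and stabilization_loss are the flags mentioned above):
model_restore: ./exp/logs/checkpoint_50000   # hypothetical path: checkpoint from basic training
ambiguity_loss: True                         # for sharper details
stabilization_loss: True                     # for better temporal consistency
max_iters: 30000                             # hypothetical key: 20,000 ~ 40,000 fine-tuning iterations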
python train.py
Inference
Modify ./config/test.yml, which specifies the video path, log path, and save path.
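A hypothetical sketch of these entries (key names are illustrative; see the released config/test.yml for the exact ones):
video_path: ./inputs/frames/bmx-trees   # hypothetical key: frames to inpaint
mask_path: ./inputs/masks/bmx-trees     # hypothetical key: corresponding object masks
log_dir: ./exp/logs                     # hypothetical key: log directory
save_path: ./exp/results                # hypothetical key: where inpainted frames are written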
python test.py
Mask Propagation from a Single Frame
When you annotate the object mask in only one frame (or a few frames), our method can propagate it to the other frames automatically.
Modify ./config/train_mask.yml. We typically set the training iterations to 4,000 ~ 20,000 and the learning rate to 1e-5 ~ 1e-4.
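A sketch of these settings (key names are illustrative; see the released config/train_mask.yml):
annotation_path: ./inputs/annotations/bmx-trees   # hypothetical key: the annotated frame(s)
max_iters: 10000        # hypothetical key: 4,000 ~ 20,000 depending on the video
learning_rate: 5.0e-5   # hypothetical key: 1e-5 ~ 1e-4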
python train_mask.py
After training, modify ./config/test_mask.yml, and then run:
python test_mask.py
High-resolution Video Inpainting
Our 4K videos and mask annotations can be downloaded from 4K data.
More Results
Our results on 70 DAVIS videos (including failure cases) can be found here for your reference :)
If you need the PNG version of our uncompressed results, please contact the authors.
Citation
If you find this work useful for your research, please cite:
@inproceedings{ouyang2021video,
title={Internal Video Inpainting by Implicit Long-range Propagation},
author={Ouyang, Hao and Wang, Tengfei and Chen, Qifeng},
booktitle={International Conference on Computer Vision (ICCV)},
year={2021}
}
If you are also interested in image inpainting or internal learning, this paper may also be helpful :)
@inproceedings{wang2021image,
title={Image Inpainting with External-internal Learning and Monochromic Bottleneck},
author={Wang, Tengfei and Ouyang, Hao and Chen, Qifeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5120--5129},
year={2021}
}
Contact
Please email Hao Ouyang or Tengfei Wang if you have any questions.