Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion (MiVOS)
Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang
CVPR 2021
[arXiv] [Paper PDF] [Project Page] [Demo] [Papers with Code]
Credit (left to right): DAVIS 2017, Academy of Historical Fencing, Modern History TV
We manage the project using three different repositories (corresponding to the three components in the paper title). This is the main repo; see also Mask-Propagation and Scribble-to-Mask.
Overall structure and capabilities
| | MiVOS | Mask-Propagation | Scribble-to-Mask |
|---|---|---|---|
| DAVIS/YouTube semi-supervised evaluation | | ✔ | |
| DAVIS interactive evaluation | ✔ | | |
| User interaction GUI tool | ✔ | | |
| Dense Correspondences | | ✔ | |
| Train propagation module | | ✔ | |
| Train S2M (interaction) module | | | ✔ |
| Train fusion module | ✔ | | |
| Generate more synthetic data | ✔ | | |
Framework
Requirements
We used these packages/versions in the development of this project. Higher versions of the same packages will likely also work. This is not an exhaustive list -- other common Python packages (e.g., Pillow) are expected but not listed.
- PyTorch 1.7.1
- torchvision 0.8.2
- OpenCV 4.2.0
- Cython
- progressbar
- davis-interactive (https://github.com/albertomontesg/davis-interactive)
- PyQt5 for the GUI
- networkx 2.4 for DAVIS
- gitpython for training
- gdown for downloading pretrained models
Refer to the official PyTorch guide for installing PyTorch/torchvision. The rest can be installed by:
```bash
pip install PyQt5 davisinteractive progressbar2 opencv-python networkx gitpython gdown Cython
```
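To quickly confirm that your environment roughly matches the versions above, a minimal check such as the following (covering only the core packages listed here) can be run:

```python
# Minimal environment sanity check -- prints the versions of the core
# dependencies listed above so they can be compared against 1.7.1 / 0.8.2 / 4.2.0.
import torch
import torchvision
import cv2
import networkx

print("PyTorch       :", torch.__version__)
print("torchvision   :", torchvision.__version__)
print("OpenCV        :", cv2.__version__)
print("networkx      :", networkx.__version__)
print("CUDA available:", torch.cuda.is_available())
```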
Quick start
- `python download_model.py` to get all the required models.
- `python interactive_gui.py --video` or `python interactive_gui.py --images` (pointing to a video file or a folder of images, respectively). A video has been prepared for you at `examples/example.mp4`.
- If you need to label more than one object, additionally specify `--num_objects`.
- There are instructions in the GUI. You can also watch the demo videos for some ideas.
Main Results
DAVIS/YouTube semi-supervised results
DAVIS Interactive Track
All results are generated using the unmodified official DAVIS interactive bot, without saving masks (`--save_mask` not specified), on an RTX 2080Ti. We follow the official protocol.
Precomputed results, with the JSON summary: [Google Drive] [OneDrive]
The results can be reproduced with `eval_interactive_davis.py`.
Model | AUC-J&F | J&F @ 60s |
---|---|---|
Baseline | 86.0 | 86.6 |
(+) Top-k | 87.2 | 87.8 |
(+) BL30K pretraining | 87.4 | 88.0 |
(+) Learnable fusion | 87.6 | 88.2 |
(+) Difference-aware fusion (full model) | 87.9 | 88.5 |
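For context, the official interactive evaluation is driven by the `davisinteractive` package listed in the requirements; the sketch below shows the general shape of such an evaluation loop. The actual logic lives in `eval_interactive_davis.py`; `segment_from_scribbles` is a hypothetical placeholder for the MiVOS interaction-to-mask, propagation and fusion pipeline.

```python
# Sketch of a DAVIS interactive evaluation loop using the official
# davisinteractive package. segment_from_scribbles is a hypothetical
# placeholder standing in for the MiVOS pipeline.
from davisinteractive.session import DavisInteractiveSession

def segment_from_scribbles(sequence, scribbles):
    # Placeholder: should return per-frame masks as an integer array
    # of shape (num_frames, height, width) with one id per object.
    raise NotImplementedError

with DavisInteractiveSession(host='localhost',
                             davis_root='path/to/DAVIS/2017/trainval') as session:
    while session.next():
        # The bot provides scribbles; the model must return full-video masks.
        sequence, scribbles, new_sequence = session.get_scribbles()
        pred_masks = segment_from_scribbles(sequence, scribbles)
        session.submit_masks(pred_masks)

    # AUC-J&F and J&F @ 60s are computed from the session report.
    report = session.get_report()
```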
Pretrained models
`python download_model.py` should get you all the models that you need (`pip install gdown` required).
Training
Data preparation
Datasets should be arranged in the following layout. You can use `download_datasets.py` (same as the one in Mask-Propagation) to get the DAVIS dataset, and manually download and extract fusion_data ([OneDrive]) and BL30K.
```
├── BL30K
├── DAVIS
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── fusion_data
└── MiVOS
```
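If you want to sanity-check the layout before training, a minimal sketch like the following (the relative root path is an assumption; adjust it to your setup) will do:

```python
# Verify that the expected dataset layout exists next to the MiVOS repo.
# The root path is an assumption -- point it at the directory that
# contains BL30K, DAVIS, fusion_data and MiVOS.
from pathlib import Path

root = Path('..')
expected = [
    'BL30K',
    'DAVIS/2017/test-dev/Annotations',
    'DAVIS/2017/trainval/Annotations',
    'fusion_data',
]
for rel in expected:
    status = 'OK     ' if (root / rel).is_dir() else 'MISSING'
    print(f'{status} {rel}')
```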
BL30K
BL30K is a synthetic dataset rendered using Blender with ShapeNet's data. We break the dataset into six segments, each with approximately 5K videos. The videos are organized in a similar format to DAVIS and YouTubeVOS, so dataloaders for those datasets can be used directly. Each video is 160 frames long, and each frame has a resolution of 768×512. There are 3-5 objects per video, and each object follows a random smooth trajectory -- we tried to greedily optimize the trajectories to minimize object intersections (not guaranteed), and occlusions are still possible (and in fact happen often). See `generation/blender/generate_yaml.py` for details.
We found that using roughly half of the data is sufficient to reach full performance (although we still used all of it), while using less than one-sixth (~5K videos) is insufficient.
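Since the sequences follow the DAVIS/YouTubeVOS convention of per-frame JPEG images and palette-indexed PNG annotations, a minimal loading sketch could look like the following; the `JPEGImages`/`Annotations` subfolder names are assumptions based on that convention rather than a statement of the exact BL30K layout.

```python
# Load one DAVIS/YouTubeVOS-style sequence: JPEG frames plus palette-indexed
# PNG annotations where each pixel value is an object id (0 = background).
# The subfolder names are assumptions based on the DAVIS convention.
from pathlib import Path
import numpy as np
from PIL import Image

def load_sequence(root, video):
    frame_dir = Path(root) / 'JPEGImages' / video
    anno_dir = Path(root) / 'Annotations' / video
    frames, masks = [], []
    for frame_path in sorted(frame_dir.glob('*.jpg')):
        frames.append(np.array(Image.open(frame_path).convert('RGB')))
        anno_path = anno_dir / (frame_path.stem + '.png')
        if anno_path.exists():
            # Palette ('P' mode) PNGs keep object ids as raw pixel values.
            masks.append(np.array(Image.open(anno_path)))
    return frames, masks
```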
Download
You can either use the automatic download script `download_bl30k.py` or download it manually from the links below. Note that each segment is about 115GB in size -- about 700GB in total. You will need roughly 1TB of free disk space to run the script (including the extraction buffer).
Google Drive is much faster in my experience. Your mileage might vary.
Manual download: [Google Drive] [OneDrive]
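If you want to verify up front that the ~1TB of free space is actually available, a small check like this (the target path is an assumption) can help:

```python
# Check free disk space before kicking off download_bl30k.py.
# The path is an assumption -- point it at the drive you will download to.
import shutil

free_tib = shutil.disk_usage('.').free / 1024**4
print(f'Free space: {free_tib:.2f} TiB')
if free_tib < 1.0:
    print('Warning: less than ~1 TiB free; BL30K download + extraction may not fit.')
```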
Generation
- Download ShapeNet.
- Install Blender. (We used 2.82.)
- Download a set of background and texture images. We used this repo (we specified "non-commercial reuse" in the script); the list of keywords is provided in generation/blender/*.json.
- Generate a list of configuration files (generation/blender/generate_yaml.py).
- Run rendering on the configurations. See here. (Not documented in detail; open an issue if you have questions.)
Fusion data
We use the propagation module to run through some data and obtain real propagation outputs with which to train the fusion module. See the script `generate_fusion.py`.
Alternatively, you can download the pre-generated fusion data linked in the Data preparation section above.
Training commands
These commands are to train the fusion module only.
```bash
CUDA_VISIBLE_DEVICES=[a,b] OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port [cccc] --nproc_per_node=2 train.py --id [defg] --stage [h]
```
We implemented training with Distributed Data Parallel (DDP) on two 11GB GPUs. Replace `a,b` with the GPU ids, `cccc` with an unused port number, `defg` with a unique experiment identifier, and `h` with the training stage (0/1).
The model is trained progressively in different stages (0: BL30K; 1: DAVIS). After each stage finishes, we start the next stage by loading the trained weights. A pretrained propagation model is required to train the fusion module.
One concrete example is:

Pre-training on the BL30K dataset:

```bash
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 0 --id retrain_s0
```

Main training:

```bash
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network [path_to_trained_s0.pth]
```
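For context, `torch.distributed.launch` starts one process per GPU and passes a `--local_rank` argument to each; a minimal sketch of the DDP boilerplate such a launch expects is shown below. This is an illustration only, not the repo's actual `train.py`.

```python
# Minimal DDP boilerplate of the kind torch.distributed.launch expects.
# Illustration only -- not the repo's actual train.py.
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by the launcher
args = parser.parse_args()

# The launcher sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the environment.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(10, 2).cuda(args.local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[args.local_rank])

# ... build the stage-specific dataset with a DistributedSampler and train ...
```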
Credit
f-BRS: https://github.com/saic-vul/fbrs_interactive_segmentation
ivs-demo: https://github.com/seoungwugoh/ivs-demo
deeplab: https://github.com/VainF/DeepLabV3Plus-Pytorch
STM: https://github.com/seoungwugoh/STM
BlenderProc: https://github.com/DLR-RM/BlenderProc
Citation
Please cite our paper if you find this repo useful!
```bibtex
@inproceedings{MiVOS_2021,
  title={Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion},
  author={Cheng, Ho Kei and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2021}
}
```
Contact: [email protected]