End-to-End Referring Video Object Segmentation with Multimodal Transformers

Last update: Dec 30, 2022

This repo contains the official implementation of the paper:

End-to-End Referring Video Object Segmentation with Multimodal Transformers

How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files)
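Before running anything, it can help to confirm that the files landed where the layout above expects them. The following is a minimal sketch, not part of the repo: the `missing_entries` helper and the `A2D_REQUIRED` list are hypothetical names introduced here, and the script assumes it is run from the root of the cloned `MTTR/` directory.

```python
from pathlib import Path

# Entries taken from the A2D-Sentences layout shown above,
# relative to the root of the cloned repo (hypothetical list name).
A2D_REQUIRED = [
    "a2d_sentences/Release/videoset.csv",
    "a2d_sentences/Release/CLIPS320",
    "a2d_sentences/text_annotations/a2d_annotation.txt",
    "a2d_sentences/text_annotations/a2d_missed_videos.txt",
    "a2d_sentences/text_annotations/a2d_annotation_with_instances",
]


def missing_entries(repo_root, required):
    """Return the required relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".", A2D_REQUIRED)
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("A2D-Sentences layout looks complete.")
```

The same helper can be reused for the other datasets by passing a different list of relative paths.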

JHMDB-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)

Environment Installation

The code was tested on a Conda environment installed on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

PyTorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that the command above does not pin a cudatoolkit version; you might have to add one that matches your system's CUDA version.

Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that in order to run the code the path of the relevant .yaml config file needs to be supplied using the -c parameter.

Parameter	Description
-c	path to dataset configuration file
-rm	running mode (train/eval)
-ws	window size
-bs	training batch size per GPU
-ebs	eval batch size per GPU (if not provided, training batch size is used)
-ng	number of GPUs to run on
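To make the table concrete, here is a small sketch of how such flags could be declared with `argparse`. It is an illustration only and not the repo's actual parser: the `build_parser` and `parse_args` helpers and the destination names (`config_path`, `running_mode`, and so on) are assumptions made for this example. It also shows the documented fallback of the eval batch size to the training batch size.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags in the table above."""
    parser = argparse.ArgumentParser(description="Illustrative MTTR-style command line")
    parser.add_argument("-c", dest="config_path", required=True,
                        help="path to dataset configuration file")
    parser.add_argument("-rm", dest="running_mode", choices=["train", "eval"],
                        required=True, help="running mode")
    parser.add_argument("-ws", dest="window_size", type=int, help="window size")
    parser.add_argument("-bs", dest="batch_size", type=int,
                        help="training batch size per GPU")
    parser.add_argument("-ebs", dest="eval_batch_size", type=int,
                        help="eval batch size per GPU")
    parser.add_argument("-ng", dest="num_gpus", type=int,
                        help="number of GPUs to run on")
    return parser


def parse_args(argv):
    """Parse argv and apply the documented default for the eval batch size."""
    args = build_parser().parse_args(argv)
    if args.eval_batch_size is None:
        # If -ebs is not provided, the training batch size is used.
        args.eval_batch_size = args.batch_size
    return args


if __name__ == "__main__":
    example = ["-rm", "eval", "-c", "configs/a2d_sentences.yaml",
               "-ws", "10", "-bs", "3", "-ng", "2"]
    print(parse_args(example))
```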

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

Window Size	Command	Checkpoint File	mAP Result
10	`python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	46.1
8	`python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	44.7

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is automatically loaded during the evaluation process with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.
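The role of the seed can be illustrated with a short sketch. This is not the repo's sampling code: the `sample_frame_indices` function is a hypothetical name, and drawing distinct indices uniformly at random is an assumption about what "uniformly sample" means here. The point is only that a fixed seed yields the same three frames on every run, which is what storing the sampled frames in a metadata file guarantees across machines.

```python
import random


def sample_frame_indices(num_frames, num_samples=3, seed=0):
    """Draw distinct frame indices uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    count = min(num_samples, num_frames)  # short videos cannot yield more frames than they have
    return sorted(rng.sample(range(num_frames), count))


if __name__ == "__main__":
    first = sample_frame_indices(40, seed=42)
    second = sample_frame_indices(40, seed=42)
    print(first, first == second)  # the same seed gives the same indices
    print(sample_frame_indices(40, seed=7))  # a different seed may give other frames
```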

Window Size	Command	Checkpoint File	mAP Result
10	`python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	39.2
8	`python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	36.6

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of the Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip needs to be uploaded to the competition server for evaluation. For your convenience, we supply this zip file here as well.

Window Size	Command	Checkpoint File	Output Zip	J&F Result
12	`python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8`	Link	Link	55.32

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in the Video Swin Transformer repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.
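One reason these parameters can matter is the total number of samples per optimization step. The sketch below assumes standard data-parallel training, in which every GPU processes its own batch of size `-bs`, so the global batch size is `-bs` times `-ng`; this assumption and the `global_batch_size` helper are introduced here for illustration. The `(-bs, -ng)` pairs are taken from the training commands in this section.

```python
def global_batch_size(per_gpu_batch_size, num_gpus):
    """Samples per optimization step, assuming each GPU processes its own batch."""
    return per_gpu_batch_size * num_gpus


# (-bs, -ng) pairs from the training commands in this section.
CONFIGS = {
    "A2D-Sentences, window size 10": (3, 2),
    "A2D-Sentences, window size 8": (2, 3),
    "Refer-YouTube-VOS, window size 12": (1, 4),
    "Refer-YouTube-VOS, window size 8": (1, 8),
}

if __name__ == "__main__":
    for name, (bs, ng) in CONFIGS.items():
        print(f"{name}: global batch size {global_batch_size(bs, ng)}")
```

Under this assumption, running a command with fewer GPUs and the same `-bs` shrinks the global batch size, which may be one source of differences in the final results.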

A2D-Sentences

The command for the following configuration was tested on 2 A6000 48GB GPUs:

Window Size	Command
10	`python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2`

The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:

Window Size	Command
8	`python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3`

Refer-YouTube-VOS

The command for the following configuration was tested on 4 A6000 48GB GPUs:

Window Size	Command
12	`python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4`

The command for the following configuration was tested on 8 RTX 3090 24GB GPUs:

Window Size	Command
8	`python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8`

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments).

JHMDB-Sentences

As explained in our paper, JHMDB-Sentences is used exclusively for evaluation, so training is not supported at this time for this dataset.

Comments

Does the number of GPUs affect the final results?

Hello, thanks for your great work and publication.

I tried to train on Refer-YouTube-VOS with two Titan RTX GPUs using the command python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 2, and the best result I got was 53.68 J&F, not 54.56. Since I only changed the number of GPUs, does the number of GPUs matter? Or are there any configurations I should change?

opened by colorblank 4

Update a2d_sentences_dataset.py

It avoids an h5py file lock issue in which multiple processes cannot read the same file.

  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

  File "h5py/h5f.pyx", line 108, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = '/path/to/dataset/text_annotations/a2d_annotation_with_instances/-AyZV127fCg/00040.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)

opened by sunggukcha 3

Can you tell me how many stages there are in your work?

Hello, many thanks for your work and publication. Can you tell me how many training stages there are in your work? I want to know whether, when training a model, I should use only one of the A2D-Sentences and JHMDB-Sentences datasets, or split training into two stages: first train on the A2D-Sentences dataset, and then train on the JHMDB-Sentences dataset. Thank you.

opened by sutiankang 1

Can you provide a visualization script?

Hello, many thanks for your work and publication.

Can you provide a visualization script? I am trying to write one and am stuck on how to handle the 50 queries. Can you explain, or point me to, how to choose which of the 50 predictions should be used?

opened by sunggukcha 1

Predictions for one video containing 2-3 actions

Well, let me start by thanking you for this great work. It's awesome.

Suppose, for example, I have a 20s-long video in which several actions take place for 4-5s each. Is the model capable of predicting them as well? I mean, models usually assign one particular category to a video, but this case is different, so I was wondering whether it can produce predictions when a single video contains several actions. Thank you.

opened by bit-scientist 1

add a script for automatic dataset preparation & some repo links about RVOS and visual grounding

23/12/2021

We add a script for automatic dataset preparation.🚀

15/12/2021

We add some repos related to RVOS or visual grounding for reference at the bottom of this page!🤗

opened by JerryX1110 0

The configuration of MTTR when the window size is 8

I noticed that the paper does not include the hyperparameter configuration of MTTR for a window size of 8. Can you please share these hyperparameters?

opened by RobertLuo1 0

ValueError: matrix contains invalid numeric entries

I trained your model on A2D-Sentences for 24 hours on two 3090 GPUs, but an error interrupted the training at epoch 27.

  File "/home/users2/xjl/anaconda3/envs/mttr/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/users2/xjl/workspace/MTTR/models/matcher.py", line 81, in forward
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(num_traj_per_batch, -1))]
  File "/home/users2/xjl/workspace/MTTR/models/matcher.py", line 81, in <listcomp>
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(num_traj_per_batch, -1))]
ValueError: matrix contains invalid numeric entries

opened by xiejialong 0

Does it really work?

Hello,

First, thanks for sharing the code. I tried to estimate the segmentation masks with queries other than those you provided for each video (for this I used your demonstration on Google Colab). However, the result was very disappointing. Whatever query I use for the videos, the network still returns the segmentation masks related to the original queries. Here is just one example of the many I tried with your network. A random frame of the original video:

The original queries provided for this video: 'man in red shirt playing tennis', 'white tennis racket held by a man in a red shirt'. The output mask generated by the network for the first query 'man in red shirt playing tennis':

Now if we change the queries to: 'a man wearing a black shirt sitting on the side', 'a man standing at the back of tennis court' the result for the first query 'a man wearing a black shirt sitting on the side'

Now, we change the queries to: 'referee of a tennis match', 'the wall at the end of a tennis court' the output mask from the network for the first query 'referee of a tennis match'

In fact, whatever queries you provide to the network, the output mask is related to one of those original queries originally assigned to the video. If we say "a kid in Disneyland" it returns one of these two players, "a lady reports for a tv show" it still returns one of these two players.

opened by Mathilda88 0

An error occurs when training on Refer-YouTube-VOS

I follow your Environment Installation. But when executing the following line https://github.com/mttr2021/MTTR/blob/c383c5b151e3c97aeb45cd2fb4bf08719016498b/trainer.py#L175

an error occurs.

Traceback (most recent call last):
  File "/media/ssd/users/xxx/software/anaconda3/envs/mttr0/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/media/ssd/users/xxx/projects/MTTR/main.py", line 20, in run
    trainer.train()
  File "/media/ssd/users/xxx/projects/MTTR/trainer.py", line 175, in train
    self.lr_scheduler.step(total_epoch_loss)  # note that this loss is synced across all processes
  File "/media/ssd/users/xxx/software/anaconda3/envs/mttr0/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 164, in step
    self.print_lr(self.verbose, i, lr, epoch)
  File "/media/ssd/users/xxx/software/anaconda3/envs/mttr0/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 113, in print_lr
    print('Epoch {:5d}: adjusting learning rate'
ValueError: Unknown format code 'd' for object of type 'float'

Did you encounter this bug when training on Refer-YouTube-VOS?

If I change this line to

            self.lr_scheduler.step()  # note that this loss is synced across all processes

will it influence the result?

opened by hoyeYang 0
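The error message in the traceback above can be reproduced without PyTorch. The sketch below assumes that the float passed to `step()` ends up being formatted as an integer epoch number; that interpretation is an assumption based on the `print('Epoch {:5d}: ...` line in the traceback, not something confirmed in this thread. The `format_epoch` helper is a hypothetical name used only for this illustration.

```python
def format_epoch(epoch):
    """Format a value with the integer format code that appears in the traceback."""
    return "Epoch {:5d}: adjusting learning rate".format(epoch)


if __name__ == "__main__":
    print(format_epoch(3))  # an integer epoch formats without problems
    try:
        format_epoch(0.75)  # a float, such as a loss value, is rejected by the 'd' code
    except ValueError as err:
        print("ValueError:", err)
```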

About einsum function in mttr.py

Thanks for generously sharing these wonderful research results! However, I have a problem reproducing them.

I don't understand how the tensor shapes change in the einsum function, or why this operation is performed.

Sorry, I couldn't find any details about this in the paper. Looking forward to your reply.

opened by kaizhang1215 0

Can you tell me how I should configure the checkpoint path, please?

I tried a command like this: "python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 1 -ckpt archive -ng 1", but got the error "PermissionError: [Errno 13] Permission denied: 'archive'".

I would be very grateful for a timely reply.

opened by UreisenI 1

