End-to-End Referring Video Object Segmentation with Multimodal Transformers

Overview

This repo contains the official implementation of the paper:


End-to-End Referring Video Object Segmentation with Multimodal Transformers

[Preview video: MTTR_preview.mp4]

How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files) 
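
If you want to sanity-check the layout before running anything, a short script along the following lines can help. This is an illustrative snippet rather than part of the repo; check_a2d_layout is a hypothetical helper and the paths simply mirror the tree above (it can be adapted in the same way to the JHMDB-Sentences and Refer-YouTube-VOS layouts shown below).

```python
# Illustrative only (not part of the repo): verify the expected A2D-Sentences layout.
from pathlib import Path

def check_a2d_layout(root="a2d_sentences"):
    root = Path(root)
    expected = [
        root / "Release" / "videoset.csv",
        root / "Release" / "CLIPS320",
        root / "text_annotations" / "a2d_annotation.txt",
        root / "text_annotations" / "a2d_missed_videos.txt",
        root / "text_annotations" / "a2d_annotation_with_instances",
    ]
    for path in expected:
        status = "OK     " if path.exists() else "MISSING"
        print(f"[{status}] {path}")

if __name__ == "__main__":
    check_a2d_layout()
```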

JHMDB-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
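
To get a feel for the text annotations, the snippet below peeks at the training expressions. It assumes the standard Refer-YouTube-VOS meta_expressions.json structure (a top-level "videos" dictionary mapping each video id to its "expressions" and "frames"); adjust the keys if your copy of the file differs.

```python
# Illustrative only: inspect a few referring expressions from the training split.
import json

with open("refer_youtube_vos/meta_expressions/train/meta_expressions.json") as f:
    meta = json.load(f)

videos = meta["videos"]                       # assumed top-level key
print(f"{len(videos)} annotated videos")

video_id, video = next(iter(videos.items()))  # grab an arbitrary video
print("example video:", video_id, "| frames:", len(video["frames"]))
for exp_id, exp in list(video["expressions"].items())[:3]:
    print(f"  expression {exp_id}: {exp['exp']}")
```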

Environment Installation

The code was tested on a Conda environment installed on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

  • PyTorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that you might have to add or adjust a cudatoolkit package version in the command above (e.g. cudatoolkit=11.3) according to your system's CUDA version.

  • Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

  • COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

  • Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy
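
After installation, a quick sanity check such as the one below (an illustrative snippet, not part of the repo) confirms that the key packages import correctly and that CUDA is visible to PyTorch:

```python
# Illustrative environment check for the packages installed above.
import torch
import torchvision
import transformers
import h5py
import cv2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)   # expected: 4.11.3
print("h5py:", h5py.__version__)
print("opencv-python:", cv2.__version__)
```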

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that in order to run the code the path of the relevant .yaml config file needs to be supplied using the -c parameter.

| Command | Description |
| --- | --- |
| -c | path to dataset configuration file |
| -rm | running mode (train/eval) |
| -ws | window size |
| -bs | training batch size per GPU |
| -ebs | eval batch size per GPU (if not provided, the training batch size is used) |
| -ng | number of GPUs to run on |
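
For intuition, the sketch below shows how flags like these could be wired up with argparse. It is not the repo's actual main.py, and the long option names (--config_path, --running_mode, etc.) are assumptions made for illustration; -ckpt is included because the evaluation commands below use it.

```python
# Illustrative sketch of the command-line interface described above (not the repo's main.py).
import argparse

parser = argparse.ArgumentParser(description="MTTR-style runner (illustrative)")
parser.add_argument("-c", "--config_path", required=True,
                    help="path to dataset configuration file")
parser.add_argument("-rm", "--running_mode", choices=["train", "eval"], required=True,
                    help="running mode (train/eval)")
parser.add_argument("-ws", "--window_size", type=int, help="window size")
parser.add_argument("-bs", "--batch_size", type=int, help="training batch size per GPU")
parser.add_argument("-ebs", "--eval_batch_size", type=int,
                    help="eval batch size per GPU (defaults to the training batch size)")
parser.add_argument("-ng", "--num_gpus", type=int, help="number of GPUs to run on")
parser.add_argument("-ckpt", "--checkpoint_path", help="checkpoint file to evaluate")
args = parser.parse_args()

if args.eval_batch_size is None:
    args.eval_batch_size = args.batch_size
```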

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

| Window Size | Command | Checkpoint File | mAP Result |
| --- | --- | --- | --- |
| 10 | `python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 46.1 |
| 8 | `python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 44.7 |

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is automatically loaded during the evaluation process with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.
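
As a rough illustration of the uniform sampling described above (the exact frames behind our reported numbers come from the metadata file, and the repo's own sampling code may differ in detail), one way to pick three frames spread evenly across a video is:

```python
# Illustrative frame sampling: one frame from each third of the video.
# The actual evaluation uses the indices stored in
# datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json.
import numpy as np

def uniform_sample_indices(num_frames, num_samples=3, seed=None):
    rng = np.random.default_rng(seed)
    edges = np.linspace(0, num_frames, num_samples + 1, dtype=int)  # chunk boundaries
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

print(uniform_sample_indices(num_frames=40, seed=0))  # three indices, one per chunk
```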

| Window Size | Command | Checkpoint File | mAP Result |
| --- | --- | --- | --- |
| 10 | `python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 39.2 |
| 8 | `python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2` | Link | 36.6 |

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip needs to be uploaded to the competition server for evaluation. For your convenience we supply this zip file here as well.

| Window Size | Command | Checkpoint File | Output Zip | J&F Result |
| --- | --- | --- | --- | --- |
| 12 | `python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8` | Link | Link | 55.32 |
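
For reference, packaging predicted masks into a zip can be done roughly as follows. This is only a sketch: the exact folder layout inside the repo's submission_epoch_0.zip is produced by the evaluation code itself, and zip_predictions is a hypothetical helper.

```python
# Illustrative sketch: zip predicted .png masks while preserving their relative paths.
import zipfile
from pathlib import Path

def zip_predictions(pred_dir, out_path="submission.zip"):
    pred_dir = Path(pred_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for png in sorted(pred_dir.rglob("*.png")):
            zf.write(png, str(png.relative_to(pred_dir)))
    print("wrote", out_path)

# Example usage (path is illustrative):
# zip_predictions("path/to/predicted_masks")
```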

Training

First, download the Kinetics-400 pretrained weights of the Video Swin Transformer from this link. Note that these weights were originally published in the Video Swin Transformer's original repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.
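
Before training, it can be worth checking that the downloaded checkpoint loads cleanly. The snippet below is illustrative; whether the weights sit under a 'state_dict' key depends on how the checkpoint was published, so it is handled defensively.

```python
# Illustrative check that the pretrained Video Swin weights load and look sane.
import torch

ckpt_path = "pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints nest weights under 'state_dict'

print(f"{len(state_dict)} entries")
for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```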

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.

A2D-Sentences

  • The command for the following configuration was tested on 2 A6000 48GB GPUs:

    | Window Size | Command |
    | --- | --- |
    | 10 | `python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2` |

  • The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:

    | Window Size | Command |
    | --- | --- |
    | 8 | `python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3` |

Refer-YouTube-VOS

  • The command for the following configuration was tested on 4 A6000 48GB GPUs:

    | Window Size | Command |
    | --- | --- |
    | 12 | `python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4` |

  • The command for the following configuration was tested on 8 RTX 3090 24GB GPUs:

    | Window Size | Command |
    | --- | --- |
    | 8 | `python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8` |

Note that this last configuration was not mentioned in our paper. However, it is more memory-efficient than the original configuration (window size 12) while producing a model that is almost as good (J&F of 54.56 in our experiments).

JHMDB-Sentences

As explained in our paper, JHMDB-Sentences is used exclusively for evaluation, so training on this dataset is not currently supported.

Comments
  • Does the GPU number affect the final results?

    Hello, thanks for your great work and publication.

    I tried to train on Refer-YouTube-VOS on two Titan RTX GPUs with a command like python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 2, and the best result I got was 53.68 J&F, not 54.56. Since I only changed the number of GPUs, does the GPU count matter? Or are there any configurations I should change?

    opened by colorblank 4
  • Update a2d_sentences_dataset.py

    It avoids an h5py file-lock issue in which multiple processes cannot read the same file:

      File "h5py/h5f.pyx", line 88, in h5py.h5f.open
    OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
    
      File "h5py/h5f.pyx", line 108, in h5py.h5f.create
    OSError: Unable to create file (unable to open file: name = '/path/to/dataset/text_annotations/a2d_annotation_with_instances/-AyZV127fCg/00040.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
    
    opened by sunggukcha 3
  • Can you tell me how many stages there are in your work

    Hello, great thanks for your work and publication. Can you tell me how many training stages there are in your work? I want to know whether, when training a model, you use only one of the A2D-Sentences and JHMDB-Sentences datasets, or divide training into two stages: first train on A2D-Sentences, and then train on JHMDB-Sentences. Thank you.

    opened by sutiankang 1
  • Can you provide a visualization script?

    Hello, great thanks for your work and publication.

    Can you provide a visualization script? I am trying and got stuck dealing with the 50 queries. Can you explain, or point to a reference on, how to choose which of the 50 predictions should be used?

    opened by sunggukcha 1
  • Predictions for one video containing 2-3 actions

    Well, let me start by thanking you for this great work. It's awesome.

    If, for example, I have a 20s-long video where several actions take place for 4-5s each, is the model capable of predicting them as well? Usually, models assign a single category to one video, but this case is different, so I was wondering whether the model can produce outputs when a single video containing several actions is shown. Thank you.

    opened by bit-scientist 1
  • add a script for automatic dataset preparation & some repo links about RVOS and visual grounding

    23/12/2021

    We add a script for automatic dataset preparation.🚀

    15/12/2021

    We add some repos related to RVOS or visual grounding for reference at the bottom of this page!🤗

    opened by JerryX1110 0
  • The configuration of MTTR when the window size is 8

    In the paper, I notice that the hyperparameter configuration of MTTR for window size 8 is missing. Can you please share the hyperparameters?

    opened by RobertLuo1 0
  • ValueError: matrix contains invalid numeric entries

    I trained your model on A2D-Sentences, which took 24 hours on 2 RTX 3090 GPUs, but an error interrupted training at epoch 27:

      File "/home/users2/xjl/anaconda3/envs/mttr/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/home/users2/xjl/workspace/MTTR/models/matcher.py", line 81, in forward
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(num_traj_per_batch, -1))]
      File "/home/users2/xjl/workspace/MTTR/models/matcher.py", line 81, in <listcomp>
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(num_traj_per_batch, -1))]
    ValueError: matrix contains invalid numeric entries

    opened by xiejialong 0
  • Does it really work?

    Hello,

    First, thanks for sharing the code. I tried to estimate segmentation masks with queries other than the ones you provided for each video (using your demonstration on Google Colab). However, the results were very disappointing. Whatever query I use for a video, the network still returns the segmentation masks related to the original queries. Here is just one example of the many I tried with your network. A random frame of the original video: org_tennis

    The original queries provided for this video: 'man in red shirt playing tennis', 'white tennis racket held by a man in a red shirt'. The output mask generated by the network for the first query, 'man in red shirt playing tennis': mask_1

    Now, if we change the queries to 'a man wearing a black shirt sitting on the side', 'a man standing at the back of tennis court', the result for the first query, 'a man wearing a black shirt sitting on the side', is: mask_2

    Next, we change the queries to 'referee of a tennis match', 'the wall at the end of a tennis court'; the output mask for the first query, 'referee of a tennis match', is: mask_3

    In fact, whatever queries you provide to the network, the output mask is related to one of the original queries assigned to the video. If we say "a kid in Disneyland" it returns one of the two players; "a lady reports for a tv show" still returns one of the two players.

    opened by Mathilda88 0
  • An error occurs when training on Refer-YouTube-VOS

    I followed your Environment Installation instructions, but when executing the following line https://github.com/mttr2021/MTTR/blob/c383c5b151e3c97aeb45cd2fb4bf08719016498b/trainer.py#L175

    an error occurs:

    Traceback (most recent call last):
      File "/media/ssd/users/xxx/software/anaconda3/envs/mttr0/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/media/ssd/users/xxx/projects/MTTR/main.py", line 20, in run
        trainer.train()
      File "/media/ssd/users/xxx/projects/MTTR/trainer.py", line 175, in train
        self.lr_scheduler.step(total_epoch_loss)  # note that this loss is synced across all processes
      File "/media/ssd/users/xxx/software/anaconda3/envs/mttr0/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 164, in step
        self.print_lr(self.verbose, i, lr, epoch)
      File "/media/ssd/users/xxx/software/anaconda3/envs/mttr0/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 113, in print_lr
        print('Epoch {:5d}: adjusting learning rate'
    ValueError: Unknown format code 'd' for object of type 'float'
    

    Did you meet this bug when training on Refer-YouTube-VOS?

    If I change this line to

                self.lr_scheduler.step()  # note that this loss is synced across all processes
    

    will it influence the result?

    opened by hoyeYang 0
  • About the einsum function in mttr.py

    Thanks for generously sharing this wonderful research! But I have a problem reproducing the results.

    I don't understand how the tensor shapes change in the einsum function, or why it is used.

    Sorry, I don't see any details about this in the paper; looking forward to your reply.

    opened by kaizhang1215 0
  • Can you tell me how I should configure the checkpoint path? Please...

    I tried a command like "python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 1 -ckpt archive -ng 1" but got an error: "PermissionError: [Errno 13] Permission denied: 'archive'".

    I would be very grateful if you could reply soon.

    opened by UreisenI 1