[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Overview

Website | STVG Demo | Paper


This repository provides the code for our paper. This includes:

  • Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
  • Training scripts and pretrained checkpoints
  • Evaluation scripts and demo

Setup

Download FFmpeg and add it to the PATH environment variable. The code was tested with version ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
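
Optionally, you can check from the new environment that FFmpeg is visible on the PATH and that PyTorch sees your GPUs; a minimal sketch (not part of the original instructions):

import shutil
import subprocess

import torch

# FFmpeg must be discoverable on the PATH for video decoding.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print(subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True).stdout.splitlines()[0])

# Multi-GPU training assumes CUDA is available.
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())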

Data Downloading

Set up the paths where you are going to download the videos and annotations in the config json files.

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers. Then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder video containing the unzipped video folders. The vidstg_ann_path folder should contain both VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder video containing the unzipped video folders. The hcstvg_ann_path folder should contain both HC-STVG1 and HC-STVG2.0 annotations.
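
These locations are read from the config json files (config/vidstg.json and config/hcstvg.json). Below is a minimal sketch of how one might set them programmatically; the key names (vidstg_vid_path, vidstg_ann_path, hcstvg_vid_path, hcstvg_ann_path) follow the folder names used above and should be checked against the actual config files, and the /data/... locations are placeholders:

import json

# Placeholder locations; point these at wherever you downloaded the data.
updates = {
    "config/vidstg.json": {"vidstg_vid_path": "/data/vidstg", "vidstg_ann_path": "/data/vidstg_annotations"},
    "config/hcstvg.json": {"hcstvg_vid_path": "/data/hcstvg", "hcstvg_ann_path": "/data/hcstvg_annotations"},
}
for cfg_file, new_paths in updates.items():
    with open(cfg_file) as f:
        cfg = json.load(f)
    cfg.update(new_paths)  # only the path entries change, the rest of the config is kept
    with open(cfg_file, "w") as f:
        json.dump(cfg, f, indent=2)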

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download the pretrained RoBERTa tokenizer and model weights into the TRANSFORMERS_CACHE folder, the pretrained ResNet-101 model weights into the TORCH_HOME folder, and the MDETR pretrained model weights (ResNet-101 backbone) into the current folder.
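
A minimal sketch of how the RoBERTa and ResNet-101 weights can be fetched ahead of time, assuming the roberta-base checkpoint (the variant used by MDETR) and the standard torchvision ResNet-101 weights; set the cache folders before importing the libraries so the files land where the training code expects them:

import os

# Cache locations are controlled by these environment variables (placeholders below).
os.environ.setdefault("TRANSFORMERS_CACHE", "/path/to/transformers_cache")
os.environ.setdefault("TORCH_HOME", "/path/to/torch_home")

import torchvision
from transformers import RobertaModel, RobertaTokenizerFast

# RoBERTa tokenizer and encoder weights are downloaded into TRANSFORMERS_CACHE.
RobertaTokenizerFast.from_pretrained("roberta-base")
RobertaModel.from_pretrained("roberta-base")

# ImageNet-pretrained ResNet-101 weights are downloaded into TORCH_HOME.
torchvision.models.resnet101(pretrained=True)

The MDETR checkpoint (pretrained_resnet101_checkpoint.pth) referenced by --load still has to be downloaded manually into the current folder.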

VidSTG

To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0

To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1

To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR

Baselines

  • To remove time encoding, add --no_time_embed.
  • To remove the temporal self-attention in the space-time decoder, add --no_tsa.
  • To train from ImageNet initialization, pass an empty string to the argument --load and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG or --epochs=60 --lr_drop=60 for HC-STVG1.
  • To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
  • To train with a different spatial resolution (e.g. res=352) or temporal stride (e.g. k=4), add --resolution=352 or --stride=4.
  • To train with the slow-only variant, add --no_fast.
  • To train with alternative designs for the fast branch, add --fast=VARIANT.
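
For instance, a slow-only ablation at resolution 224 on VidSTG would combine the flags above with the VidSTG training command (shown here only as an illustration of how the options compose):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --no_fast --resolution=224 --output-dir=OUTPUT_DIR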

Available Checkpoints

Training data              | parameters   | url   | size
MDETR init + VidSTG        | k=4 res=352  | Drive | 3.0GB
MDETR init + VidSTG        | k=2 res=224  | Drive | 3.0GB
ImageNet init + VidSTG     | k=4 res=352  | Drive | 3.0GB
MDETR init + HC-STVG2.0    | k=4 res=352  | Drive | 3.0GB
MDETR init + HC-STVG2.0    | k=2 res=224  | Drive | 3.0GB
MDETR init + HC-STVG1      | k=4 res=352  | Drive | 3.0GB
ImageNet init + HC-STVG1   | k=4 res=352  | Drive | 3.0GB

Evaluation

For evaluation only, simply run the same commands as for training with --resume=CHECKPOINT --eval. To run the evaluation on the test set, add --test (in this case predictions and attention weights are also saved).
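
For example, evaluating a VidSTG checkpoint on the validation set reuses the VidSTG training command with these two extra flags:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR --resume=CHECKPOINT --eval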

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH, optionally with START and END timestamps) given a natural language query of your choice (CAPTION) with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH

Note that we also host an online demo at this link, the code of which is available at server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
  title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
  author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Comments
  • KeyError:'model' in main.py 552

    Following the instructions in the readme file, I downloaded the checkpoint file from the download link found on the official PyTorch website. After loading it, I could not find the key "model" or "model_ema" in the checkpoint. The download link is https://download.pytorch.org/models/resnet101-63fe2227.pth

    The checkpoint contains only the plain ResNet-101 backbone state dict (conv1.weight, bn1.*, layer1.* through layer4.* with their conv/bn/downsample weights, and fc.weight / fc.bias), with no top-level "model" or "model_ema" entry.

    opened by Swt2000 4
  • hyper-parameters change

    Thank you for your work! How did you determine the hyper-parameters epochs=20 and batch size=16 used for training on the HC-STVG2.0 dataset? Does changing these parameters have a big impact on performance? Have you tried training for more epochs?

    opened by Ryan-Wu-13 3
  • Any plan on applying it to Action tube detection

    Hi great work!

    Thanks for sharing the code. Do you have any plan to apply it to the action tube detection problem? I guess we would have to strip off the text encoder.

    Best Gurkirt

    opened by gurkirt 3
  • Incorrect viou metric calculation

    Hi,

    I found a bug in viou metric calculation.

    Here, the max_end is actually a min_end. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L120 https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/vidstg_eval.py#L116

    Then, the length of union_predgt is shorter. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L137-L141

    Then, the calculated viou is much higher than the correct one. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L181

    opened by zanglam 2
  • AssertionError: Caught AssertionError in DataLoader worker process 1.

    I ran this on 4 x 3090 (24G) GPUs, but the data around 200-300 seem to trigger an error.

    AssertionError: Caught AssertionError in DataLoader worker process 1.
    Original Traceback (most recent call last):
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 219, in __getitem__
        return self.datasets[dataset_idx][sample_idx]
      File "/home/Newdisk/zhangzp/TubeDETR/TubeDETR/datasets/vidstg.py", line 116, in __getitem__
        assert len(images_list) == len(frame_ids)
    AssertionError

    Killing subprocess 2844448
    Killing subprocess 2844449
    Killing subprocess 2844450
    Killing subprocess 2844451
    Traceback (most recent call last):
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
        main()
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/home/zhangzp/anaconda3/envs/tubedetr_env/bin/python', '-u', 'main.py', '--ema', '--load=pretrained_resnet101_checkpoint.pth', '--combine_datasets=vidstg', '--combine_datasets_val=vidstg', '--dataset_config', 'config/vidstg.json', '--output-dir=Vidstg_train']' returned non-zero exit status 1.

    opened by johnbager 1
  • Pretrained models' performance doesn't match the result

    Hi, I downloaded the checkpoint pretrained on HC-STVG2.0, but the result is: viou: 0.3555, vIoU@0.3: 0.5675, vIoU@0.5: 0.3000. I also find that the loss is larger than 25, and the loss at epoch 0 is almost 58. I have changed the stride and resolution to match the checkpoint's training configuration. Did I miss something?

    opened by ykxixi 1
  • About m_sIoU

    Hi, thank you for your excellent work! I have a question about the m_sIoU reported in your paper. We can estimate the spatial grounding accuracy inside the predicted time span (t_s, t_e) by computing m_vIoU / m_tIoU. But I observed that in your model, m_sIoU << m_vIoU / m_tIoU (e.g., for HC-STVG2.0 with resolution 352 and temporal stride 4, m_sIoU = 0.649 while m_vIoU / m_tIoU = 0.467 / 0.539 = 0.866). It means that for the frames that are not in the predicted time span (t_s, t_e), the IoU between the predicted bounding boxes and the ground-truth boxes is very low. This is quite interesting to me. Could you provide some analysis/explanation of it?

    opened by zanglam 1
  • Bump pillow from 9.0.1 to 9.3.0

    Bumps pillow from 9.0.1 to 9.3.0.

    Release notes

    Sourced from pillow's releases.

    9.3.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.3.0 (2022-10-29)

    • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

    • Initialize libtiff buffer when saving #6699 [radarhere]

    • Inline fname2char to fix memory leak #6329 [nulano]

    • Fix memory leaks related to text features #6330 [nulano]

    • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

    • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

    • Fixed set_variation_by_name offset #6445 [radarhere]

    • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

    • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

    • Added ExifTags enums #6630 [radarhere]

    • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

    • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

    • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

    • Added GPS TIFF tag info #6661 [radarhere]

    • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

    • Do not attempt normalization if mode is already normal #6644 [radarhere]

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
  • Bump pillow from 8.4.0 to 9.0.1

    Bumps pillow from 8.4.0 to 9.0.1.

    Release notes

    Sourced from pillow's releases.

    9.0.1

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

    Changes

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk]
    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    9.0.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.0.1 (2022-02-03)

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    9.0.0 (2022-01-02)

    • Restrict builtins for ImageMath.eval(). CVE-2022-22817 #5923 [radarhere]

    • Ensure JpegImagePlugin stops at the end of a truncated file #5921 [radarhere]

    • Fixed ImagePath.Path array handling. CVE-2022-22815, CVE-2022-22816 #5920 [radarhere]

    • Remove consecutive duplicate tiles that only differ by their offset #5919 [radarhere]

    • Improved I;16 operations on big endian #5901 [radarhere]

    • Limit quantized palette to number of colors #5879 [radarhere]

    • Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [radarhere]

    • When saving RGBA to GIF, make use of first transparent palette entry #5859 [radarhere]

    • Pass SAMPLEFORMAT to libtiff #5848 [radarhere]

    • Added rounding when converting P and PA #5824 [radarhere]

    • Improved putdata() documentation and data handling #5910 [radarhere]

    • Exclude carriage return in PDF regex to help prevent ReDoS #5912 [hugovk]

    • Fixed freeing pointer in ImageDraw.Outline.transform #5909 [radarhere]

    ... (truncated)

    Commits
    • 6deac9e 9.0.1 version bump
    • c04d812 Update CHANGES.rst [ci skip]
    • 4fabec3 Added release notes for 9.0.1
    • 02affaa Added delay after opening image with xdg-open
    • ca0b585 Updated formatting
    • 427221e In show_file, use os.remove to remove temporary images
    • c930be0 Restrict builtins within lambdas for ImageMath.eval
    • 75b69dd Dont need to pin for GHA
    • cd938a7 Autolink CWE numbers with sphinx-issues
    • 2e9c461 Add CVE IDs
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 0
  • Bump numpy from 1.21.4 to 1.22.0

    Bumps numpy from 1.21.4 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
  • Problem about weight initialization using DDP

    Hi Antoine, it seems that you set a different seed for each rank before building the model. This may lead to different parameter initialization for the model replica on each rank. Is it a mistake or a deliberate design?

    Here is a comment from the PyTorch Lightning DDP advice:

    Setting all the random seeds to the same value. This is important in a distributed training setting. Each rank will get its own set of initial weights. If they don't match up, the gradients will not match either, leading to training that may not converge.

    """starts from main.py line 347"""
    # fix the seed for reproducibility
    seed = args.seed + dist.get_rank()
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    # torch.set_deterministic(True)
    torch.use_deterministic_algorithms(True)
    
    # Build the model
    model, criterion, weight_dict = build_model(args)
    model.to(device)
    
    opened by K-Nick 2
  • Problem with dataset Download

    Hello, many of the VidSTG dataset links on Baidu no longer work. Part1, Part2 and Part4 cannot be downloaded. Could you please share the dataset another way?

    opened by Xiyu-AI 1
  • Training error in tubedetr.py file.

    I tried to train the network on the HC-STVGv2 dataset using the command provided in the README.md file:

    python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --ema \
      --load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
      --v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=output --batch_size=8
    

    Unfortunately, I encountered this issue in models/tubedetr.py line 180

      File "/root/paddlejob/workspace/STVG/TubeDETR/models/tubedetr.py", line 180, in forward                                                                                 
        tpad_src = tpad_src.view(b * n_clips, f, h, w)                                                                                                                        
    RuntimeError: shape '[160, 256, 7, 12]' is invalid for input of size 2817024
    

    Besides, the durations of the eight samples are: [100, 100, 69, 100, 65, 86, 100, 100].

    I think this problem is probably related to the padding approach. Do you have any clue about this bug and how to fix it? Thank you very much!

    opened by OliverHxh 2
Owner

Antoine Yang, PhD Student in Computer Vision at Inria Paris