[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Antoine Yang

Last update: Dec 27, 2022

Related tags

Deep Learning video-understanding multimodal-learning vision-and-language visual-grounding spatio-temporal-video-grounding stvg vidstg hc-stvg

Overview

TubeDETR: Spatio-Temporal Video Grounding with Transformers

Website • STVG Demo • Paper

This repository provides the code for our paper. This includes:

Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
Training scripts and pretrained checkpoints
Evaluation scripts and demo

Setup

Download FFmpeg and add it to the PATH environment variable. The code was tested with version ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
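
Before preprocessing any video, it can help to confirm that the FFmpeg binary is actually visible on PATH from inside the new environment. The snippet below is a minimal sketch of such a check; the helper names find_ffmpeg and ffmpeg_version_line are made up for this example and are not part of the repository.

```python
import shutil
import subprocess
from typing import Optional


def find_ffmpeg(executable: str = "ffmpeg") -> Optional[str]:
    """Return the absolute path of the binary found on PATH, or None."""
    return shutil.which(executable)


def ffmpeg_version_line(path: str) -> str:
    """Return the first line printed by `<path> -version`."""
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    ffmpeg_path = find_ffmpeg()
    if ffmpeg_path is None:
        print("ffmpeg was not found on PATH")
    else:
        print(ffmpeg_version_line(ffmpeg_path))
```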

Data Downloading

Set up the paths where you are going to download videos and annotations in the config JSON files.

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers. Then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder video containing the unzipped video folders. The vidstg_ann_path folder should contain both VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder video containing the unzipped video folders. The hcstvg_ann_path folder should contain both HC-STVG1 and HC-STVG2.0 annotations.
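
The folder layout described above can be checked before running the preprocessing scripts. The following is a rough sketch that assumes the config JSON files store the four paths under the keys named in the text (vidstg_vid_path, vidstg_ann_path, hcstvg_vid_path, hcstvg_ann_path); the functions load_config and check_dataset_paths are hypothetical helpers written for this example.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_config(path: str) -> Dict[str, str]:
    """Read a dataset config JSON file into a dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def check_dataset_paths(config: Dict[str, str], prefix: str) -> List[str]:
    """Return a list of problems for one dataset (empty if none were found).

    `prefix` is "vidstg" or "hcstvg"; key names follow the README text.
    """
    problems = []
    vid_root = config.get(f"{prefix}_vid_path")
    ann_root = config.get(f"{prefix}_ann_path")
    if not vid_root:
        problems.append(f"{prefix}_vid_path is not set")
    elif not (Path(vid_root) / "video").is_dir():
        # The README expects a 'video' folder inside the video root.
        problems.append(f"{vid_root} has no 'video' subfolder")
    if not ann_root:
        problems.append(f"{prefix}_ann_path is not set")
    elif not Path(ann_root).is_dir():
        problems.append(f"{ann_root} is not a directory")
    return problems
```

For example, check_dataset_paths(load_config("config/vidstg.json"), "vidstg") returns an empty list when both folders are in place.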

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download the pretrained RoBERTa tokenizer and model weights into the TRANSFORMERS_CACHE folder. Download the pretrained ResNet-101 model weights into the TORCH_HOME folder. Download the MDETR pretrained model weights with ResNet-101 backbone into the current folder.

VidSTG: To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0: To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1: To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR

Baselines

To remove time encoding, add --no_time_embed.
To remove the temporal self-attention in the space-time decoder, add --no_tsa.
To train from ImageNet initialization, pass an empty string to --load and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5, together with --epochs=20 --lr_drop=20 for VidSTG or --epochs=60 --lr_drop=60 for HC-STVG1.
To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
To train with a different spatial resolution or temporal stride, set --resolution (e.g. --resolution=224) or --stride (e.g. --stride=5).
To train with the slow-only variant, add --no_fast.
To train with alternative designs for the fast branch, add --fast=VARIANT.
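
The training commands and the baseline flags above share most of their arguments, so they can be assembled programmatically. The sketch below only recombines flags quoted in this README; the function build_train_command and the dataset labels "vidstg", "hcstvg1" and "hcstvg2" are names invented for this example.

```python
import shlex
from typing import List, Sequence


def build_train_command(
    dataset: str,
    num_gpus: int,
    output_dir: str,
    extra_flags: Sequence[str] = (),
) -> List[str]:
    """Assemble a launch command from the Training section as an argv list.

    `extra_flags` holds optional baseline flags such as "--no_tsa".
    """
    per_dataset = {
        "vidstg": [
            "--combine_datasets=vidstg", "--combine_datasets_val=vidstg",
            "--dataset_config", "config/vidstg.json",
        ],
        "hcstvg2": [
            "--combine_datasets=hcstvg", "--combine_datasets_val=hcstvg",
            "--v2", "--dataset_config", "config/hcstvg.json", "--epochs=20",
        ],
        "hcstvg1": [
            "--combine_datasets=hcstvg", "--combine_datasets_val=hcstvg",
            "--dataset_config", "config/hcstvg.json",
            "--epochs=40", "--eval_skip=40",
        ],
    }
    if dataset not in per_dataset:
        raise ValueError(f"unknown dataset label: {dataset!r}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "--use_env", "main.py", "--ema",
        "--load=pretrained_resnet101_checkpoint.pth",
        *per_dataset[dataset],
        *extra_flags,
        f"--output-dir={output_dir}",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command for the "no temporal self-attention" baseline.
    print(shlex.join(build_train_command("vidstg", 4, "OUTPUT_DIR", ["--no_tsa"])))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting issues.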

Available Checkpoints

Training data	parameters	url	size
MDETR init + VidSTG	k=4 res=352	Drive	3.0GB
MDETR init + VidSTG	k=2 res=224	Drive	3.0GB
ImageNet init + VidSTG	k=4 res=352	Drive	3.0GB
MDETR init + HC-STVG2.0	k=4 res=352	Drive	3.0GB
MDETR init + HC-STVG2.0	k=2 res=224	Drive	3.0GB
MDETR init + HC-STVG1	k=4 res=352	Drive	3.0GB
ImageNet init + HC-STVG1	k=4 res=352	Drive	3.0GB

Evaluation

For evaluation only, simply run the same commands as for training with --resume=CHECKPOINT --eval. For this to be done on the test set, add --test (in this case predictions and attention weights are also saved).

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH with potential START and END timestamps) given the natural language query of your choice (CAPTION) with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH

Note that we also host an online demo at this link, the code of which is available at server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}}

Comments

KeyError:'model' in main.py 552

I downloaded the checkpoint file from the link on the official PyTorch website, following the instructions in the README. After loading it, I could not find the key "model" or "model_ema" in the checkpoint. The download link is https://download.pytorch.org/models/resnet101-63fe2227.pth

The checkpoint output is： conv1.weight bn1.running_mean bn1.running_var bn1.weight bn1.bias layer1.0.conv1.weight layer1.0.bn1.running_mean layer1.0.bn1.running_var layer1.0.bn1.weight layer1.0.bn1.bias layer1.0.conv2.weight layer1.0.bn2.running_mean layer1.0.bn2.running_var layer1.0.bn2.weight layer1.0.bn2.bias layer1.0.conv3.weight layer1.0.bn3.running_mean layer1.0.bn3.running_var layer1.0.bn3.weight layer1.0.bn3.bias layer1.0.downsample.0.weight layer1.0.downsample.1.running_mean layer1.0.downsample.1.running_var layer1.0.downsample.1.weight layer1.0.downsample.1.bias layer1.1.conv1.weight layer1.1.bn1.running_mean layer1.1.bn1.running_var layer1.1.bn1.weight layer1.1.bn1.bias layer1.1.conv2.weight layer1.1.bn2.running_mean layer1.1.bn2.running_var layer1.1.bn2.weight layer1.1.bn2.bias layer1.1.conv3.weight layer1.1.bn3.running_mean layer1.1.bn3.running_var layer1.1.bn3.weight layer1.1.bn3.bias layer1.2.conv1.weight layer1.2.bn1.running_mean layer1.2.bn1.running_var layer1.2.bn1.weight layer1.2.bn1.bias layer1.2.conv2.weight layer1.2.bn2.running_mean layer1.2.bn2.running_var layer1.2.bn2.weight layer1.2.bn2.bias layer1.2.conv3.weight layer1.2.bn3.running_mean layer1.2.bn3.running_var layer1.2.bn3.weight layer1.2.bn3.bias layer2.0.conv1.weight layer2.0.bn1.running_mean layer2.0.bn1.running_var layer2.0.bn1.weight layer2.0.bn1.bias layer2.0.conv2.weight layer2.0.bn2.running_mean layer2.0.bn2.running_var layer2.0.bn2.weight layer2.0.bn2.bias layer2.0.conv3.weight layer2.0.bn3.running_mean layer2.0.bn3.running_var layer2.0.bn3.weight layer2.0.bn3.bias layer2.0.downsample.0.weight layer2.0.downsample.1.running_mean layer2.0.downsample.1.running_var layer2.0.downsample.1.weight layer2.0.downsample.1.bias layer2.1.conv1.weight layer2.1.bn1.running_mean layer2.1.bn1.running_var layer2.1.bn1.weight layer2.1.bn1.bias layer2.1.conv2.weight layer2.1.bn2.running_mean layer2.1.bn2.running_var layer2.1.bn2.weight layer2.1.bn2.bias layer2.1.conv3.weight 
layer2.1.bn3.running_mean layer2.1.bn3.running_var layer2.1.bn3.weight layer2.1.bn3.bias layer2.2.conv1.weight layer2.2.bn1.running_mean layer2.2.bn1.running_var layer2.2.bn1.weight layer2.2.bn1.bias layer2.2.conv2.weight layer2.2.bn2.running_mean layer2.2.bn2.running_var layer2.2.bn2.weight layer2.2.bn2.bias layer2.2.conv3.weight layer2.2.bn3.running_mean layer2.2.bn3.running_var layer2.2.bn3.weight layer2.2.bn3.bias layer2.3.conv1.weight layer2.3.bn1.running_mean layer2.3.bn1.running_var layer2.3.bn1.weight layer2.3.bn1.bias layer2.3.conv2.weight layer2.3.bn2.running_mean layer2.3.bn2.running_var layer2.3.bn2.weight layer2.3.bn2.bias layer2.3.conv3.weight layer2.3.bn3.running_mean layer2.3.bn3.running_var layer2.3.bn3.weight layer2.3.bn3.bias layer3.0.conv1.weight layer3.0.bn1.running_mean layer3.0.bn1.running_var layer3.0.bn1.weight layer3.0.bn1.bias layer3.0.conv2.weight layer3.0.bn2.running_mean layer3.0.bn2.running_var layer3.0.bn2.weight layer3.0.bn2.bias layer3.0.conv3.weight layer3.0.bn3.running_mean layer3.0.bn3.running_var layer3.0.bn3.weight layer3.0.bn3.bias layer3.0.downsample.0.weight layer3.0.downsample.1.running_mean layer3.0.downsample.1.running_var layer3.0.downsample.1.weight layer3.0.downsample.1.bias layer3.1.conv1.weight layer3.1.bn1.running_mean layer3.1.bn1.running_var layer3.1.bn1.weight layer3.1.bn1.bias layer3.1.conv2.weight layer3.1.bn2.running_mean layer3.1.bn2.running_var layer3.1.bn2.weight layer3.1.bn2.bias layer3.1.conv3.weight layer3.1.bn3.running_mean layer3.1.bn3.running_var layer3.1.bn3.weight layer3.1.bn3.bias layer3.2.conv1.weight layer3.2.bn1.running_mean layer3.2.bn1.running_var layer3.2.bn1.weight layer3.2.bn1.bias layer3.2.conv2.weight layer3.2.bn2.running_mean layer3.2.bn2.running_var layer3.2.bn2.weight layer3.2.bn2.bias layer3.2.conv3.weight layer3.2.bn3.running_mean layer3.2.bn3.running_var layer3.2.bn3.weight layer3.2.bn3.bias layer3.3.conv1.weight layer3.3.bn1.running_mean layer3.3.bn1.running_var 
layer3.3.bn1.weight layer3.3.bn1.bias layer3.3.conv2.weight layer3.3.bn2.running_mean layer3.3.bn2.running_var layer3.3.bn2.weight layer3.3.bn2.bias layer3.3.conv3.weight layer3.3.bn3.running_mean layer3.3.bn3.running_var layer3.3.bn3.weight layer3.3.bn3.bias layer3.4.conv1.weight layer3.4.bn1.running_mean layer3.4.bn1.running_var layer3.4.bn1.weight layer3.4.bn1.bias layer3.4.conv2.weight layer3.4.bn2.running_mean layer3.4.bn2.running_var layer3.4.bn2.weight layer3.4.bn2.bias layer3.4.conv3.weight layer3.4.bn3.running_mean layer3.4.bn3.running_var layer3.4.bn3.weight layer3.4.bn3.bias layer3.5.conv1.weight layer3.5.bn1.running_mean layer3.5.bn1.running_var layer3.5.bn1.weight layer3.5.bn1.bias layer3.5.conv2.weight layer3.5.bn2.running_mean layer3.5.bn2.running_var layer3.5.bn2.weight layer3.5.bn2.bias layer3.5.conv3.weight layer3.5.bn3.running_mean layer3.5.bn3.running_var layer3.5.bn3.weight layer3.5.bn3.bias layer3.6.conv1.weight layer3.6.bn1.running_mean layer3.6.bn1.running_var layer3.6.bn1.weight layer3.6.bn1.bias layer3.6.conv2.weight layer3.6.bn2.running_mean layer3.6.bn2.running_var layer3.6.bn2.weight layer3.6.bn2.bias layer3.6.conv3.weight layer3.6.bn3.running_mean layer3.6.bn3.running_var layer3.6.bn3.weight layer3.6.bn3.bias layer3.7.conv1.weight layer3.7.bn1.running_mean layer3.7.bn1.running_var layer3.7.bn1.weight layer3.7.bn1.bias layer3.7.conv2.weight layer3.7.bn2.running_mean layer3.7.bn2.running_var layer3.7.bn2.weight layer3.7.bn2.bias layer3.7.conv3.weight layer3.7.bn3.running_mean layer3.7.bn3.running_var layer3.7.bn3.weight layer3.7.bn3.bias layer3.8.conv1.weight layer3.8.bn1.running_mean layer3.8.bn1.running_var layer3.8.bn1.weight layer3.8.bn1.bias layer3.8.conv2.weight layer3.8.bn2.running_mean layer3.8.bn2.running_var layer3.8.bn2.weight layer3.8.bn2.bias layer3.8.conv3.weight layer3.8.bn3.running_mean layer3.8.bn3.running_var layer3.8.bn3.weight layer3.8.bn3.bias layer3.9.conv1.weight layer3.9.bn1.running_mean layer3.9.bn1.running_var 
layer3.9.bn1.weight layer3.9.bn1.bias layer3.9.conv2.weight layer3.9.bn2.running_mean layer3.9.bn2.running_var layer3.9.bn2.weight layer3.9.bn2.bias layer3.9.conv3.weight layer3.9.bn3.running_mean layer3.9.bn3.running_var layer3.9.bn3.weight layer3.9.bn3.bias layer3.10.conv1.weight layer3.10.bn1.running_mean layer3.10.bn1.running_var layer3.10.bn1.weight layer3.10.bn1.bias layer3.10.conv2.weight layer3.10.bn2.running_mean layer3.10.bn2.running_var layer3.10.bn2.weight layer3.10.bn2.bias layer3.10.conv3.weight layer3.10.bn3.running_mean layer3.10.bn3.running_var layer3.10.bn3.weight layer3.10.bn3.bias layer3.11.conv1.weight layer3.11.bn1.running_mean layer3.11.bn1.running_var layer3.11.bn1.weight layer3.11.bn1.bias layer3.11.conv2.weight layer3.11.bn2.running_mean layer3.11.bn2.running_var layer3.11.bn2.weight layer3.11.bn2.bias layer3.11.conv3.weight layer3.11.bn3.running_mean layer3.11.bn3.running_var layer3.11.bn3.weight layer3.11.bn3.bias layer3.12.conv1.weight layer3.12.bn1.running_mean layer3.12.bn1.running_var layer3.12.bn1.weight layer3.12.bn1.bias layer3.12.conv2.weight layer3.12.bn2.running_mean layer3.12.bn2.running_var layer3.12.bn2.weight layer3.12.bn2.bias layer3.12.conv3.weight layer3.12.bn3.running_mean layer3.12.bn3.running_var layer3.12.bn3.weight layer3.12.bn3.bias layer3.13.conv1.weight layer3.13.bn1.running_mean layer3.13.bn1.running_var layer3.13.bn1.weight layer3.13.bn1.bias layer3.13.conv2.weight layer3.13.bn2.running_mean layer3.13.bn2.running_var layer3.13.bn2.weight layer3.13.bn2.bias layer3.13.conv3.weight layer3.13.bn3.running_mean layer3.13.bn3.running_var layer3.13.bn3.weight layer3.13.bn3.bias layer3.14.conv1.weight layer3.14.bn1.running_mean layer3.14.bn1.running_var layer3.14.bn1.weight layer3.14.bn1.bias layer3.14.conv2.weight layer3.14.bn2.running_mean layer3.14.bn2.running_var layer3.14.bn2.weight layer3.14.bn2.bias layer3.14.conv3.weight layer3.14.bn3.running_mean layer3.14.bn3.running_var layer3.14.bn3.weight layer3.14.bn3.bias 
layer3.15.conv1.weight layer3.15.bn1.running_mean layer3.15.bn1.running_var layer3.15.bn1.weight layer3.15.bn1.bias layer3.15.conv2.weight layer3.15.bn2.running_mean layer3.15.bn2.running_var layer3.15.bn2.weight layer3.15.bn2.bias layer3.15.conv3.weight layer3.15.bn3.running_mean layer3.15.bn3.running_var layer3.15.bn3.weight layer3.15.bn3.bias layer3.16.conv1.weight layer3.16.bn1.running_mean layer3.16.bn1.running_var layer3.16.bn1.weight layer3.16.bn1.bias layer3.16.conv2.weight layer3.16.bn2.running_mean layer3.16.bn2.running_var layer3.16.bn2.weight layer3.16.bn2.bias layer3.16.conv3.weight layer3.16.bn3.running_mean layer3.16.bn3.running_var layer3.16.bn3.weight layer3.16.bn3.bias layer3.17.conv1.weight layer3.17.bn1.running_mean layer3.17.bn1.running_var layer3.17.bn1.weight layer3.17.bn1.bias layer3.17.conv2.weight layer3.17.bn2.running_mean layer3.17.bn2.running_var layer3.17.bn2.weight layer3.17.bn2.bias layer3.17.conv3.weight layer3.17.bn3.running_mean layer3.17.bn3.running_var layer3.17.bn3.weight layer3.17.bn3.bias layer3.18.conv1.weight layer3.18.bn1.running_mean layer3.18.bn1.running_var layer3.18.bn1.weight layer3.18.bn1.bias layer3.18.conv2.weight layer3.18.bn2.running_mean layer3.18.bn2.running_var layer3.18.bn2.weight layer3.18.bn2.bias layer3.18.conv3.weight layer3.18.bn3.running_mean layer3.18.bn3.running_var layer3.18.bn3.weight layer3.18.bn3.bias layer3.19.conv1.weight layer3.19.bn1.running_mean layer3.19.bn1.running_var layer3.19.bn1.weight layer3.19.bn1.bias layer3.19.conv2.weight layer3.19.bn2.running_mean layer3.19.bn2.running_var layer3.19.bn2.weight layer3.19.bn2.bias layer3.19.conv3.weight layer3.19.bn3.running_mean layer3.19.bn3.running_var layer3.19.bn3.weight layer3.19.bn3.bias layer3.20.conv1.weight layer3.20.bn1.running_mean layer3.20.bn1.running_var layer3.20.bn1.weight layer3.20.bn1.bias layer3.20.conv2.weight layer3.20.bn2.running_mean layer3.20.bn2.running_var layer3.20.bn2.weight layer3.20.bn2.bias layer3.20.conv3.weight 
layer3.20.bn3.running_mean layer3.20.bn3.running_var layer3.20.bn3.weight layer3.20.bn3.bias layer3.21.conv1.weight layer3.21.bn1.running_mean layer3.21.bn1.running_var layer3.21.bn1.weight layer3.21.bn1.bias layer3.21.conv2.weight layer3.21.bn2.running_mean layer3.21.bn2.running_var layer3.21.bn2.weight layer3.21.bn2.bias layer3.21.conv3.weight layer3.21.bn3.running_mean layer3.21.bn3.running_var layer3.21.bn3.weight layer3.21.bn3.bias layer3.22.conv1.weight layer3.22.bn1.running_mean layer3.22.bn1.running_var layer3.22.bn1.weight layer3.22.bn1.bias layer3.22.conv2.weight layer3.22.bn2.running_mean layer3.22.bn2.running_var layer3.22.bn2.weight layer3.22.bn2.bias layer3.22.conv3.weight layer3.22.bn3.running_mean layer3.22.bn3.running_var layer3.22.bn3.weight layer3.22.bn3.bias layer4.0.conv1.weight layer4.0.bn1.running_mean layer4.0.bn1.running_var layer4.0.bn1.weight layer4.0.bn1.bias layer4.0.conv2.weight layer4.0.bn2.running_mean layer4.0.bn2.running_var layer4.0.bn2.weight layer4.0.bn2.bias layer4.0.conv3.weight layer4.0.bn3.running_mean layer4.0.bn3.running_var layer4.0.bn3.weight layer4.0.bn3.bias layer4.0.downsample.0.weight layer4.0.downsample.1.running_mean layer4.0.downsample.1.running_var layer4.0.downsample.1.weight layer4.0.downsample.1.bias layer4.1.conv1.weight layer4.1.bn1.running_mean layer4.1.bn1.running_var layer4.1.bn1.weight layer4.1.bn1.bias layer4.1.conv2.weight layer4.1.bn2.running_mean layer4.1.bn2.running_var layer4.1.bn2.weight layer4.1.bn2.bias layer4.1.conv3.weight layer4.1.bn3.running_mean layer4.1.bn3.running_var layer4.1.bn3.weight layer4.1.bn3.bias layer4.2.conv1.weight layer4.2.bn1.running_mean layer4.2.bn1.running_var layer4.2.bn1.weight layer4.2.bn1.bias layer4.2.conv2.weight layer4.2.bn2.running_mean layer4.2.bn2.running_var layer4.2.bn2.weight layer4.2.bn2.bias layer4.2.conv3.weight layer4.2.bn3.running_mean layer4.2.bn3.running_var layer4.2.bn3.weight layer4.2.bn3.bias fc.weight fc.bias
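
The key list above has parameter names such as conv1.weight at the top level, which is what a bare state dict looks like, whereas the error refers to a top-level "model" key. This may indicate that a backbone-only file was passed where the training command expects the MDETR checkpoint mentioned in the Training section, although that is not confirmed in this thread. A minimal sketch for telling the two layouts apart is shown below; describe_checkpoint is a hypothetical helper, and in practice its argument would come from something like torch.load(path, map_location="cpu").

```python
from typing import Any, Mapping


def describe_checkpoint(ckpt: Mapping[str, Any]) -> str:
    """Guess the layout of an already-loaded checkpoint dictionary."""
    if "model" in ckpt or "model_ema" in ckpt:
        # Top-level keys of the kind the KeyError refers to.
        return "wrapped"
    if any(str(key).endswith((".weight", ".bias")) for key in ckpt):
        # Parameter names sit directly at the top level, e.g. "conv1.weight".
        return "bare_state_dict"
    return "unknown"


if __name__ == "__main__":
    print(describe_checkpoint({"conv1.weight": None, "fc.bias": None}))
    print(describe_checkpoint({"model": {}, "epoch": 3}))
```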

opened by Swt2000 4
hyper-parameters change

Thank you for your work! How did you determine the hyper-parameters epochs=20 and batchsize=16 used for training on the HC-STVG2.0 dataset? Does changing these parameters have a big impact on performance? Have you tried training for more epochs?

opened by Ryan-Wu-13 3
Any plan on applying it to Action tube detection

Hi great work!

Thanks for sharing the code. Do you have any plan to apply it on the action tube detection problem? I guess we have to strip off text encoder.

Best Gurkirt

opened by gurkirt 3
Incorrect viou metric calculation

Hi,

I found a bug in viou metric calculation.

Here, the max_end is min_end indeed. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L120 https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/vidstg_eval.py#L116

Then, the length of union_predgt is shorter. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L137-L141

Then, the calculated viou is much higher than the correct one. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L181
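
To make the effect described above concrete, here is a small sketch of vIoU under its usual definition: the sum of per-frame box IoU over the frames shared by prediction and ground truth, divided by the number of frames in their temporal union. This is an independent illustration, not the repository's evaluation code, and it does not confirm or refute the reported bug; the names box_iou and viou and the (x1, y1, x2, y2) box format are assumptions of this example.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU for tubes given as {frame_id: box} dictionaries."""
    union_frames = set(pred) | set(gt)   # frames in either tube
    inter_frames = set(pred) & set(gt)   # frames in both tubes
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred[t], gt[t]) for t in inter_frames)
    return total / len(union_frames)


if __name__ == "__main__":
    box = (0.0, 0.0, 10.0, 10.0)
    pred = {t: box for t in range(0, 4)}  # frames 0..3
    gt = {t: box for t in range(2, 6)}    # frames 2..5
    # Two shared frames with IoU 1 over a six-frame union.
    print(viou(pred, gt))
```

In this toy case the union has six frames; dividing by a shorter span instead would raise the score, which is the kind of inflation the report describes.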

opened by zanglam 2
AssertionError: Caught AssertionError in DataLoader worker process 1.

I am training on 4 × 3090 (24 GB) GPUs, but the data around 200-300 seems to trigger an error:

AssertionError: Caught AssertionError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 219, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/home/Newdisk/zhangzp/TubeDETR/TubeDETR/datasets/vidstg.py", line 116, in __getitem__
    assert len(images_list) == len(frame_ids)
AssertionError

Killing subprocess 2844448
Killing subprocess 2844449
Killing subprocess 2844450
Killing subprocess 2844451
Traceback (most recent call last):
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangzp/anaconda3/envs/tubedetr_env/bin/python', '-u', 'main.py', '--ema', '--load=pretrained_resnet101_checkpoint.pth', '--combine_datasets=vidstg', '--combine_datasets_val=vidstg', '--dataset_config', 'config/vidstg.json', '--output-dir=Vidstg_train']' returned non-zero exit status 1.

opened by johnbager 1
Pretrained models' performance doesn't match the result

Hi, I downloaded the checkpoints pretrained on HC-STVG2.0, but the results are: vIoU: 0.3555, vIoU@0.3: 0.5675, vIoU@0.5: 0.3000. I also find that the loss is larger than 25, and the loss at epoch 0 is almost 58. I have changed the stride and resolution to match the checkpoints' training configuration. Did I miss something?

opened by ykxixi 1
About m_sIoU

Hi, thank you for your excellent work! I have a question about the m_sIoU reported in your paper. We can estimate the spatial grounding accuracy inside the predicted time span (t_s, t_e) by calculating m_vIoU / m_tIoU. But I observed that in your model, m_sIoU << m_vIoU / m_tIoU (e.g., for HC-STVG2.0 with resolution 352 and temporal stride 4, m_sIoU = 0.649 and m_vIoU / m_tIoU = 0.467 / 0.539 = 0.866). This suggests that for the frames outside the predicted time span (t_s, t_e), the IoU between the predicted bounding boxes and the ground truth boxes is very low. This is quite interesting to me. Could you provide some analysis or explanation of it?

opened by zanglam 1
Bump pillow from 9.0.1 to 9.3.0
Bumps pillow from 9.0.1 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

Initialize libtiff buffer when saving #6699 [@radarhere]

Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [@wiredfool]

Inline fname2char to fix memory leak #6329 [@nulano]

Fix memory leaks related to text features #6330 [@nulano]

Use double quotes for version check on old CPython on Windows #6695 [@hugovk]

GHA: replace deprecated set-output command with GITHUB_OUTPUT file #6697 [@nulano]

Remove backup implementation of Round for Windows platforms #6693 [@cgohlke]

Upload fribidi.dll to GitHub Actions #6532 [@nulano]

Fixed set_variation_by_name offset #6445 [@radarhere]

Windows build improvements #6562 [@nulano]

Fix malloc in _imagingft.c:font_setvaraxes #6690 [@cgohlke]

Only use ASCII characters in C source file #6691 [@cgohlke]

Release Python GIL when converting images using matrix operations #6418 [@hmaarrfk]

Added ExifTags enums #6630 [@radarhere]

Do not modify previous frame when calculating delta in PNG #6683 [@radarhere]

Added support for reading BMP images with RLE4 compression #6674 [@npjg]

Decode JPEG compressed BLP1 data in original mode #6678 [@radarhere]

pylint warnings #6659 [@marksmayo]

Added GPS TIFF tag info #6661 [@radarhere]

Added conversion between RGB/RGBA/RGBX and LAB #6647 [@radarhere]

Do not attempt normalization if mode is already normal #6644 [@radarhere]

Fixed seeking to an L frame in a GIF #6576 [@radarhere]

Consider all frames when selecting mode for PNG save_all #6610 [@radarhere]

Don't reassign crc on ChunkStream close #6627 [@radarhere]

Raise a warning if NumPy failed to raise an error during conversion #6594 [@radarhere]

Only read a maximum of 100 bytes at a time in IMT header #6623 [@radarhere]

Show all frames in ImageShow #6611 [@radarhere]

Allow FLI palette chunk to not be first #6626 [@radarhere]

If first GIF frame has transparency for RGB_ALWAYS loading strategy, use RGBA mode #6592 [@radarhere]

Round box position to integer when pasting embedded color #6517 [@radarhere]

Removed EXIF prefix when saving WebP #6582 [@radarhere]

Pad IM palette to 768 bytes when saving #6579 [@radarhere]

Added DDS BC6H reading #6449 [@ShadelessFox]

Added support for opening WhiteIsZero 16-bit integer TIFF images #6642 [@JayWiz]

Raise an error when allocating translucent color to RGB palette #6654 [@jsbueno]

Moved mode check outside of loops #6650 [@radarhere]

Added reading of TIFF child images #6569 [@radarhere]

Improved ImageOps palette handling #6596 [@PososikTeam]

Defer parsing of palette into colors #6567 [@radarhere]

Apply transparency to P images in ImageTk.PhotoImage #6559 [@radarhere]

Use rounding in ImageOps contain() and pad() #6522 [@bibinhashley]

Fixed GIF remapping to palette with duplicate entries #6548 [@radarhere]

Allow remap_palette() to return an image with less than 256 palette entries #6543 [@radarhere]

Corrected BMP and TGA palette size when saving #6500 [@radarhere]

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

Initialize libtiff buffer when saving #6699 [radarhere]

Inline fname2char to fix memory leak #6329 [nulano]

Fix memory leaks related to text features #6330 [nulano]

Use double quotes for version check on old CPython on Windows #6695 [hugovk]

Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

Fixed set_variation_by_name offset #6445 [radarhere]

Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

Added ExifTags enums #6630 [radarhere]

Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

Added GPS TIFF tag info #6661 [radarhere]

Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)

Commits

d594f4c Update CHANGES.rst [ci skip]

909dc64 9.3.0 version bump

1a51ce7 Merge pull request #6699 from hugovk/security-libtiff_buffer

2444cdd Merge pull request #6700 from hugovk/security-samples_per_pixel-sec

744f455 Added release notes

0846bfa Add to release notes

799a6a0 Fix linting

00b25fd Hide UserWarning in logs

05b175e Tighter test case

13f2c5a Prevent DOS with large SAMPLESPERPIXEL in Tiff IFD

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Bump pillow from 8.4.0 to 9.0.1
Bumps pillow from 8.4.0 to 9.0.1.

Release notes

Sourced from pillow's releases.

9.0.1

https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

Changes

In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk]

Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

9.0.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.0.0.html

Changes

Restrict builtins for ImageMath.eval() #5923 [@radarhere]

Ensure JpegImagePlugin stops at the end of a truncated file #5921 [@radarhere]

Fixed ImagePath.Path array handling #5920 [@radarhere]

Remove consecutive duplicate tiles that only differ by their offset #5919 [@radarhere]

Removed redundant part of condition #5915 [@radarhere]

Explicitly enable strip chopping for large uncompressed TIFFs #5517 [@kmilos]

Use the Windows method to get TCL functions on Cygwin #5807 [@DWesl]

Changed error type to allow for incremental WebP parsing #5404 [@radarhere]

Improved I;16 operations on big endian #5901 [@radarhere]

Ensure that BMP pixel data offset does not ignore palette #5899 [@radarhere]

Limit quantized palette to number of colors #5879 [@radarhere]

Use latin1 encoding to decode bytes #5870 [@radarhere]

Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [@radarhere]

When saving RGBA to GIF, make use of first transparent palette entry #5859 [@radarhere]

Pass SAMPLEFORMAT to libtiff #5848 [@radarhere]

Added rounding when converting P and PA #5824 [@radarhere]

Improved putdata() documentation and data handling #5910 [@radarhere]

Exclude carriage return in PDF regex to help prevent ReDoS #5912 [@radarhere]

Image.NONE is only used for resampling and dithers #5908 [@radarhere]

Fixed freeing pointer in ImageDraw.Outline.transform #5909 [@radarhere]

Add Tidelift alignment action and badge #5763 [@aclark4life]

Replaced further direct invocations of setup.py #5906 [@radarhere]

Added ImageShow support for xdg-open #5897 [@m-shinder]

Fixed typo #5902 [@radarhere]

Switched from deprecated "setup.py install" to "pip install ." #5896 [@radarhere]

Support 16-bit grayscale ImageQt conversion #5856 [@cmbruns]

Fixed raising OSError in _safe_read when size is greater than SAFEBLOCK #5872 [@radarhere]

Convert subsequent GIF frames to RGB or RGBA #5857 [@radarhere]

WebP: Fix memory leak during decoding on failure #5798 [@ilai-deutel]

Do not prematurely return in ImageFile when saving to stdout #5665 [@infmagic2047]

Added support for top right and bottom right TGA orientations #5829 [@radarhere]

Corrected ICNS file length in header #5845 [@radarhere]

Block tile TIFF tags when saving #5839 [@radarhere]

Added line width argument to ImageDraw polygon #5694 [@radarhere]

Do not redeclare class each time when converting to NumPy #5844 [@radarhere]

Only prevent repeated polygon pixels when drawing with transparency #5835 [@radarhere]

... (truncated)

Changelog

Sourced from pillow's changelog.

9.0.1 (2022-02-03)

In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

9.0.0 (2022-01-02)

Restrict builtins for ImageMath.eval(). CVE-2022-22817 #5923 [radarhere]

Ensure JpegImagePlugin stops at the end of a truncated file #5921 [radarhere]

Fixed ImagePath.Path array handling. CVE-2022-22815, CVE-2022-22816 #5920 [radarhere]

Remove consecutive duplicate tiles that only differ by their offset #5919 [radarhere]

Improved I;16 operations on big endian #5901 [radarhere]

Limit quantized palette to number of colors #5879 [radarhere]

Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [radarhere]

When saving RGBA to GIF, make use of first transparent palette entry #5859 [radarhere]

Pass SAMPLEFORMAT to libtiff #5848 [radarhere]

Added rounding when converting P and PA #5824 [radarhere]

Improved putdata() documentation and data handling #5910 [radarhere]

Exclude carriage return in PDF regex to help prevent ReDoS #5912 [hugovk]

Fixed freeing pointer in ImageDraw.Outline.transform #5909 [radarhere]

... (truncated)

Commits

6deac9e 9.0.1 version bump

c04d812 Update CHANGES.rst [ci skip]

4fabec3 Added release notes for 9.0.1

02affaa Added delay after opening image with xdg-open

ca0b585 Updated formatting

427221e In show_file, use os.remove to remove temporary images

c930be0 Restrict builtins within lambdas for ImageMath.eval

75b69dd Dont need to pin for GHA

cd938a7 Autolink CWE numbers with sphinx-issues

2e9c461 Add CVE IDs

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0
Bump numpy from 1.21.4 to 1.22.0
Bumps numpy from 1.21.4 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements; the highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)
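As a hedged illustration of the removal described above, the sketch below tries each of the listed capitalized dtype strings and records which ones are rejected. It assumes NumPy 1.22 or later is installed; the variable names (`removed`, `rejected`, `still_valid`) are illustrative only.

```python
import numpy as np

# Capitalized dtype strings listed in the release notes as removed.
removed = ["Bytes0", "Datetime64", "Str0", "Uint32", "Uint64"]

rejected = []
for name in removed:
    try:
        np.dtype(name)
    except TypeError:
        # On NumPy >= 1.22 these strings are no longer accepted.
        rejected.append(name)

# The lowercase spelling is unaffected by this change.
still_valid = np.dtype("uint32")

print(rejected)
print(still_valid)
```

Code that still passes one of these strings would need to switch to a supported spelling such as `"uint32"` or to the type object `np.uint32`.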

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)
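A minimal sketch of the `numpy.loads` migration mentioned above, using only the standard library. The payload is a made-up example; the point is that the recommended replacement is a direct call to `pickle.loads`.

```python
import pickle

# Hypothetical payload that older code might have read with numpy.loads.
original = {"shape": (2, 3), "values": [1, 2, 3, 4, 5, 6]}
payload = pickle.dumps(original)

# Before NumPy 1.22 (deprecated since v1.15): obj = numpy.loads(payload)
# Recommended replacement:
obj = pickle.loads(payload)

print(obj == original)
```

As with any use of pickle, this should only be applied to data from a trusted source.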

... (truncated)

Commits

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release

fd66547 REL: Prepare for the NumPy 1.22.0 release.

125304b wip

c283859 Merge pull request #20682 from charris/backport-20416

5399c03 Merge pull request #20681 from charris/backport-20954

f9c45f8 Merge pull request #20680 from charris/backport-20663

794b36f Update armccompiler.py

d93b14e Update test_public_api.py

7662c07 Update init.py

311ab52 Update armccompiler.py

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Problem about weight initialization using DDP
Hi Antoine, it seems that you set a different seed for each rank before building the model. This may lead to different parameter initializations for the model replica on each rank. Is this a mistake or a deliberate design choice?

Here is a comment from the PyTorch Lightning DDP advice:

Setting all the random seeds to the same value. This is important in a distributed training setting. Each rank will get its own set of initial weights. If they don't match up, the gradients will not match either, leading to training that may not converge.

"""starts from main.py line 347"""
# fix the seed for reproducibility
seed = args.seed + dist.get_rank()
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
# torch.set_deterministic(True)
torch.use_deterministic_algorithms(True)

# Build the model
model, criterion, weight_dict = build_model(args)
model.to(device)
opened by K-Nick 2
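The seeding concern raised in the issue above can be illustrated without PyTorch. The sketch below is not TubeDETR code: `init_weights` is a hypothetical stand-in for model construction that draws a few values from a seeded standard-library RNG, and `base_seed` and `world_size` are made-up values. It only shows that `seed + rank` yields different draws per rank, while a shared seed yields identical ones.

```python
import random


def init_weights(seed, n=4):
    """Hypothetical stand-in for model construction: draw n values from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


base_seed = 42   # illustrative value, playing the role of args.seed
world_size = 4   # illustrative number of ranks

# Per-rank seeds, mirroring `seed = args.seed + dist.get_rank()`.
per_rank = [init_weights(base_seed + rank) for rank in range(world_size)]

# The same seed on every rank.
shared = [init_weights(base_seed) for _ in range(world_size)]

per_rank_identical = all(w == per_rank[0] for w in per_rank)
shared_identical = all(w == shared[0] for w in shared)

print(per_rank_identical, shared_identical)
```

Whether this difference matters in practice depends on how the model is wrapped: according to the PyTorch documentation, `torch.nn.parallel.DistributedDataParallel` broadcasts the model state from rank 0 to the other ranks when it is constructed, which would make the initial weights match regardless of the per-rank seed.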
Problem with dataset Download

Hello, many of the Baidu links for the VidSTG dataset no longer work: Part1, Part2 and Part4 cannot be downloaded. Could you please send me the dataset?

opened by Xiyu-AI 1

Training error in tubedetr.py file.

I tried to train the network on the HC-STVGv2 dataset using the command provided in the README.md file:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --ema \
  --load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
  --v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=output --batch_size=8

Unfortunately, I encountered this error in models/tubedetr.py, line 180:

  File "/root/paddlejob/workspace/STVG/TubeDETR/models/tubedetr.py", line 180, in forward
    tpad_src = tpad_src.view(b * n_clips, f, h, w)
RuntimeError: shape '[160, 256, 7, 12]' is invalid for input of size 2817024

Besides, the durations of the eight samples are: [100, 100, 69, 100, 65, 86, 100, 100].

I think this problem is probably related to the padding approach. Do you have any idea what causes this bug and how to fix it? Thank you very much!
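The numbers in the traceback can be checked with a few lines of arithmetic. This sketch only compares element counts taken from the error message; it does not diagnose the cause, and the variable names are illustrative.

```python
from math import prod

target_shape = (160, 256, 7, 12)  # shape requested by .view() in the traceback
actual_numel = 2817024            # input size reported in the traceback

# Number of elements the requested shape would need.
expected_numel = prod(target_shape)

# Elements in one (f, h, w) slice, and how many such slices the input really holds.
per_slice = prod(target_shape[1:])
actual_leading, remainder = divmod(actual_numel, per_slice)

print(expected_numel)             # elements required by the target shape
print(actual_leading, remainder)  # slices actually present, leftover elements
```

The input divides evenly into 131 slices of shape (256, 7, 12), whereas the `view` call expects `b * n_clips = 160` of them, so the tensor holds fewer slices than the reshape assumes.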

opened by OliverHxh 2

Owner

Antoine Yang

Spatio-Temporal Entropy Model (STEM) for end-to-end learned video compression.

Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

A PyTorch implementation of the baseline method in Panoptic Narrative Grounding (ICCV 2021 Oral)

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Deep generative modeling for time-stamped heterogeneous data, enabling high-fidelity models for a large variety of spatio-temporal domains.

Learning Spatio-Temporal Transformer for Visual Tracking

Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks

Implementation of the "PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences" paper.

Implementation of the "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos" paper.

Code for the paper "Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds" (ICCV 2021)

Digital Twin Mobility Profiling: A Spatio-Temporal Graph Learning Approach

DeepSTD: Mining Spatio-temporal Disturbances of Multiple Context Factors for Citywide Traffic Flow Prediction

Codes for TIM2021 paper "Anchor-Based Spatio-Temporal Attention 3-D Convolutional Networks for Dynamic 3-D Point Cloud Sequences"

Self-supervised spatio-spectro-temporal representation learning for EEG analysis

[CVPR 2022] CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation

The official implementation of CVPR 2021 Paper: Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation.

Implementation of temporal pooling methods studied in [ICIP'20] A Comparative Evaluation Of Temporal Pooling Methods For Blind Video Quality Assessment

Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity
