Target Adaptive Context Aggregation for Video Scene Graph Generation
This is a PyTorch implementation of Target Adaptive Context Aggregation for Video Scene Graph Generation (TRACE).
Requirements
- PyTorch >= 1.2 (ours: 1.7.1, CUDA 10.1)
- torchvision >= 0.4 (ours: 0.8.2, CUDA 10.1)
- cython
- matplotlib
- numpy
- scipy
- opencv
- pyyaml
- packaging
- pycocotools
- tensorboardX
- tqdm
- pillow
- scikit-image
- h5py
- yacs
- ninja
- overrides
- mmcv
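A quick way to confirm the environment matches these requirements is a minimal sanity check like the one below (not part of the repo):

```python
# Minimal environment sanity check (not part of the repo).
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expect >= 1.2 (ours: 1.7.1)
print("torchvision:", torchvision.__version__)  # expect >= 0.4 (ours: 0.8.2)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)        # ours: 10.1
```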
Compilation
Compile the CUDA code in the Detectron submodule and in the repo:
# ROOT=path/to/cloned/repository
cd $ROOT/Detectron_pytorch/lib
sh make.sh
cd $ROOT/lib
sh make.sh
Data Preparation
Download Datasets
Download links: VidVRD and AG.
Create directories for the datasets. The directory structure under `./data/` should look like:
|-- data
| |-- ag
| |-- vidvrd
| |-- obj_embed
where `ag` and `vidvrd` are for the AG and VidVRD datasets, and `obj_embed` is for the pre-trained GloVe word vectors. The final directory for GloVe should look like:
|-- obj_embed
| |-- glove.6B.200d.pt
| |-- glove.6B.300d.pt
| |-- glove.6B.300d.txt
| |-- glove.6B.200d.txt
| |-- glove.6B.100d.txt
| |-- glove.6B.50d.txt
| |-- glove.6B.300d
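For reference, each `glove.6B.*.txt` file stores one token per line followed by its vector components. Below is a minimal loader sketch for illustration only; the repo reads these embeddings through its own utilities, and we assume the cached `.pt` files are derived from the `.txt` files:

```python
# Illustrative GloVe .txt loader (the repo uses its own embedding utilities;
# the .pt files are assumed to be cached tensors built from these .txt files).
import numpy as np

def load_glove_txt(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# glove = load_glove_txt("data/obj_embed/glove.6B.200d.txt")
```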
AG
Put the `.mp4` files into `./data/ag/videos/`. Put the annotations into `./data/ag/annotations/`.

The final directory structure for the AG dataset should look like:
|-- ag
| |-- annotations
| | |-- object_classes.txt
| | |-- ...
| |-- videos
| | |-- ....mp4
| |-- Charades_annotations
VidVRD
Put the `.mp4` files into `./data/vidvrd/videos/`. Put the three folders `test`, `train` and `videos` from the vidvrd-annotations into `./data/vidvrd/annotations/`.
Download the precomputed features, models and detected relations from here (or here). Extract `features` and `models` into `./data/vidvrd/`.
The final directory structure for the VidVRD dataset should look like:
|-- vidvrd
| |-- annotations
| | |-- test
| | |-- train
| | |-- videos
| | |-- predicate.txt
| | |-- object.txt
| | |-- ...
| |-- features
| | |-- relation
| | |-- traj_cls
| | |-- traj_cls_gt
| |-- models
| | |-- baseline_setting.json
| | |-- ...
| |-- videos
| | |-- ILSVRC2015_train_00005003.mp4
| | |-- ...
Change the format of annotations for AG and VidVRD
# ROOT=path/to/cloned/repository
cd $ROOT
python tools/rename_ag.py
python tools/rename_vidvrd_anno.py
python tools/get_vidvrd_pretrained_rois.py --out_rpath pre_processed_boxes_gt_dense_more --rpath traj_cls_gt
python tools/get_vidvrd_pretrained_rois.py --out_rpath pre_processed_boxes_dense_more
Dump frames
Our ffmpeg version is 4.2.2-0york0~16.04, so we use `--ignore_editlist` to avoid some frames being ignored. The jpg format saves drive space.

Dump the annotated frames for AG and VidVRD:
python tools/dump_frames.py --ignore_editlist
python tools/dump_frames.py --ignore_editlist --video_dir data/vidvrd/videos --frame_dir data/vidvrd/frames --frame_list_file val_fname_list.json,train_fname_list.json --annotation_dir data/vidvrd/annotations --st_id 0
Dump the sampled high-quality frames for AG and VidVRD:
python tools/dump_frames.py --frame_dir data/ag/sampled_frames --ignore_editlist --frames_store_type jpg --high_quality --sampled_frames
python tools/dump_frames.py --ignore_editlist --video_dir data/vidvrd/videos --frame_dir data/vidvrd/sampled_frames --frame_list_file val_fname_list.json,train_fname_list.json --annotation_dir data/vidvrd/annotations --frames_store_type jpg --high_quality --sampled_frames --st_id 0
If you want to dump all frames in jpg format:
python tools/dump_frames.py --all_frames --frame_dir data/ag/all_frames --ignore_editlist --frames_store_type jpg
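For intuition, frame dumping simply decodes each video into per-frame image files; the sketch below does this with OpenCV. It is only an illustration, not the repo's implementation (`tools/dump_frames.py` drives ffmpeg, and the output naming/layout here is an assumption):

```python
# Conceptual frame-dumping sketch with OpenCV (illustration only; the repo's
# tools/dump_frames.py uses ffmpeg and its own output layout).
import os
import cv2

def dump_frames(video_path, frame_dir, store_type="jpg"):
    os.makedirs(frame_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Frame file naming here is an assumption, not the repo's convention.
        cv2.imwrite(os.path.join(frame_dir, f"{idx:06d}.{store_type}"), frame)
        idx += 1
    cap.release()
    return idx

# dump_frames("data/ag/videos/<some_video>.mp4", "data/ag/all_frames/<some_video>")
```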
Get classes in json format for AG
# ROOT=path/to/cloned/repository
cd $ROOT
python txt2json.py
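Conceptually, this step converts the plain-text class lists (e.g. `object_classes.txt`) into JSON. Below is a hedged sketch of the idea; the output file name and JSON layout are assumptions, so refer to `txt2json.py` for the actual conversion:

```python
# Sketch of a txt -> json class-list conversion (output name and layout are
# assumptions; see txt2json.py for what the repo actually produces).
import json

with open("data/ag/annotations/object_classes.txt") as f:
    classes = [line.strip() for line in f if line.strip()]

with open("data/ag/annotations/object_classes.json", "w") as f:
    json.dump(classes, f)
```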
Get Charades train/test split for AG
Download the Charades annotations and extract them into `./data/ag/Charades_annotations/`. Then run:
# ROOT=path/to/cloned/repository
cd $ROOT
python tools/dataset_split.py
Pretrained Models
Download model weights from here.
- pretrained object detection
- TRACE trained on VidVRD, in `detection_models/vidvrd/trained_rel`
- TRACE trained on AG, in `detection_models/ag/trained_rel`
Performance
VidVRD, gt boxes

| Method | mAP | Recall@50 | Recall@100 |
| --- | --- | --- | --- |
| TRACE | 30.6 | 19.3 | 24.6 |

VidVRD, detected boxes

| Method | mAP | Recall@50 | Recall@100 |
| --- | --- | --- | --- |
| TRACE | 16.3 | 9.2 | 11.2 |

AG, detected boxes
Training Relationship Detection Models
VidVRD
# ROOT=path/to/cloned/repository
cd $ROOT
CUDA_VISIBLE_DEVICES=0 python tools/train_net_step_rel.py --dataset vidvrd --cfg configs/vidvrd/vidvrd_res101xi3d50_all_boxes_sample_train_flip_dc5_2d_new.yaml --nw 8 --use_tfboard --disp_interval 20 --o SGD --lr 0.025
AG
# ROOT=path/to/cloned/repository
cd $ROOT
CUDA_VISIBLE_DEVICES=0 python tools/train_net_step_rel.py --dataset ag --cfg configs/ag/res101xi3d50_dc5_2d.yaml --nw 8 --use_tfboard --disp_interval 20 --o SGD --lr 0.01
Evaluating Relationship Detection Models
VidVRD
evaluation for gt boxes
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 python tools/test_net_rel.py --dataset vidvrd --cfg configs/vidvrd/vidvrd_res101xi3d50_gt_boxes_dc5_2d_new.yaml --load_ckpt Outputs/vidvrd_res101xi3d50_all_boxes_sample_train_flip_dc5_2d_new/Aug01-16-20-06_gpuserver-11_step_with_prd_cls_v3/ckpt/model_step12999.pth --output_dir Outputs/vidvrd_new101 --do_val --multi-gpu-testing
python tools/transform_vidvrd_results.py --input_dir Outputs/vidvrd_new101 --output_dir Outputs/vidvrd_new101 --is_gt_traj
python tools/test_vidvrd.py --prediction Outputs/vidvrd_new101/baseline_relation_prediction.json --groundtruth data/vidvrd/annotations/test_gt.json
evaluation for detected boxes
CUDA_VISIBLE_DEVICES=1 python tools/test_net_rel.py --dataset vidvrd --cfg configs/vidvrd/vidvrd_res101xi3d50_pred_boxes_flip_dc5_2d_new.yaml --load_ckpt Outputs/vidvrd_res101xi3d50_all_boxes_sample_train_flip_dc5_2d_new/Aug01-16-20-06_gpuserver-11_step_with_prd_cls_v3/ckpt/model_step12999.pth --output_dir Outputs/vidvrd_new101_det2 --do_val
python tools/transform_vidvrd_results.py --input_dir Outputs/vidvrd_new101_det2 --output_dir Outputs/vidvrd_new101_det2
python tools/test_vidvrd.py --prediction Outputs/vidvrd_new101_det2/baseline_relation_prediction.json --groundtruth data/vidvrd/annotations/test_gt.json
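As a reminder of what these numbers mean: Recall@K checks how many ground-truth (subject, predicate, object) triplets appear among the top-K scored predictions for a video. The toy sketch below ignores the trajectory-overlap (vIoU) matching that the official evaluator additionally requires:

```python
# Toy Recall@K over (subject, predicate, object) triplets, ignoring the
# trajectory-overlap matching the official VidVRD evaluator performs.
def recall_at_k(scored_preds, gt_triplets, k):
    top_k = [t for t, _ in sorted(scored_preds, key=lambda x: x[1], reverse=True)[:k]]
    hits = sum(1 for gt in gt_triplets if gt in top_k)
    return hits / max(len(gt_triplets), 1)

preds = [(("dog", "chase", "person"), 0.9), (("person", "ride", "bicycle"), 0.7)]
gts = [("dog", "chase", "person"), ("dog", "run_past", "car")]
print(recall_at_k(preds, gts, 50))  # 0.5
```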
AG
evaluation for detected boxes, Recalls (SGDet)
CUDA_VISIBLE_DEVICES=4 python tools/test_net_rel.py --dataset ag --cfg configs/ag/res101xi3d50_dc5_2d.yaml --load_ckpt Outputs/res101xi3d50_dc5_2d/Nov01-21-50-49_gpuserver-11_step_with_prd_cls_v3/ckpt/model_step177329.pth --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --do_val
#evaluation for detected boxes, mRecalls
python tools/visualize.py --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --num 60000 --no_do_vis --rel_class_recall
evaluation for detected boxes, mAP_{rel}
CUDA_VISIBLE_DEVICES=4 python tools/test_net_rel.py --dataset ag --cfg configs/ag/res101xi3d50_dc5_2d.yaml --load_ckpt Outputs/res101xi3d50_dc5_2d/Nov01-21-50-49_gpuserver-11_step_with_prd_cls_v3/ckpt/model_step177329.pth --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --do_val --eva_map --topk 50
evaluation for gt boxes, Recalls (SGCls)
CUDA_VISIBLE_DEVICES=4 python tools/test_net_rel.py --dataset ag --cfg configs/ag/res101xi3d50_dc5_2d.yaml --load_ckpt Outputs/res101xi3d50_dc5_2d/Nov01-21-50-49_gpuserver-11_step_with_prd_cls_v3/ckpt/model_step177329.pth --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --do_val --use_gt_boxes
# evaluation for gt boxes, mRecalls
python tools/visualize.py --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --num 60000 --no_do_vis --rel_class_recall
evaluation for gt boxes, gt object labels, Recalls (PredCls)
CUDA_VISIBLE_DEVICES=4 python tools/test_net_rel.py --dataset ag --cfg configs/ag/res101xi3d50_dc5_2d.yaml --load_ckpt Outputs/res101xi3d50_dc5_2d/Nov01-21-50-49_gpuserver-11_step_with_prd_cls_v3/ckpt/model_step177329.pth --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --do_val --use_gt_boxes --use_gt_labels
# evaluation for gt boxes, gt object labels, mRecalls
python tools/visualize.py --output_dir Outputs/ag_val_101_ag_dc5_jin_map_new_infer_multiatten --num 60000 --no_do_vis --rel_class_recall
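The mRecall figures come from `tools/visualize.py --rel_class_recall`, which (as we understand it) computes recall per predicate class and then averages over classes rather than pooling all triplets. A toy sketch of that averaging, with made-up predicate names and counts:

```python
# Toy mean recall: average per-predicate-class recall (names/counts made up).
def mean_recall(hits_per_class, gt_per_class):
    recalls = [hits_per_class.get(c, 0) / n for c, n in gt_per_class.items() if n > 0]
    return sum(recalls) / len(recalls)

print(mean_recall({"in_front_of": 8, "holding": 1},
                  {"in_front_of": 10, "holding": 4}))  # (0.8 + 0.25) / 2 = 0.525
```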
Hint
- We apply dilated convolution in I3D now, but observe a gridding effect in the temporal feature maps.
Acknowledgements
This project is built on top of ContrastiveLosses4VRD, ActionGenome and VidVRD-helper. The corresponding papers are Graphical Contrastive Losses for Scene Graph Parsing, Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs and Video Visual Relation Detection.
Citing
If you use this code in your research, please use the following BibTeX entry.
@inproceedings{Target_Adaptive_Context_Aggregation_for_Video_Scene_Graph_Generation,
  author    = {Yao Teng and
               Limin Wang and
               Zhifeng Li and
               Gangshan Wu},
  title     = {Target Adaptive Context Aggregation for Video Scene Graph Generation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages     = {13688--13697},
  year      = {2021}
}