Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Overview


Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation)

HOTR: End-to-End Human-Object Interaction Detection with Transformers

HOTR is a novel framework that directly predicts a set of {human, object, interaction} triplets from an image using a transformer-based encoder-decoder. Through set-level prediction, our method effectively exploits the inherent semantic relationships in an image and does not require the time-consuming post-processing that is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.

HOTR is composed of three main components: a shared encoder with a CNN backbone, a parallel decoder, and a recomposition layer that generates the final HOI triplets. An overview of our pipeline is presented below.
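
For illustration only, here is a minimal sketch of the recomposition idea: each interaction representation from the parallel decoder points to its human and object instance representations via a temperature-scaled softmax over similarities, and is classified into actions. The tensor names, FFN heads, and normalization below are assumptions made for this sketch, not the repository's actual implementation.

import torch
import torch.nn.functional as F

def recompose(inst_repr, hoi_repr, ffn_h, ffn_o, action_head, tau=0.05):
    # inst_repr: [num_inst_queries, d] instance representations (instance decoder)
    # hoi_repr:  [num_hoi_queries, d]  interaction representations (interaction decoder)
    # ffn_h / ffn_o: small heads projecting interaction representations into pointer space (assumed)
    # action_head: linear multi-label action classifier (assumed)
    inst = F.normalize(inst_repr, dim=-1)
    h_ptr = F.normalize(ffn_h(hoi_repr), dim=-1)            # human pointers
    o_ptr = F.normalize(ffn_o(hoi_repr), dim=-1)            # object pointers

    # Each HOI query "points" to the most similar instance query for its human / object slot.
    h_prob = ((h_ptr @ inst.t()) / tau).softmax(dim=-1)     # [num_hoi, num_inst]
    o_prob = ((o_ptr @ inst.t()) / tau).softmax(dim=-1)
    actions = action_head(hoi_repr)                         # action logits per HOI query
    return h_prob, o_prob, actions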

1. Environmental Setup

$ conda create -n kakaobrain python=3.7
$ conda activate kakaobrain
$ conda install -c pytorch pytorch torchvision # PyTorch 1.7.1, torchvision 0.8.2, CUDA=11.0
$ conda install cython scipy
$ pip install pycocotools
$ pip install opencv-python
$ pip install wandb

2. HOI dataset setup

Our current version of HOTR supports experiments on the V-COCO dataset. Download the V-COCO dataset under the cloned repository directory.

# V-COCO setup
$ git clone https://github.com/s-gupta/v-coco.git
$ cd v-coco
$ ln -s [:COCO_DIR] coco/images # COCO_DIR contains images of train2014 & val2014
$ python script_pick_annotations.py [:COCO_DIR]/annotations

If you wish to place the v-coco dataset in your own directory, simply change the 'data_path' argument to the directory where you downloaded the v-coco dataset.

--data_path [:your_own_directory]/v-coco

3. How to Train/Test HOTR on V-COCO dataset

For testing, you can either use your own trained weights by passing their path to the 'resume' argument, or use our provided weights. Below is an example of how to edit the Makefile.

# [Makefile]
# Testing your own trained weights
multi_test:
	python -m torch.distributed.launch \
		--nproc_per_node=8 \
		...
		--resume checkpoints/vcoco/KakaoBrain/multi_run_000001/best.pth # the best-performing checkpoint is saved in this format

# Testing our provided trained weights
multi_test:
	python -m torch.distributed.launch \
		--nproc_per_node=8 \
		...
		--resume checkpoints/vcoco/q16.pth # download q16.pth as described below

In order to use our provided weights, you can download them from this link. Then pass the path of the downloaded file (for example, we put the weights under checkpoints/vcoco/q16.pth) to the 'resume' argument.

# multi-gpu training / testing (8 GPUs)
$ make multi_[train/test]

# single-gpu training / testing
$ make single_[train/test]

4. Results

Here, we provide improved results on V-COCO Scenario 1 (58.9 mAP, 0.5 ms) compared to the version in our initial submission (55.2 mAP, 0.9 ms). This is obtained without applying any priors on the scores (see iCAN).

Epoch   # queries   Scenario 1   Scenario 2   Checkpoint
100     16          58.9         63.8         download

If you want to use pretrained weights for inference, download the pretrained weights (from the above link) under checkpoints/vcoco/ and set the interaction query argument to match the weight file (the other arguments are already set in the Makefile). Our evaluation code follows the exact implementation of the official python V-COCO evaluation. You can test the weights with the command below (e.g., the weight file named q16.pth denotes that the model uses 16 interaction queries). Note that you must pass the exact same temperature value that you used during training.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env vcoco_main.py \
    --batch_size 2 \
    --HOIDet \
    --share_enc \
    --pretrained_dec \
    --num_hoi_queries [:query_num] \
    --temperature 0.05 \
    --object_threshold 0 \
    --no_aux_loss \
    --eval \
    --dataset_file vcoco \
    --data_path v-coco \
    --resume checkpoints/vcoco/[:query_num].pth
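
If you are unsure which query number or temperature a checkpoint was trained with, you can try inspecting the checkpoint directly. The sketch below assumes the checkpoint stores its training arguments under an 'args' key, as is common in DETR-style repositories; adjust the key if your checkpoint differs.

import torch

ckpt = torch.load('checkpoints/vcoco/q16.pth', map_location='cpu')
train_args = ckpt.get('args', None)  # assumed key; may differ
if train_args is not None:
    print('num_hoi_queries:', getattr(train_args, 'num_hoi_queries', 'unknown'))
    print('temperature    :', getattr(train_args, 'temperature', 'unknown'))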

Running the test command above will produce output like the following:

[Logger] Number of params:  51181950
Evaluation Inference (V-COCO)  [308/308]  eta: 0:00:00    time: 0.2063  data: 0.0127  max mem: 1578
[stats] Total Time (test) : 0:01:05 (0.2114 s / it)
[stats] HOI Recognition Time (avg) : 0.5221 ms
[stats] Distributed Gathering Time : 0:00:49
[stats] Score Matrix Generation completed

============= AP (Role scenario_1) ==============
               hold_obj: AP = 48.99 (#pos = 3608)
              sit_instr: AP = 47.81 (#pos = 1916)
             ride_instr: AP = 67.04 (#pos = 556)
               look_obj: AP = 40.57 (#pos = 3347)
              hit_instr: AP = 76.42 (#pos = 349)
                hit_obj: AP = 71.27 (#pos = 349)
                eat_obj: AP = 55.75 (#pos = 521)
              eat_instr: AP = 67.57 (#pos = 521)
             jump_instr: AP = 71.44 (#pos = 635)
              lay_instr: AP = 57.09 (#pos = 387)
    talk_on_phone_instr: AP = 49.07 (#pos = 285)
              carry_obj: AP = 34.75 (#pos = 472)
              throw_obj: AP = 52.37 (#pos = 244)
              catch_obj: AP = 48.80 (#pos = 246)
              cut_instr: AP = 49.58 (#pos = 269)
                cut_obj: AP = 57.02 (#pos = 269)
 work_on_computer_instr: AP = 67.44 (#pos = 410)
              ski_instr: AP = 49.35 (#pos = 424)
             surf_instr: AP = 77.07 (#pos = 486)
       skateboard_instr: AP = 86.44 (#pos = 417)
            drink_instr: AP = 38.67 (#pos = 82)
               kick_obj: AP = 73.92 (#pos = 180)
               read_obj: AP = 44.81 (#pos = 111)
        snowboard_instr: AP = 81.25 (#pos = 277)
| mAP(role scenario_1): 58.94
----------------------------------------------------

The HOI recognition time is computed as the end-to-end inference time minus the object detection time.
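
As a rough sketch of how such a split timing could be measured (the module names model.detector and model.hoi_head are hypothetical, not the repository's API):

import time
import torch

@torch.no_grad()
def timed_inference(model, images):
    torch.cuda.synchronize(); t0 = time.time()
    detections = model.detector(images)        # object detection (shared encoder + instance decoder)
    torch.cuda.synchronize(); t1 = time.time()
    hoi_outputs = model.hoi_head(detections)   # HOI recognition (interaction decoder + recomposition)
    torch.cuda.synchronize(); t2 = time.time()
    return hoi_outputs, t2 - t0, t2 - t1       # end-to-end time, HOI recognition time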

5. Auxiliary Loss

HOTR follows the auxiliary loss scheme of DETR, where the loss between the ground truth and the output of each decoder layer is also computed. The auxiliary outputs are matched to the ground-truth HOI triplets with our proposed Hungarian matcher.
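
As a hedged sketch (not the repository's exact code), the total loss can be thought of as the matched loss on the final decoder layer plus the same criterion applied to every intermediate layer's output; outputs['aux_outputs'], criterion, and weight_dict below are assumptions modeled after DETR-style training loops.

def compute_total_loss(outputs, targets, criterion, weight_dict):
    # Loss on the final decoder layer (criterion runs the Hungarian matcher internally).
    losses = criterion(outputs, targets)
    total = sum(losses[k] * weight_dict[k] for k in losses if k in weight_dict)

    # Auxiliary losses: each intermediate decoder layer is matched to the same
    # ground-truth HOI triplets and penalized with the same criterion.
    for aux in outputs.get('aux_outputs', []):
        aux_losses = criterion(aux, targets)
        total = total + sum(aux_losses[k] * weight_dict[k] for k in aux_losses if k in weight_dict)
    return total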

6. Temperature Hyperparameter, tau

Based on our experimental results, the best temperature hyperparameter depends on the number of interaction queries, the coefficients for the index loss and index cost, and the number of decoder layers. Empirically, a larger number of queries requires a larger tau, and smaller coefficients for the HO Pointer loss and cost require a smaller tau (e.g., for 16 interaction queries, tau=0.05 with the default set_cost_idx=1, hoi_idx_loss_coef=1, hoi_act_loss_coef=10 gives the best result). The initial version of HOTR (55.2 mAP) was trained with 100 queries, which required a larger tau (tau=0.1). Depending on these three factors, values of tau other than the ones used in our paper may yield better results. Feel free to explore yourself!
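
To make the interaction between tau and the HO Pointer loss concrete, here is a hedged sketch (the function and argument names are ours, not the repo's): the pointer distribution is a softmax over similarities scaled by 1/tau, and the index loss is a cross-entropy against the matched ground-truth instance index, so a smaller tau sharpens the distribution and changes the effective scale of hoi_idx_loss_coef.

import torch
import torch.nn.functional as F

def pointer_index_loss(ptr, inst_repr, gt_index, tau):
    # ptr: [num_hoi, d] pointer embeddings, inst_repr: [num_inst, d], gt_index: [num_hoi] matched targets
    sim = F.normalize(ptr, dim=-1) @ F.normalize(inst_repr, dim=-1).t()   # cosine similarities
    return F.cross_entropy(sim / tau, gt_index)

# e.g., the default V-COCO setting mentioned above:
# tau = 0.05 with set_cost_idx = 1, hoi_idx_loss_coef = 1, hoi_act_loss_coef = 10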

7. Citation

If you find this code helpful for your research, please cite our paper.

@inproceedings{kim2021hotr,
  title     = {HOTR: End-to-End Human-Object Interaction Detection with Transformers},
  author    = {Bumsoo Kim and
               Junhyun Lee and
               Jaewoo Kang and
               Eun-Sol Kim and
               Hyunwoo J. Kim},
  booktitle = {CVPR},
  publisher = {IEEE},
  year      = {2021}
}

8. Contact for Issues

Bumsoo Kim, [email protected]

9. License

This project is licensed under the terms of the Apache License 2.0. Copyright 2021 Kakao Brain Corp. https://www.kakaobrain.com All Rights Reserved.

Comments
  • output of hoi matcher

    Hello, I spent some time reading the code of hotr_matcher.py, but I couldn't understand it clearly. When I debug the code, the output of hotr_matcher.py is [(tensor([1]), tensor([0])), (tensor([4]), tensor([0]))] with bs=2. Could you explain the meaning of this output? What does each element represent? Thanks very much, looking forward to your reply!

    opened by SISTMrL 4
  • demo code

    Dear author: Thanks for sharing the training and validation code. A lot of researchers, just like me, want to quickly try your work and test it on custom images. Would you kindly share demo code for running inference on a single image? Thank you very much.

    opened by dragen1860 3
  • pretrained detr on hicodet

    Hello, have you tried fine-tuning the whole DETR structure on the HICO-DET dataset? I want to know the performance gap compared with DETR without fine-tuning.

    opened by SISTMrL 2
  • file not found error

    Hello, when I try to reproduce the HICO-DET results of HOTR, I encounter the following error: FileNotFoundError: [Errno 2] No such file or directory: 'hico_20160224_det/list_action.txt'

    Could you provide the list_action.txt file? Thanks! I am using a copy of the HICO-DET dataset I downloaded before, not the one from your repo.

    looking forward to your reply, thanks!

    opened by SISTMrL 2
  • the question of loss log

    Hello, I read the loss-calculation code. You pass loss_value and loss_dict_reduced_scaled to metric_logger as input, but your metric_logger class is difficult for me to understand, so I have three questions about the loss log (see the attached figure).

    1. I see each loss has two values, such as loss_act: 1.3329 (1.5253); what is the meaning of 1.3329 and 1.5253?
    2. loss_value is the sum of loss_dict_reduced_scaled's values, but when I add up each loss myself (as shown in the attached figure), the sum 21.6664 != 24.1092, while the other sums are correct. Can you explain this?
    3. When I look at the loss during training, which loss should I pay attention to: the one inside the parentheses or the one outside the parentheses?

    Looking forward to your reply! Thanks!

    opened by SISTMrL 2
  • Conversion to ONNX

    While trying to convert the PyTorch model to ONNX, I face this issue: RuntimeError: output 1 (0.209699 [ CPUDoubleType{} ]) of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.

    Could you check whether there is an issue, and if so, could you suggest a fix?

    opened by Sanidhya27 1
  • can not download hico-det annotation files

    Thanks for your great work. The HICO-DET dataset's "annotation files" download link is the same as the "HICO-DET" link, so the annotation files cannot be downloaded.

    opened by JackWhite-rwx 1
  • cope with zero-hoi image sample?

    Dear author: I noticed that there is at least one HOI annotation for each image in the HICO dataset, and when I try to reduce the number of HOI categories, some images end up without any HOI annotation. Then, during training, the matcher triggers errors. I wonder how to deal with zero-HOI images when training your model. I suppose the zero-HOI sample setting is common in real scenarios. Thank you.

    opened by dragen1860 1
  • CUDA out of memory problem

    Thanks for your nice work. When evaluating HOTR on the V-COCO dataset (vcoco_multi_train) on a server with 8 GeForce RTX 2080 Ti cards, I encountered a CUDA out-of-memory problem.

    File "/project/HOI/HOTR-main/hotr/engine/evaluator_vcoco.py", line 53, in vcoco_evaluate gather_res = utils.all_gather(res) File "/project/HOI/HOTR-main/hotr/util/misc.py", line 129, in all_gather data_list.append(pickle.loads(buffer)) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/storage.py", line 141, in _load_from_bytes return torch.load(io.BytesIO(b)) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 595, in load return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 774, in _legacy_load result = unpickler.load() File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 730, in persistent_load deserialized_objects[root_key] = restore_location(obj, location) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location result = fn(storage, location) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 155, in _cuda_deserialize return storage_type(obj.size()) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/cuda/init.py", line 462, in _lazy_new return super(_CudaBase, cls).new(cls, *args, **kwargs) RuntimeError: CUDA error: out of memory

    This problem seems to happen on line 53 of evaluator_vcoco.py, utils.all_gather(res). Any suggestions on how to solve it? Thanks a lot.

    opened by GWwangshuo 1
  • parameter temperature

    Hello, in your readme you said the temperature is related to the number of decoder layers, but you don't tell us how to adjust the temperature according to the number of decoder layers. Looking forward to your reply, thanks!

    opened by SISTMrL 1
  • HOI for real scenario deployment

    Dear author: I tested some influential HOI algorithms, such as iCAN, and found that these algorithms may perform well on public datasets. However, when transferred to real-scenario images, such as photos taken by my mobile phone, the performance dropped severely. My team is trying to make HOI work in some real scenarios, so I would like to ask you to recommend a neat and effective algorithm to use. Which algorithm would you recommend? Thank you.

    opened by dragen1860 1
  • Visualised predictions

    Hello and thank you for your great work!

    I am currently training your model with custom data, but I would like to run the predictions on a test data set and visualise them.

    In a previous post you said that you would soon provide the code for this. However, I haven't found anything in the repository yet. If the code is not yet available, can you tell me how and where I can use the predictions to visualise them on test images?

    opened by Gaussianer 0
  • HOTR for Custom Data

    Hello, I am currently writing my master's thesis in the field of HOI detection. I would like to use custom data for this. However, I still lack any clues on how to annotate this data. Can you recommend a tool for it? I would also like to investigate HOTR in more detail in my thesis and write a paper about it. I would be very happy to receive a response.

    Best regards

    opened by Gaussianer 1
  • conda activate

    Dear @meliketoy, I'm trying to run the demo code. In the environment setup, there is no conda activate kakaobrain line. Could you add that line? It would help junior developers like me. Thanks!

    opened by saintrealchoi 0