Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Overview


Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation)

HOTR: End-to-End Human-Object Interaction Detection with Transformers

HOTR is a novel framework that directly predicts a set of {human, object, interaction} triplets from an image using a transformer-based encoder-decoder. Through set-level prediction, our method effectively exploits the inherent semantic relationships in an image and does not require the time-consuming post-processing that is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.

HOTR is composed of three main components: a shared encoder with a CNN backbone, a parallel decoder, and a recomposition layer that generates the final HOI triplets. An overview of our pipeline is presented below.
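
For illustration only, here is a minimal sketch of the recomposition idea: each interaction representation from the parallel decoder points to its human and object instance representations via a temperature-scaled softmax over similarities, and is classified into actions. The tensor names, FFN heads, and normalization below are assumptions made for this sketch, not the repository's actual implementation.

import torch
import torch.nn.functional as F

def recompose(inst_repr, hoi_repr, ffn_h, ffn_o, action_head, tau=0.05):
    # inst_repr: [num_inst_queries, d] instance representations (instance decoder)
    # hoi_repr:  [num_hoi_queries, d]  interaction representations (interaction decoder)
    # ffn_h / ffn_o: small heads projecting interaction representations into pointer space (assumed)
    # action_head: linear multi-label action classifier (assumed)
    inst = F.normalize(inst_repr, dim=-1)
    h_ptr = F.normalize(ffn_h(hoi_repr), dim=-1)            # human pointers
    o_ptr = F.normalize(ffn_o(hoi_repr), dim=-1)            # object pointers

    # Each HOI query "points" to the most similar instance query for its human / object slot.
    h_prob = ((h_ptr @ inst.t()) / tau).softmax(dim=-1)     # [num_hoi, num_inst]
    o_prob = ((o_ptr @ inst.t()) / tau).softmax(dim=-1)
    actions = action_head(hoi_repr)                         # action logits per HOI query
    return h_prob, o_prob, actions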

1. Environmental Setup

$ conda create -n kakaobrain python=3.7
$ conda activate kakaobrain
$ conda install -c pytorch pytorch torchvision # PyTorch 1.7.1, torchvision 0.8.2, CUDA=11.0
$ conda install cython scipy
$ pip install pycocotools
$ pip install opencv-python
$ pip install wandb

2. HOI dataset setup

Our current version of HOTR supports experiments on the V-COCO dataset. Download the V-COCO dataset under the cloned repository directory.

# V-COCO setup
$ git clone https://github.com/s-gupta/v-coco.git
$ cd v-coco
$ ln -s [:COCO_DIR] coco/images # COCO_DIR contains images of train2014 & val2014
$ python script_pick_annotations.py [:COCO_DIR]/annotations

If you wish to place the v-coco dataset in your own directory, simply change the 'data_path' argument to the directory where you downloaded the v-coco dataset.

--data_path [:your_own_directory]/v-coco

3. How to Train/Test HOTR on V-COCO dataset

For testing, you can either use your own trained weights by passing their path to the 'resume' argument, or use our provided weights. Below is an example of how to edit the Makefile.

# [Makefile]
# Testing your own trained weights
multi_test:
	python -m torch.distributed.launch \
		--nproc_per_node=8 \
		...
		--resume checkpoints/vcoco/KakaoBrain/multi_run_000001/best.pth # the best-performing checkpoint is saved in this format

# Testing our provided trained weights
multi_test:
	python -m torch.distributed.launch \
		--nproc_per_node=8 \
		...
		--resume checkpoints/vcoco/q16.pth # download q16.pth as described below

In order to use our provided weights, you can download them from this link. Then pass the path of the downloaded file (for example, we put the weights under checkpoints/vcoco/q16.pth) to the 'resume' argument.

# multi-gpu training / testing (8 GPUs)
$ make multi_[train/test]

# single-gpu training / testing
$ make single_[train/test]

4. Results

Here, we provide improved results on V-COCO Scenario 1 (58.9 mAP, 0.5 ms) compared to the version in our initial submission (55.2 mAP, 0.9 ms). This is obtained without applying any priors on the scores (see iCAN).

Epoch   # queries   Scenario 1   Scenario 2   Checkpoint
100     16          58.9         63.8         download

If you want to use pretrained weights for inference, download the pretrained weights (from the above link) under checkpoints/vcoco/ and set the interaction query argument to match the weight file (the other arguments are already set in the Makefile). Our evaluation code follows the exact implementation of the official python V-COCO evaluation. You can test the weights with the command below (e.g., the weight file named q16.pth denotes that the model uses 16 interaction queries). Note that you must pass the exact same temperature value that you used during training.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env vcoco_main.py \
    --batch_size 2 \
    --HOIDet \
    --share_enc \
    --pretrained_dec \
    --num_hoi_queries [:query_num] \
    --temperature 0.05 \
    --object_threshold 0 \
    --no_aux_loss \
    --eval \
    --dataset_file vcoco \
    --data_path v-coco \
    --resume checkpoints/vcoco/[:query_num].pth
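
If you are unsure which query number or temperature a checkpoint was trained with, you can try inspecting the checkpoint directly. The sketch below assumes the checkpoint stores its training arguments under an 'args' key, as is common in DETR-style repositories; adjust the key if your checkpoint differs.

import torch

ckpt = torch.load('checkpoints/vcoco/q16.pth', map_location='cpu')
train_args = ckpt.get('args', None)  # assumed key; may differ
if train_args is not None:
    print('num_hoi_queries:', getattr(train_args, 'num_hoi_queries', 'unknown'))
    print('temperature    :', getattr(train_args, 'temperature', 'unknown'))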

Running the test command above will produce output like the following:

[Logger] Number of params:  51181950
Evaluation Inference (V-COCO)  [308/308]  eta: 0:00:00    time: 0.2063  data: 0.0127  max mem: 1578
[stats] Total Time (test) : 0:01:05 (0.2114 s / it)
[stats] HOI Recognition Time (avg) : 0.5221 ms
[stats] Distributed Gathering Time : 0:00:49
[stats] Score Matrix Generation completed

============= AP (Role scenario_1) ==============
               hold_obj: AP = 48.99 (#pos = 3608)
              sit_instr: AP = 47.81 (#pos = 1916)
             ride_instr: AP = 67.04 (#pos = 556)
               look_obj: AP = 40.57 (#pos = 3347)
              hit_instr: AP = 76.42 (#pos = 349)
                hit_obj: AP = 71.27 (#pos = 349)
                eat_obj: AP = 55.75 (#pos = 521)
              eat_instr: AP = 67.57 (#pos = 521)
             jump_instr: AP = 71.44 (#pos = 635)
              lay_instr: AP = 57.09 (#pos = 387)
    talk_on_phone_instr: AP = 49.07 (#pos = 285)
              carry_obj: AP = 34.75 (#pos = 472)
              throw_obj: AP = 52.37 (#pos = 244)
              catch_obj: AP = 48.80 (#pos = 246)
              cut_instr: AP = 49.58 (#pos = 269)
                cut_obj: AP = 57.02 (#pos = 269)
 work_on_computer_instr: AP = 67.44 (#pos = 410)
              ski_instr: AP = 49.35 (#pos = 424)
             surf_instr: AP = 77.07 (#pos = 486)
       skateboard_instr: AP = 86.44 (#pos = 417)
            drink_instr: AP = 38.67 (#pos = 82)
               kick_obj: AP = 73.92 (#pos = 180)
               read_obj: AP = 44.81 (#pos = 111)
        snowboard_instr: AP = 81.25 (#pos = 277)
| mAP(role scenario_1): 58.94
----------------------------------------------------

The HOI recognition time is computed as the end-to-end inference time minus the object detection time.
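
As a rough sketch of how such a split timing could be measured (the module names model.detector and model.hoi_head are hypothetical, not the repository's API):

import time
import torch

@torch.no_grad()
def timed_inference(model, images):
    torch.cuda.synchronize(); t0 = time.time()
    detections = model.detector(images)        # object detection (shared encoder + instance decoder)
    torch.cuda.synchronize(); t1 = time.time()
    hoi_outputs = model.hoi_head(detections)   # HOI recognition (interaction decoder + recomposition)
    torch.cuda.synchronize(); t2 = time.time()
    return hoi_outputs, t2 - t0, t2 - t1       # end-to-end time, HOI recognition time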

5. Auxiliary Loss

HOTR follows the auxiliary loss scheme of DETR, where the loss between the ground truth and the output of each decoder layer is also computed. The auxiliary outputs are matched to the ground-truth HOI triplets with our proposed Hungarian matcher.
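
As a hedged sketch (not the repository's exact code), the total loss can be thought of as the matched loss on the final decoder layer plus the same criterion applied to every intermediate layer's output; outputs['aux_outputs'], criterion, and weight_dict below are assumptions modeled after DETR-style training loops.

def compute_total_loss(outputs, targets, criterion, weight_dict):
    # Loss on the final decoder layer (criterion runs the Hungarian matcher internally).
    losses = criterion(outputs, targets)
    total = sum(losses[k] * weight_dict[k] for k in losses if k in weight_dict)

    # Auxiliary losses: each intermediate decoder layer is matched to the same
    # ground-truth HOI triplets and penalized with the same criterion.
    for aux in outputs.get('aux_outputs', []):
        aux_losses = criterion(aux, targets)
        total = total + sum(aux_losses[k] * weight_dict[k] for k in aux_losses if k in weight_dict)
    return total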

6. Temperature Hyperparameter, tau

Based on our experimental results, the best temperature hyperparameter depends on the number of interaction queries, the coefficients for the index loss and index cost, and the number of decoder layers. Empirically, a larger number of queries requires a larger tau, and smaller coefficients for the HO Pointer loss and cost require a smaller tau (e.g., for 16 interaction queries, tau=0.05 with the default set_cost_idx=1, hoi_idx_loss_coef=1, hoi_act_loss_coef=10 gives the best result). The initial version of HOTR (55.2 mAP) was trained with 100 queries, which required a larger tau (tau=0.1). Depending on these three factors, values of tau other than the ones used in our paper may yield better results. Feel free to explore yourself!
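
To make the interaction between tau and the HO Pointer loss concrete, here is a hedged sketch (the function and argument names are ours, not the repo's): the pointer distribution is a softmax over similarities scaled by 1/tau, and the index loss is a cross-entropy against the matched ground-truth instance index, so a smaller tau sharpens the distribution and changes the effective scale of hoi_idx_loss_coef.

import torch
import torch.nn.functional as F

def pointer_index_loss(ptr, inst_repr, gt_index, tau):
    # ptr: [num_hoi, d] pointer embeddings, inst_repr: [num_inst, d], gt_index: [num_hoi] matched targets
    sim = F.normalize(ptr, dim=-1) @ F.normalize(inst_repr, dim=-1).t()   # cosine similarities
    return F.cross_entropy(sim / tau, gt_index)

# e.g., the default V-COCO setting mentioned above:
# tau = 0.05 with set_cost_idx = 1, hoi_idx_loss_coef = 1, hoi_act_loss_coef = 10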

7. Citation

If you find this code helpful for your research, please cite our paper.

@inproceedings{kim2021hotr,
  title     = {HOTR: End-to-End Human-Object Interaction Detection with Transformers},
  author    = {Bumsoo Kim and
               Junhyun Lee and
               Jaewoo Kang and
               Eun-Sol Kim and
               Hyunwoo J. Kim},
  booktitle = {CVPR},
  publisher = {IEEE},
  year      = {2021}
}

8. Contact for Issues

Bumsoo Kim, [email protected]

9. License

This project is licensed under the terms of the Apache License 2.0. Copyright 2021 Kakao Brain Corp. https://www.kakaobrain.com All Rights Reserved.

Comments
  • output of hoi matcher

    Hello, I spent some time reading the code of hotr_matcher.py, but I couldn't understand it clearly. When I debug the code, the output of hotr_matcher.py is [(tensor([1]), tensor([0])), (tensor([4]), tensor([0]))] with bs=2. Could you explain the meaning of this output? What does each element represent? Thanks very much, looking forward to your reply!

    opened by SISTMrL 4
  • demo code

    Dear author: Thanks for sharing the training and validation code. A lot of researchers, just like me, want to quickly try your work and test it on custom images. Would you kindly share demo code for running inference on a single image? Thank you very much.

    opened by dragen1860 3
  • pretrained detr on hicodet

    Hello, have you tried fine-tuning the whole DETR structure on the HICO-DET dataset? I want to know the performance gap compared with DETR without fine-tuning.

    opened by SISTMrL 2
  • file not found error

    Hello, when I try to reproduce the HICO-DET results of HOTR, I encounter the following error: FileNotFoundError: [Errno 2] No such file or directory: 'hico_20160224_det/list_action.txt'

    Could you provide the list_action.txt file? Thanks! I am using a copy of the HICO-DET dataset I downloaded before, not the one from your repo.

    looking forward to your reply, thanks!

    opened by SISTMrL 2
  • the question of loss log

    Hello, I read the loss-calculation code. You pass loss_value and loss_dict_reduced_scaled to metric_logger as input, but your metric_logger class is difficult for me to understand, so I have three questions about the loss log (see the attached figure).

    1. I see each loss has two values, such as loss_act: 1.3329 (1.5253); what is the meaning of 1.3329 and 1.5253?
    2. loss_value is the sum of loss_dict_reduced_scaled's values, but when I add up each loss myself (as shown in the attached figure), the sum 21.6664 != 24.1092, while the other sums are correct. Can you explain this?
    3. When I look at the loss during training, which loss should I pay attention to: the one inside the parentheses or the one outside the parentheses?

    Looking forward to your reply! Thanks!

    opened by SISTMrL 2
  • Conversion to ONNX

    While trying to convert the PyTorch model to ONNX, I face this issue: RuntimeError: output 1 (0.209699 [ CPUDoubleType{} ]) of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.

    Could you check whether there is an issue, and if so, could you suggest a fix?

    opened by Sanidhya27 1
  • can not download hico-det annotation files

    Thanks for your great work. The HICO-DET dataset's "annotation files" download link is the same as the "HICO-DET" link, so the annotation files cannot be downloaded.

    opened by JackWhite-rwx 1
  • cope with zero-hoi image sample?

    Dear author: I noticed that there is at least one HOI annotation for each image in the HICO dataset, and when I try to reduce the number of HOI categories, some images end up without any HOI annotation. Then, during training, the matcher triggers errors. I wonder how to deal with zero-HOI images when training your model. I suppose the zero-HOI sample setting is common in real scenarios. Thank you.

    opened by dragen1860 1
  • CUDA out of memory problem

    Thanks for your nice work. When evaluating HOTR on the V-COCO dataset (vcoco_multi_train) on a server with 8 GeForce RTX 2080 Ti cards, I encountered a CUDA out-of-memory problem.

    File "/project/HOI/HOTR-main/hotr/engine/evaluator_vcoco.py", line 53, in vcoco_evaluate gather_res = utils.all_gather(res) File "/project/HOI/HOTR-main/hotr/util/misc.py", line 129, in all_gather data_list.append(pickle.loads(buffer)) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/storage.py", line 141, in _load_from_bytes return torch.load(io.BytesIO(b)) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 595, in load return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 774, in _legacy_load result = unpickler.load() File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 730, in persistent_load deserialized_objects[root_key] = restore_location(obj, location) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location result = fn(storage, location) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 155, in _cuda_deserialize return storage_type(obj.size()) File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/cuda/init.py", line 462, in _lazy_new return super(_CudaBase, cls).new(cls, *args, **kwargs) RuntimeError: CUDA error: out of memory

    This problem seems to happen on line 53 of evaluator_vcoco.py, utils.all_gather(res). Any suggestions on how to solve it? Thanks a lot.

    opened by GWwangshuo 1
  • parameter temperature

    Hello, in your readme you said the temperature is related to the number of decoder layers, but you don't tell us how to adjust the temperature according to the number of decoder layers. Looking forward to your reply, thanks!

    opened by SISTMrL 1
  • HOI for real scenario deployment

    Dear author: I tested some influential HOI algorithms, such as iCAN, and found that these algorithms may perform well on public datasets. However, when transferred to real-scenario images, such as photos taken by my mobile phone, the performance dropped severely. My team is trying to make HOI work in some real scenarios, so I would like to ask you to recommend a neat and effective algorithm to use. Which algorithm would you recommend? Thank you.

    opened by dragen1860 1
  • Visualised predictions

    Hello and thank you for your great work!

    I am currently training your model with custom data, but I would like to run the predictions on a test data set and visualise them.

    In a previous post you said that you would soon provide the code for this. However, I haven't found anything in the repository yet. If the code is not yet available, can you tell me how and where I can use the predictions to visualise them on test images?

    opened by Gaussianer 0
  • HOTR for Custom Data

    Hello, I am currently writing my master's thesis in the field of HOI detection. I would like to use custom data for this. However, I still lack any clues on how to annotate this data. Can you recommend a tool for it? I would also like to investigate HOTR in more detail in my thesis and write a paper about it. I would be very happy to receive a response.

    Best regards

    opened by Gaussianer 1
  • conda activate

    Dear @meliketoy, I'm trying to run the demo code. In the environment setup, there is no conda activate kakaobrain line. Could you add that line? It would help junior developers like me. Thanks!

    opened by saintrealchoi 0