As part of the HAKE project, this repository includes the reproduced SOTA models and the corresponding HAKE-enhanced versions (CVPR 2020).

Overview

HAKE-Action

HAKE-Action (TensorFlow) is a project to open-source the SOTA action understanding studies based on our Human Activity Knowledge Engine. It includes reproduced SOTA models and their HAKE-enhanced versions. HAKE-Action is authored by Yong-Lu Li, Xinpeng Liu, Liang Xu, Cewu Lu. Currently, it is maintained by Yong-Lu Li, Xinpeng Liu and Liang Xu.

News: (2021.10.06) Our extended version of SymNet is accepted by TPAMI! Paper and code are coming soon.

(2021.2.7) Upgraded HAKE-Activity2Vec is released! Images/Videos --> human box + ID + skeleton + part states + action + representation. [Description]

Full demo: [YouTube], [bilibili]

(2021.1.15) Our extended version of TIN (Transferable Interactiveness Network) is accepted by TPAMI! New paper and code will be released soon.

(2020.10.27) The code of IDN (Paper) in NeurIPS'20 is released!

(2020.6.16) Our larger version HAKE-Large (>120K images, activity and part state labels) is released!

We released the HAKE-HICO (image-level part state labels upon HICO) and HAKE-HICO-DET (instance-level part state labels upon HICO-DET). The corresponding data can be found here: HAKE Data.

  • Paper is here.
  • More data and part states (e.g., upon AVA, more kinds of action categories, more rare actions...) are coming.
  • We will keep updating HAKE-Action to include more SOTA models and their HAKE-enhanced versions.

Data Mode

  • HAKE-HICO (PaStaNet* mode in the paper): image-level. We add the aggregation of all part states in an image (belonging to one or multiple active persons). Compared with the original HICO, the only additional labels are image-level human body part states.

  • HAKE-HICO-DET (PaStaNet* in the paper): instance-level. We add part states for each annotated person in all images of HICO-DET; the only additional labels are instance-level human body part states.

  • HAKE-Large (PaStaNet in the paper): contains more than 120K images with action labels and the corresponding part state labels. The images come from existing action datasets and crowdsourcing. We manually annotated all the active persons with our novel part-level semantics.

  • GT-HAKE (GT-PaStaNet* in the paper): GT-HAKE-HICO and GT-HAKE-HICO-DET. Here we use the part state labels as the part state predictions, i.e., we assume we can perfectly estimate the body part states of a person and then use them to infer the instance activities. This mode can be seen as the upper bound of our HAKE-Action. From the results below we can see that this upper bound is far beyond the SOTA performance. Thus, besides the conventional instance-level methods, continuing to promote part-level methods based on HAKE is a very promising direction (a minimal sketch of this mode follows below).
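
A minimal Python sketch of this GT mode, using illustrative names and shapes rather than the actual interfaces of this repo: the predicted part-state probabilities from Activity2Vec are simply replaced by the ground-truth binary labels before action inference.

    import numpy as np

    # GT-HAKE / GT-PaStaNet* mode: feed ground-truth part states instead of
    # predictions to the downstream action-inference head (upper bound).
    def part_state_input(pred_part_states, gt_part_states=None, use_gt=False):
        """Return the part-state vector used for action inference.

        pred_part_states: (num_part_states,) predicted probabilities.
        gt_part_states:   same shape, binary ground-truth labels.
        """
        if use_gt and gt_part_states is not None:
            return gt_part_states.astype(np.float32)   # perfect part states
        return pred_part_states                        # normal HAKE mode

    # Example: perfect part states vs. noisy predictions.
    gt = np.array([1, 0, 0, 1, 0], dtype=np.float32)
    pred = np.array([0.7, 0.2, 0.4, 0.6, 0.1], dtype=np.float32)
    print(part_state_input(pred, gt, use_gt=True))     # -> [1. 0. 0. 1. 0.]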

Note

Activity2Vec and PaSta-R are our part-state-based modules, which perform action inference based on part semantics rather than the instance semantics used in previous work. For example, Pairwise + HAKE-HICO pre-trained Activity2Vec + Linear PaSta-R (the seventh row in the HICO table below) achieves 45.9 mAP on HICO. More details can be found in our CVPR 2020 paper: PaStaNet: Toward Human Activity Knowledge Engine.
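
For intuition, here is a simplified late-fusion sketch in Python: a Linear PaSta-R head on top of an assumed Activity2Vec representation produces part-level activity scores, which are then combined with the scores of an instance-level model (e.g., Pairwise or TIN). All dimensions, weights, and the fusion rule are illustrative assumptions; the released Generate_detection.py implements its own weighted-sum scheme.

    import numpy as np

    NUM_ACTIVITIES = 600                      # HICO activity categories
    A2V_DIM = 4096                            # assumed Activity2Vec feature size

    rng = np.random.default_rng(0)
    a2v_feature = rng.standard_normal(A2V_DIM)         # part-state based representation

    # Linear PaSta-R: a single fully-connected layer on top of Activity2Vec.
    W = rng.standard_normal((NUM_ACTIVITIES, A2V_DIM)) * 0.01
    b = np.zeros(NUM_ACTIVITIES)
    pasta_scores = 1.0 / (1.0 + np.exp(-(W @ a2v_feature + b)))   # part-level scores

    instance_scores = rng.random(NUM_ACTIVITIES)        # e.g. from Pairwise or TIN

    final_scores = instance_scores + pasta_scores       # simple late fusion
    print(final_scores.shape)                           # (600,)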

Code

The two versions of HAKE-Action (image-level and instance-level) are released in two branches of this repo.

Models on HICO

| Instance-level | +Activity2Vec | +PaSta-R | mAP | Few@1 | Few@5 | Few@10 |
|---|---|---|---|---|---|---|
| R*CNN | - | - | 28.5 | - | - | - |
| Girdhar et al. | - | - | 34.6 | - | - | - |
| Mallya et al. | - | - | 36.1 | - | - | - |
| Pairwise | - | - | 39.9 | 13.0 | 19.8 | 22.3 |
| - | HAKE-HICO | Linear | 44.5 | 26.9 | 30.0 | 30.7 |
| Mallya et al. | HAKE-HICO | Linear | 45.0 | 26.5 | 29.1 | 30.3 |
| Pairwise | HAKE-HICO | Linear | 45.9 | 26.2 | 30.6 | 31.8 |
| Pairwise | HAKE-HICO | MLP | 45.6 | 26.0 | 30.8 | 31.9 |
| Pairwise | HAKE-HICO | GCN | 45.6 | 25.2 | 30.0 | 31.4 |
| Pairwise | HAKE-HICO | Seq | 45.9 | 25.3 | 30.2 | 31.6 |
| Pairwise | HAKE-HICO | Tree | 45.8 | 24.9 | 30.3 | 31.8 |
| Pairwise | HAKE-Large | Linear | 46.3 | 24.7 | 31.8 | 33.1 |
| Pairwise | GT-HAKE-HICO | Linear | 65.6 | 47.5 | 55.4 | 56.6 |

Models on HICO-DET

Using Object Detections from iCAN

| Instance-level | +Activity2Vec | +PaSta-R | Full (def) | Rare (def) | Non-Rare (def) | Full (ko) | Rare (ko) | Non-Rare (ko) |
|---|---|---|---|---|---|---|---|---|
| iCAN | - | - | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73 |
| TIN | - | - | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26 |
| iCAN | HAKE-HICO-DET | Linear | 19.61 | 17.29 | 20.30 | 22.10 | 20.46 | 22.59 |
| TIN | HAKE-HICO-DET | Linear | 22.12 | 20.19 | 22.69 | 24.06 | 22.19 | 24.62 |
| TIN | HAKE-Large | Linear | 22.65 | 21.17 | 23.09 | 24.53 | 23.00 | 24.99 |
| TIN | GT-HAKE-HICO-DET | Linear | 34.86 | 42.83 | 32.48 | 35.59 | 42.94 | 33.40 |

Models on AVA (Frame-based)

| Method | +Activity2Vec | +PaSta-R | mAP |
|---|---|---|---|
| AVA-TF-Baseline | - | - | 11.4 |
| LFB-Res-50-baseline | - | - | 22.2 |
| LFB-Res-101-baseline | - | - | 23.3 |
| AVA-TF-Baseline | HAKE-Large | Linear | 15.6 |
| LFB-Res-50-baseline | HAKE-Large | Linear | 23.4 |
| LFB-Res-101-baseline | HAKE-Large | Linear | 24.3 |

Models on V-COCO

| Method | +Activity2Vec | +PaSta-R | AP(role), Scenario 1 | AP(role), Scenario 2 |
|---|---|---|---|---|
| iCAN | - | - | 45.3 | 52.4 |
| TIN | - | - | 47.8 | 54.2 |
| iCAN | HAKE-Large | Linear | 49.2 | 55.6 |
| TIN | HAKE-Large | Linear | 51.0 | 57.5 |

Training Details

We first pre-train Activity2Vec and PaSta-R with the activity and PaSta labels. Then we change the last FC layer in PaSta-R to fit the activity categories of the target dataset. Finally, we freeze Activity2Vec and fine-tune PaSta-R on the training set of the target dataset. Here, HAKE works like ImageNet, and Activity2Vec serves as a pre-trained knowledge engine to promote other tasks.
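
Below is a minimal Keras sketch of this recipe, under stated assumptions: a stand-in module plays the role of the pre-trained Activity2Vec, and 600 is the number of activity categories of an assumed target dataset (e.g., HICO). The actual repo ships its own TensorFlow training scripts.

    import tensorflow as tf

    NUM_TARGET_CLASSES = 600                 # activity categories of the target dataset
    FEATURE_DIM = 2048                       # assumed input feature size

    # Stand-in for the pre-trained Activity2Vec module.
    activity2vec = tf.keras.Sequential(
        [tf.keras.layers.Dense(1024, activation="relu", input_shape=(FEATURE_DIM,))],
        name="activity2vec",
    )
    activity2vec.trainable = False           # freeze Activity2Vec

    inputs = tf.keras.Input(shape=(FEATURE_DIM,))
    features = activity2vec(inputs, training=False)
    # Replace the last FC of PaSta-R so it matches the target activity categories,
    # then fine-tune only this head on the training set of the target dataset.
    outputs = tf.keras.layers.Dense(NUM_TARGET_CLASSES, activation="sigmoid",
                                    name="pasta_r_fc")(features)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
    # model.fit(target_train_dataset, epochs=...)   # fine-tune PaSta-R only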

Citation

If you find our work useful, please consider citing:

@inproceedings{li2020pastanet,
  title={PaStaNet: Toward Human Activity Knowledge Engine},
  author={Li, Yong-Lu and Xu, Liang and Liu, Xinpeng and Huang, Xijie and Xu, Yue and Wang, Shiyi and Fang, Hao-Shu and Ma, Ze and Chen, Mingyang and Lu, Cewu},
  booktitle={CVPR},
  year={2020}
}
@inproceedings{li2019transferable,
  title={Transferable Interactiveness Knowledge for Human-Object Interaction Detection},
  author={Li, Yong-Lu and Zhou, Siyuan and Huang, Xijie and Xu, Liang and Ma, Ze and Fang, Hao-Shu and Wang, Yanfeng and Lu, Cewu},
  booktitle={CVPR},
  year={2019}
}
@inproceedings{lu2018beyond,
  title={Beyond holistic object recognition: Enriching image understanding with part states},
  author={Lu, Cewu and Su, Hao and Li, Yonglu and Lu, Yongyi and Yi, Li and Tang, Chi-Keung and Guibas, Leonidas J},
  booktitle={CVPR},
  year={2018}
}

HAKE

HAKE [website] is a new large-scale knowledge base and engine for human activity understanding. HAKE provides elaborate and abundant body part state labels for active human instances in a large number of images and videos. With HAKE, we boost action understanding performance on widely-used human activity benchmarks. We are still enlarging and enriching it, and look forward to working with outstanding researchers around the world on its applications and further improvements. If you have any advice or interest, please feel free to contact Yong-Lu Li ([email protected]).

If you run into any problems or find any bugs, don't hesitate to comment on GitHub or make a pull request!

HAKE-Action is freely available for non-commercial use and may be redistributed under these conditions. For commercial queries, please drop us an e-mail; we will send you the detailed agreement.

Comments
  • For the TIN+PaStaNet*-Linear results, is joint training of TIN and PaSta required?

    This project mainly pre-trains and fine-tunes the A2V network, while Generate_detection.py directly imports the TIN detection results and fuses the TIN and A2V outputs with the statement sHO = (((sH + sO) * ssp + sP + sA + sL) * score_I[im_index[keys[obj_index]], x:y]) * hod. I have three questions about this:

    1: Does this require joint training of TIN and the A2V net, or is it enough to train them independently and simply fuse their scores here?

    2: After removing sP + sA + sL from the sHO fusion formula (i.e., using only the TIN detection results), I found that the performance on HICO is still very good, only one or two points below TIN+PaStaNet. Does this mean that the A2V net does not contribute much to the detection results? ('Default: ', 0.22116964775393347) ('Default rare: ', 0.20193286258301021) ('Default non-rare: ', 0.22691570046732615) ('Known object: ', 0.24059134635026103) ('Known object, rare: ', 0.2219354706179083) ('Known object, non-rare: ', 0.24616388065992487)

    3: In Generate_detection.py, score_P, score_A and score_L are each divided by their own fac and then converted to probabilities through a sigmoid. Since score_P, score_A and score_L are already probabilities when taken out of A2V, what is the purpose of dividing by fac, and why apply the sigmoid again?

    score_P[obj_index][:, hoi_index - x] /= P_fac[hoi_index]
    score_A[obj_index][:, hoi_index - x] /= A_fac[hoi_index]
    score_L[obj_index][:, hoi_index - x] /= L_fac[hoi_index]

    sP = torch.sigmoid(torch.from_numpy(score_P[obj_index])).cpu().numpy() * P_weight
    sA = torch.sigmoid(torch.from_numpy(score_A[obj_index])).cpu().numpy() * A_weight
    sL = torch.sigmoid(torch.from_numpy(score_L[obj_index])).cpu().numpy() * L_weight

    Looking forward to your reply!

    opened by Shunli-Wang 9
  • Instance-level HOI Demo

    Hi,

    It seems that this code does not include the human part or object detection part. I don't understand why iCAN is used as the object detector, since iCAN itself seems to use Faster R-CNN for object detection.

    Is there any way to input an image and output the corresponding instance-level human interaction behavior as a demo?

    Thanks, Tairan Chen

    opened by chentairan 9
  • Do I need P_fac, A_fac, L_fac for image-level output?

    I noticed in Generate_detection.py that P_fac, A_fac and L_fac are used to scale score_P, score_A and score_L, and that a sigmoid function is applied when computing them. But for the image-level output there is no sigmoid or scaling. Do I need them there as well?

    opened by whqwill 7
  • "Error parsing text-format caffe.NetParameter" in Image-level-HAKE-Action

    When I execute python scripts/train.py --pasta-mode linear in the Docker container, language_model.log is generated. I checked this file and found some problems:

    I0916 11:37:38.907941   553 caffe.cpp:217] Using GPUs 0
    I0916 11:37:38.949647   553 caffe.cpp:222] GPU 0: GeForce GTX 1080 Ti
    I0916 11:37:39.380765   553 solver.cpp:51] Initializing solver from parameters: 
    test_iter: 2000
    test_interval: 1000000
    base_lr: 1e-05
    display: 20
    max_iter: 600000
    lr_policy: "cosine"
    gamma: 0.1
    momentum: 0.9
    weight_decay: 0.0005
    stepsize: 30000
    snapshot: 10000
    snapshot_prefix: "snaps/language-model"
    device_id: 0
    net: "models/Language-Models/train_language-model.prototxt"
    train_state {
      level: 0
      stage: ""
    }
    test_initialization: false
    average_loss: 100
    iter_size: 16
    snapshot_format: HDF5
    train_size: 38116
    te: 2
    multfactor: 2
    tt: 2.05884722928
    tenext: 2
    I0916 11:37:39.381000   553 solver.cpp:94] Creating training net from net file: models/Language-Models/train_language-model.prototxt
    [libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 3124:7: Message type "caffe.LayerParameter" has no field named "layer".
    F0916 11:37:39.382678   553 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: models/Language-Models/train_language-model.prototxt
    *** Check failure stack trace: ***
        @     0x7feae92775cd  google::LogMessage::Fail()
        @     0x7feae9279433  google::LogMessage::SendToLog()
        @     0x7feae927715b  google::LogMessage::Flush()
        @     0x7feae9279e1e  google::LogMessageFatal::~LogMessageFatal()
        @     0x7feae9b77ac1  caffe::ReadNetParamsFromTextFileOrDie()
        @     0x7feae996e6eb  caffe::Solver<>::InitTrainNet()
        @     0x7feae996f9c7  caffe::Solver<>::Init()
        @     0x7feae996fd6a  caffe::Solver<>::Solver()
        @     0x7feae995a6c3  caffe::Creator_SGDSolver<>()
        @           0x40afb9  train()
        @           0x4077c8  main
        @     0x7feae7a0d830  __libc_start_main
        @           0x408099  _start
        @              (nil)  (unknown)
    

    I googled this problem; it seems to be caused by a Caffe version mismatch. So I want to ask which version of Caffe this project uses. My Caffe version is:

    >>> caffe.__version__
    '1.0.0-rc3'
    
    opened by LSC333 6
  • Weird results: obtaining results similar to the reported ones without training...

    Thanks for your released code. However, I face a strange problem: HAKE-Action seems not to need any training. I run the code with the command:

    python tools/Train_pasta_HICO_DET.py --data 0 --init_weight 1 --train_module 2 --num_iteration 11 --model test_pastanet

    Then, I test the model 'test_pastanet/HOI_iter_10.ckpt'.

    Unbelievably! The result is:

    Default:  0.2194696378436301                                                                                                                                                                                       
    Default rare:  0.20412592656443512                                                                                                                                                                                 
    Default non-rare:  0.22405282432962342                                                                                                           
    Known object:  0.2382635240705128                                                                                                               
    Known object, rare:  0.22197228395195176                                                                                                                                                                           
    Known object, non-rare:  0.24312973865138168
    

    Have you ever tested your code like this? This result is ....... I'm now trying to test snapshot 1. I did not change the code.

    opened by xxxzhi 6
  • Does the iCAN result contain human-object interactiveness relations?

    Does the iCAN detection result contain human-object relations? If so, you wouldn't need to estimate which object a person interacts with, only which part interacts with that object?

    opened by xiadingZ 6
  • Which test file is used in the MATLAB evaluation part of the image-level branch?

    Which test file is used in the MATLAB evaluation part of the image-level branch? The final result I get from testing is linear-final.csv or fused.csv, but the annotation label format in the MATLAB code is 1, 0, -1 (I also don't understand what each value means), while linear-final.csv and fused.csv contain 600-dimensional arrays that look like activation values, not the 1/0/-1 format. So how do I get 1, 0, -1 for the final results?

    opened by whqwill 5
  • TIN scores format?

    I have been able to run PaStaNet to generate bboxes, hdet, keys, odet, scores_A, scores_L, and scores_P on custom images, but in the Generate_detection.py file there are a bunch of TIN scores used to generate the detections:

    score_H = pickle.load(open('TIN/score_H.pkl', 'rb'))
    score_O = pickle.load(open('TIN/score_O.pkl', 'rb'))
    score_sp = pickle.load(open('TIN/score_sp.pkl', 'rb'))
    hdet = pickle.load(open('TIN/hdet.pkl', 'rb'))
    odet = pickle.load(open('TIN/odet.pkl', 'rb'))
    keys = pickle.load(open('TIN/keys.pkl', 'rb'))
    pos = pickle.load(open('TIN/pos.pkl', 'rb'))
    neg = pickle.load(open('TIN/neg.pkl', 'rb'))
    bboxes = pickle.load(open('TIN/bboxes.pkl', 'rb'))

    There is also a score_I pulled from the hico_caffe600.h5 file, a key_mapping.pkl, and the generation args file.

    I am hoping y'all can release a description of how to generate these files / the formatting, or release the code used to generate these from TIN results / any other necessary sources. I already have TIN running but am unsure of how to format the results for Generate_detection.py. Thank you!

    opened by anon112233 5
  • A question about the test stage of the Instance-level branch

    Thanks for your great work on both HOI recognition and detection! I downloaded your code and tried to reproduce the results in the paper, but I am confused about the details in lines 75-77 of -Results/Generate_detection.py:

    sP = torch.sigmoid(torch.from_numpy(score_P[obj_index]).cuda()).cpu().numpy() * P_weight
    sA = torch.sigmoid(torch.from_numpy(score_A[obj_index]).cuda()).cpu().numpy() * A_weight
    sL = torch.sigmoid(torch.from_numpy(score_L[obj_index]).cuda()).cpu().numpy() * L_weight
    

    As shown, in your code sP, sA and sL are multiplied by P_weight, A_weight and L_weight respectively. The three weights are defined in line 42:

    h_fac, o_fac, sp_fac, P_fac, A_fac, L_fac, hthresh, othresh, athresh, bthresh, P_weight, A_weight, L_weight = pickle.load(open('generation_args.pkl', 'rb'))
    

    I carefully checked the content of generation_args.pkl and found that only P_weight has the value 1.0, while A_weight and L_weight are both 1e-5. After I removed sA and sL from the final score, the mAP did not decrease, which means that with a weight of 1e-5 neither the Attention branch nor the Language branch contributes to the final results. Is 1e-5 the best weight you found, or did I do something wrong? This question matters because reproducing your ablation study strongly relies on this part.

    opened by hwfan 5
  • Part Boxes Issue in `Test_all_part.pkl` file

    When I run the testing command of HICO-DET in Instance-level-detection, I checked the Test_all_part.pkl file carefully and found that all part boxes in this file are the same. Something must have gone wrong when running PaStaNet. How does this happen?

    Here is a screenshot: [image]

    opened by Shunli-Wang 4
  • Why only ii == 3 in fused.py?

    In fused.py:

    for ii in range(1, len(sys.argv)):
        print sys.argv[ii]
        a1 = np.loadtxt(sys.argv[ii], delimiter = ',', dtype = str, usecols=range(600))
        arr1 = a1.astype(np.float)
        for i in range(arr1.shape[0]):
            for j in range(arr1.shape[1]):
                if ii == 3:
                    global_array[i][j] += arr1[i][j] * 2

    Why only ii == 3 here? It seems it only uses the pairwise.csv result.

    opened by whqwill 4