End-to-end Temporal Action Detection with Transformer. [Under review]

Xiaolong Liu

Last update: Dec 25, 2022

Overview

TadTR: End-to-end Temporal Action Detection with Transformer

By Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, Xiang Bai.

This repo holds the code for TadTR, described in the technical report "End-to-end Temporal Action Detection with Transformer".

Introduction

TadTR is an end-to-end Temporal Action Detection TRansformer. It has the following advantages over previous methods:

Simple. It adopts a set-prediction pipeline and performs TAD with a single network, without a separate proposal generation stage.
Flexible. It removes hand-crafted designs such as anchor settings and NMS.
Sparse. It produces very sparse detections (e.g. 10 on ActivityNet) and therefore requires less computation.
Strong. As a self-contained temporal action detector, TadTR achieves state-of-the-art performance on HACS and THUMOS14. It is also much stronger than concurrent Transformer-based methods.
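
The set-prediction idea in the list above can be illustrated with a toy one-to-one matching between predicted and ground-truth segments. This is a minimal sketch and not TadTR's actual matcher: the cost uses temporal IoU only, the brute-force search stands in for the Hungarian algorithm used by DETR-style detectors, and the names `temporal_iou` and `match_segments` are hypothetical.

```python
from itertools import permutations


def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_segments(preds, gts):
    """Return (gt_index, pred_index) pairs maximizing the total tIoU.

    Assumes len(preds) >= len(gts); brute force is fine for toy inputs only.
    """
    best_perm, best_score = (), -1.0
    for perm in permutations(range(len(preds)), len(gts)):
        score = sum(temporal_iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(enumerate(best_perm))


# Three predictions, two ground-truth actions (illustrative values).
preds = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
gts = [(21.0, 29.0), (1.0, 9.0)]
pairs = match_segments(preds, gts)
```

Each ground-truth segment is paired with exactly one prediction. In DETR-style training the unmatched predictions (index 2 here) are supervised as background, which is why no NMS step is needed afterwards.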

We're still improving TadTR. Stay tuned for future versions.

Updates

[2021.9.15] Update the performance on THUMOS14.

[2021.9.1] Add demo code.

TODOs

add model code
add inference code
add training code
support training/inference with video input

Main Results

HACS Segments

Method	Feature	mAP@0.5	mAP@0.75	mAP@0.95	Avg. mAP	Model
TadTR	I3D RGB	45.16	30.70	11.78	30.83	[OneDrive]

THUMOS14

Method	Feature	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7	Avg. mAP	Model
TadTR	I3D 2stream	72.92	66.86	58.59	46.31	32.32	55.40	[OneDrive]
TadTR	TSN 2stream	64.24	58.34	50.01	40.79	29.07	48.49	[OneDrive]
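
The Avg. mAP column for THUMOS14 is consistent with a plain mean of the five per-threshold values. A quick check, using only the numbers from the table above (the variable names are illustrative):

```python
# Per-threshold mAP values copied from the THUMOS14 table (tIoU 0.3 to 0.7).
thumos14_map = {
    "I3D 2stream": [72.92, 66.86, 58.59, 46.31, 32.32],
    "TSN 2stream": [64.24, 58.34, 50.01, 40.79, 29.07],
}

# Mean over the five thresholds, rounded to two decimals like the table.
avg_map = {feat: round(sum(v) / len(v), 2) for feat, v in thumos14_map.items()}
print(avg_map)
```

Both means agree with the table (55.40 and 48.49). The HACS Segments and ActivityNet-1.3 averages cannot be checked this way, because those benchmarks conventionally average over ten tIoU thresholds from 0.5 to 0.95 and only three of them are listed.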

ActivityNet-1.3

Method	Feature	mAP@0.5	mAP@0.75	mAP@0.95	Avg. mAP	Model
TadTR+BMN	TSN 2stream	50.51	35.35	8.18	34.55	[OneDrive]

Install

Requirements

Linux, CUDA>=9.2, GCC>=5.4
Python>=3.7
PyTorch>=1.5.1, torchvision>=0.6.1 (following instructions here)
Other requirements
```
pip install -r requirements.txt
```
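
Before compiling the extensions, it can help to confirm that the interpreter and PyTorch meet the minimums listed above. The following is a small sketch; the helper names `parse_version` and `check_environment` are hypothetical and not part of the repo.

```python
import re
import sys

MIN_PYTHON = (3, 7)
MIN_TORCH = (1, 5, 1)


def parse_version(text):
    """Turn a version string such as '1.5.1+cu101' into a tuple like (1, 5, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups(default="0"))


def check_environment():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.7 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if parse_version(torch.__version__) < MIN_TORCH:
            problems.append("PyTorch >= 1.5.1 is required")
        if not torch.cuda.is_available():
            problems.append("CUDA is not visible to PyTorch")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The script prints nothing when every check passes. It does not verify the GCC or CUDA Toolkit versions, which still need to be checked by hand.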

Compiling CUDA extensions

cd models/ops

# If you have multiple CUDA Toolkit installations, prefix the command with
# CUDA_HOME=<your_cuda_toolkit_path> to select the correct version.
python setup.py build_ext --inplace

Run a quick test

python demo.py

Data Preparation

To be updated.

Training

Run the following command

bash scripts/train.sh DATASET

Testing

bash scripts/test.sh DATASET WEIGHTS

Acknowledgement

The code is based on DETR and Deformable DETR. We also borrow the RoIAlign1D implementation from G-TAD. Thanks for their great work.

Citing

@article{liu2021end,
  title={End-to-end Temporal Action Detection with Transformer},
  author={Liu, Xiaolong and Wang, Qimeng and Hu, Yao and Tang, Xu and Bai, Song and Bai, Xiang},
  journal={arXiv preprint arXiv:2106.10271},
  year={2021}
}

Contact

For questions and suggestions, please contact Xiaolong Liu at "liuxl at hust dot edu dot cn".

Comments

Reproducibility of ActivityNet

Hi, first of all, thanks for your great work. I am trying to reproduce your results on ActivityNet. I followed the procedure in your paper, using TSP features and adding some code to the Dataset module. The whole pipeline runs on ActivityNet, but I cannot match the results reported in the paper: my numbers are all about 3-4% lower. Do you plan to open-source the training code for ActivityNet?

opened by yyccli 4

One question about the loss backward of temporal_deform_attn

Thanks for open-sourcing this good work.

However, I ran into a problem:

models/ops/temporal_deform_attn/functions/temporal_deform_attn_func.py", line 40, in backward
    value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, grad_output, ctx.seq2col_step)
RuntimeError: Not implemented

Could you take a look when convenient?

opened by kimsimple 4

how to combine with classifier?

Hi @xlliu7,

Interesting paper! How do you combine your model with a classifier, e.g. PGCN in Table 1? Would you mind sharing the code? Thanks.

opened by wjn922 3

How to generate th14_i3d2s_ft_info.json?

Hello, thank you for your good work! How is th14_i3d2s_ft_info.json generated for the THUMOS14 video features, and how are "feature_length", "feature_second" and "feature_fps" computed for each video?

opened by Gttgithub 2

No training/inference code or weights

Hi! I'm really interested in using this work for action detection - is there any way I could get access to your training scripts and pretrained weights?

opened by linden-li 1

Different lengths of Thumos14 I3D Features

Hi, Xiaolong. I'm very interested in your work. As you mentioned in another issue, you use the I3D features from P-GCN for the THUMOS14 experiment. I find that the features for the same video sometimes have different lengths, so I can't concatenate them directly. The difference is always 1. Have you run into this? If so, how did you deal with it? Thanks!

opened by ZhiqiangFong 1

Modification of focal loss to work with mix-up augmentation?

I'm trying to train on relatively small datasets, and mix-up is one way to reduce overfitting, but focal loss does not seem to be designed to work with soft (probabilistic) labels. This line https://github.com/xlliu7/TadTR/blob/3af0abcb17a20210ddd04d2c7e212a024ea0fedc/models/tadtr.py#L274 seems to be designed specifically for binary classification.

Do you have any idea how to modify focal loss for labels given as probabilities?

opened by rtxbae 0
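
One possible way to make focal loss accept soft (mix-up) targets, offered here as a sketch and not as the repo's implementation or the author's answer, is to weight binary cross-entropy by the distance between target and prediction, as in Quality Focal Loss. The function name `soft_focal_loss` and the default `gamma` are assumptions. With hard 0/1 targets the expression reduces to the standard focal loss without the alpha term.

```python
import math


def soft_focal_loss(prob, target, gamma=2.0, eps=1e-8):
    """Focal-style loss for one predicted probability and a soft target in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to keep log() finite
    bce = -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))
    # The modulating factor shrinks as the prediction approaches the target.
    return abs(target - prob) ** gamma * bce


# Hard label: identical to -(1 - p)^gamma * log(p).
hard = soft_focal_loss(0.9, 1.0)
# Mix-up label: the loss is zero when the prediction equals the soft target.
soft = soft_focal_loss(0.7, 0.7)
```

A tensor version would apply the same formula element-wise to sigmoid outputs. Whether it helps on a small dataset would need to be verified experimentally.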

Owner

Xiaolong Liu

PhD student @ HUST | Deep learning | computer vision | action recognition

ICLR2021 (Under Review)

Self-Supervised Time Series Representation Learning by Inter-Intra Relational Reasoning This repository contains the official PyTorch implementation o

58 Dec 30, 2022

Pytorch implementation of our paper under review — Lottery Jackpots Exist in Pre-trained Models

Lottery Jackpots Exist in Pre-trained Models (Paper Link) Requirements Python >= 3.7.4 Pytorch >= 1.6.1 Torchvision >= 0.4.1 Reproduce the Experiment

27 Jun 28, 2022

The official TensorFlow implementation of the paper Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition

Action Transformer A Self-Attention Model for Short-Time Human Action Recognition This repository contains the official TensorFlow implementation of t

20 Jan 3, 2023

🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

🐤 Nix-TTS An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

156 Jan 9, 2023

Code & Models for 3DETR - an End-to-end transformer model for 3D object detection

3DETR: An End-to-End Transformer Model for 3D Object Detection PyTorch implementation and models for 3DETR. 3DETR (3D DEtection TRansformer) is a simp

487 Dec 31, 2022

Spatio-Temporal Entropy Model (STEM) for end-to-end learned video compression.

Spatio-Temporal Entropy Model A Pytorch Reproduction of Spatio-Temporal Entropy Model (STEM) for end-to-end learned video compression. More details can

16 Nov 28, 2022

End-to-end face detection, cropping, norm estimation, and landmark detection in a single onnx model

onnx-facial-lmk-detector End-to-end face detection, cropping, norm estimation, and landmark detection in a single onnx model, model.onnx. Demo You can

42 Dec 30, 2022

Allows including an action inside another action (by preprocessing the Yaml file). This is how composite actions should have worked.

actions-includes Allows including an action inside another action (by preprocessing the Yaml file). Instead of using uses or run in your action step,

70 Nov 4, 2022

Official implementation of ACTION-Net: Multipath Excitation for Action Recognition (CVPR'21).

ACTION-Net Official implementation of ACTION-Net: Multipath Excitation for Action Recognition (CVPR'21). Getting Started EgoGesture data folder struct

171 Dec 26, 2022

Human Action Controller - A human action controller running on different platforms.

Human Action Controller (HAC) Goal A human action controller running on different platforms. Fun Easy-to-use Accurate Anywhere Fun Examples Mouse Cont

27 Jul 20, 2022

PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Long Short-Term Transformer for Online Action Detection Introduction This is a PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short

77 Dec 16, 2022

TDN: Temporal Difference Networks for Efficient Action Recognition

TDN: Temporal Difference Networks for Efficient Action Recognition Overview We release the PyTorch code of the TDN(Temporal Difference Networks).

Multimedia Computing Group, Nanjing University

326 Dec 13, 2022

Code for CVPR2021 paper "Learning Salient Boundary Feature for Anchor-free Temporal Action Localization"

AFSD: Learning Salient Boundary Feature for Anchor-free Temporal Action Localization This is an official implementation in PyTorch of AFSD. Our paper

146 Dec 24, 2022

A pytorch-version implementation codes of paper: "BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation"

BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation A pytorch-version implementation

11 Oct 8, 2022

Implementation of temporal pooling methods studied in [ICIP'20] A Comparative Evaluation Of Temporal Pooling Methods For Blind Video Quality Assessment

5 Sep 16, 2022

Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

This repository is the official PyTorch implementation of Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

4 Dec 11, 2022

AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation

Spatial Temporal Graph Convolutional Networks (ST-GCN) for Skeleton-Based Action Recognition in PyTorch

Efficient Two-Step Networks for Temporal Action Segmentation (Neurocomputing 2021)
