PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Overview

Long Short-Term Transformer for Online Action Detection

Introduction

This is a PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Environment

  • The code is developed with CUDA 10.2, Python >= 3.7.7, PyTorch >= 1.7.1

    1. [Optional but recommended] create a new conda environment.

      conda create -n lstr python=3.7.7
      

      And activate the environment.

      conda activate lstr
      
    2. Install the requirements

      pip install -r requirements.txt
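
To confirm that an environment meets the version requirements listed above, a small check like the following can be run after installation. This is a minimal sketch and not part of the repository: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and only the minimum versions (Python 3.7.7, PyTorch 1.7.1) come from this README.

```python
import re
import sys


def parse_version(text):
    """Turn a string such as '1.7.1+cu102' into (1, 7, 1); suffixes are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognized version string: {!r}".format(text))
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(found, minimum):
    """True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    if not meets_minimum(python_version, "3.7.7"):
        problems.append("Python {} is older than 3.7.7".format(python_version))
    try:
        import torch  # imported lazily so the helpers above work without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, "1.7.1"):
            problems.append("PyTorch {} is older than 1.7.1".format(torch.__version__))
    return problems
```

Calling `check_environment()` returns an empty list when both minimums are met; it does not verify the CUDA version.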
      

Data Preparation

  1. Download the THUMOS'14 and TVSeries datasets.

  2. Extract feature representations for video frames.

    • For ActivityNet pretrained features, we use the ResNet-50 model for the RGB and optical flow inputs. We recommend using this checkpoint in MMAction2.

    • For Kinetics pretrained features, we use the ResNet-50 model for the RGB inputs. We recommend using this checkpoint in MMAction2. We use the BN-Inception model for the optical flow inputs. We recommend using the model here.

    Note: We compute the optical flow using DenseFlow.

  3. If you want to use our dataloaders, please arrange the files in the following structure:

    • THUMOS'14 dataset:

      $YOUR_PATH_TO_THUMOS_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── video_validation_0000051.npy (of size L x 2048)
      │   └── ...
      ├── flow_kinetics_bninception/
      │   ├── video_validation_0000051.npy (of size L x 1024)
      │   └── ...
      └── target_perframe/
          ├── video_validation_0000051.npy (of size L x 22)
          └── ...
      
    • TVSeries dataset:

      $YOUR_PATH_TO_TVSERIES_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── Breaking_Bad_ep1.npy (of size L x 2048)
      │   └── ...
      ├── flow_kinetics_bninception/
      │   ├── Breaking_Bad_ep1.npy (of size L x 1024)
      │   └── ...
      └── target_perframe/
          ├── Breaking_Bad_ep1.npy (of size L x 31)
          └── ...
      
  4. Create softlinks of datasets:

    cd long-short-term-transformer
    ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
    ln -s $YOUR_PATH_TO_TVSERIES_DATASET data/TVSeries
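
Before training, it can help to verify that every video has all three files with the sizes listed in the directory trees above and the same number of frames L. The following is a minimal sketch under those assumptions; `EXPECTED_WIDTHS`, `check_shapes`, and `load_shapes` are hypothetical names and not part of the repository's dataloaders.

```python
from pathlib import Path

# Second dimension of each .npy file, taken from the directory listings above.
EXPECTED_WIDTHS = {
    "THUMOS": {
        "rgb_kinetics_resnet50": 2048,
        "flow_kinetics_bninception": 1024,
        "target_perframe": 22,
    },
    "TVSeries": {
        "rgb_kinetics_resnet50": 2048,
        "flow_kinetics_bninception": 1024,
        "target_perframe": 31,
    },
}


def check_shapes(shapes, expected_widths):
    """Check one video. `shapes` maps sub-directory name -> array shape (L, width)."""
    problems = []
    lengths = set()
    for name, width in expected_widths.items():
        if name not in shapes:
            problems.append("{}: file missing".format(name))
            continue
        shape = tuple(shapes[name])
        if len(shape) != 2 or shape[1] != width:
            problems.append("{}: expected (L, {}), got {}".format(name, width, shape))
        if len(shape) == 2:
            lengths.add(shape[0])
    if len(lengths) > 1:
        problems.append("frame counts differ across files: {}".format(sorted(lengths)))
    return problems


def load_shapes(root, session, expected_widths):
    """Read the shapes of <root>/<subdir>/<session>.npy without loading the data."""
    import numpy as np  # only needed when reading real files

    shapes = {}
    for name in expected_widths:
        path = Path(root) / name / "{}.npy".format(session)
        if path.exists():
            shapes[name] = np.load(path, mmap_mode="r").shape
    return shapes
```

For example, `check_shapes(load_shapes("data/THUMOS", "video_validation_0000051", EXPECTED_WIDTHS["THUMOS"]), EXPECTED_WIDTHS["THUMOS"])` should return an empty list for a correctly prepared video.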
    

Training

Training LSTR with 512 seconds of long-term memory and 8 seconds of short-term memory requires less than 3 GB of GPU memory.

The commands are as follows.

cd long-short-term-transformer
# Training from scratch
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT
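
The trailing `KEY VALUE` pairs in the commands (such as `MODEL.CHECKPOINT $PATH_TO_CHECKPOINT`) override entries of the config file. The sketch below illustrates one plausible way such dotted overrides map onto a nested config, and why list values need to be quoted as a single argument. It is an assumption-laden illustration, not the repository's parser: `apply_overrides` is a hypothetical helper, and the config dictionary is a toy example.

```python
import ast


def apply_overrides(config, opts):
    """Apply ['A.B', 'value', ...] pairs to a nested dict, in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(key)
        try:
            # "['train', 'test']" becomes a real list, "8" becomes an int, ...
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # plain strings such as file paths or mode names
        node[leaf] = value
    return config


# Toy config; the real config file has many more entries.
config = {
    "MODEL": {"CHECKPOINT": "", "LSTR": {"INFERENCE_MODE": "batch"}},
    "SOLVER": {"PHASES": ["train"]},
}
apply_overrides(config, ["MODEL.CHECKPOINT", "ckpt.pth",
                         "SOLVER.PHASES", "['train', 'test']"])
```

Because the shell passes `"['train', 'test']"` as one string, the quotes around the whole list are what keep it a single override value.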

Online Inference

There are three kinds of evaluation methods in our code.

  • First, you can use the config SOLVER.PHASES "['train', 'test']" during training. This process divides each test video into non-overlapping samples and makes predictions on all the frames in the short-term memory as if they were the latest frame. Note that this evaluation result is not the final performance, since (1) for most of the frames, their short-term memory is not fully utilized and (2) for simplicity, samples at the boundaries are mostly ignored.

    cd long-short-term-transformer
    # Inference along with training
    python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
        SOLVER.PHASES "['train', 'test']"
    
  • Second, you can run the online inference in batch mode. This process evaluates all video frames by considering each of them as the latest frame and filling the long- and short-term memories by tracing back in time. Note that this evaluation result matches the numbers reported in the paper, but batch mode cannot be further accelerated as described in Sec. 3.6 of the paper. On the other hand, this mode can run faster when you use a large batch size, and we recommend using it for performance benchmarking.

    cd long-short-term-transformer
    # Online inference in batch mode
    python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
        MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
    
  • Third, you can run the online inference in stream mode. This process tests frame by frame along the entire video, from beginning to end. Note that this evaluation result matches both LSTR's performance and runtime reported in the paper. It processes the entire video the way LSTR would be applied in real-world scenarios. However, it currently supports testing only one video at a time.

    cd long-short-term-transformer
    # Online inference in stream mode
    python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
        MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream DATA.TEST_SESSION_SET "['$VIDEO_NAME']"
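
As a rough illustration of how batch mode and stream mode relate, the sketch below builds the frame indices that each mode would place in the long- and short-term memories. It is not the repository's implementation: the function names, the use of `None` as padding before the video starts, and the assumption that memory lengths are counted in frames with no temporal subsampling are all ours.

```python
from collections import deque


def batch_windows(num_frames, long_len, short_len):
    """Batch-mode view: treat every frame t as the latest frame and trace back.

    Returns a list of (long_indices, short_indices); positions that fall
    before the start of the video are None.
    """
    windows = []
    for t in range(num_frames):
        short_start = t - short_len + 1
        short = [i if i >= 0 else None for i in range(short_start, t + 1)]
        long = [i if i >= 0 else None
                for i in range(short_start - long_len, short_start)]
        windows.append((long, short))
    return windows


def stream_windows(num_frames, long_len, short_len):
    """Stream-mode view: push frames one at a time through two FIFO queues."""
    long = deque([None] * long_len, maxlen=long_len)
    short = deque([None] * short_len, maxlen=short_len)
    for t in range(num_frames):
        long.append(short[0])  # the oldest short-term frame moves to long-term memory
        short.append(t)        # maxlen makes the deque drop its oldest entry
        yield list(long), list(short)


# Both views produce the same memories for every frame.
assert list(stream_windows(30, 8, 4)) == batch_windows(30, 8, 4)
```

Each entry of `batch_windows` is independent of the others, which is why batch mode parallelizes well with a large batch size, whereas the stream version carries state from one frame to the next.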
    

Evaluation

Evaluate LSTR's performance for online action detection using per-frame mAP or mcAP.

cd long-short-term-transformer
python tools/eval/eval_perframe --pred_scores_file $PRED_SCORES_FILE
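
For readers unfamiliar with the two metrics, the sketch below computes per-frame average precision for one class and averages it over classes. It assumes the common definition of calibrated precision used with TVSeries, cPrec = w·TP / (w·TP + FP) with w the ratio of negative to positive frames. It is an illustration only and not the code in tools/eval; the function names and the `skip_classes` parameter are hypothetical.

```python
def average_precision(scores, labels, calibrated=False):
    """Per-frame AP for one class; `labels` holds 0/1 per frame.

    With calibrated=True, precision is computed as w*TP / (w*TP + FP),
    where w = (#negative frames) / (#positive frames).
    """
    positives = sum(1 for value in labels if value)
    negatives = len(labels) - positives
    if positives == 0:
        return float("nan")  # AP is undefined without positive frames
    w = negatives / positives if (calibrated and negatives) else 1.0
    # Rank frames by descending score (ties keep their original order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    total = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            total += (w * tp) / (w * tp + fp)
        else:
            fp += 1
    return total / positives


def mean_average_precision(score_rows, label_rows, calibrated=False, skip_classes=()):
    """Average the per-class AP over classes; inputs are L x C nested lists."""
    aps = []
    for c in range(len(score_rows[0])):
        if c in skip_classes:
            continue
        ap = average_precision([row[c] for row in score_rows],
                               [row[c] for row in label_rows], calibrated)
        if ap == ap:  # skip classes whose AP is NaN
            aps.append(ap)
    return sum(aps) / len(aps) if aps else float("nan")
```

Passing `calibrated=True` gives the mcAP-style variant, which reduces the effect of the positive-to-negative frame ratio on the score.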

Evaluate LSTR's performance at different action stages by evaluating each decile (ten-percent interval) of the video frames separately.

cd long-short-term-transformer
python tools/eval/eval_perstage --pred_scores_file $PRED_SCORES_FILE
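
To make the decile idea concrete, the sketch below assigns every frame of an action instance to one of ten stages according to its relative position within that instance. It assumes that an action instance is a contiguous run of positive frames for one class; `action_deciles` is a hypothetical helper and not the logic of the repository's evaluation script.

```python
def action_deciles(labels):
    """Map each frame of a 0/1 label sequence to a stage index 0-9.

    Frames inside an action instance get the decile of their relative
    position within that instance; background frames get None.
    """
    deciles = [None] * len(labels)
    start = None
    # A trailing 0 acts as a sentinel that closes a run ending at the last frame.
    for t, value in enumerate(list(labels) + [0]):
        if value and start is None:
            start = t
        elif not value and start is not None:
            length = t - start
            for offset in range(length):
                deciles[start + offset] = offset * 10 // length
            start = None
    return deciles
```

A per-stage score for stage k could then be computed by keeping only the positive frames whose decile equals k, together with the background frames, before applying the per-frame metric.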

Citations

If you are using the data/code/model provided here in a publication, please cite our paper:

@inproceedings{xu2021long,
	title={Long Short-Term Transformer for Online Action Detection},
	author={Xu, Mingze and Xiong, Yuanjun and Chen, Hao and Li, Xinyu and Xia, Wei and Tu, Zhuowen and Soatto, Stefano},
	booktitle={Conference on Neural Information Processing Systems (NeurIPS)},
	year={2021}
}

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Comments
  • Some details about the feature extraction.

    Hi, thanks for generously sharing your code. When I tried to extract optical flow features of Kinetics using BNInception, I ran into some problems.

    • I don't know the data preprocessing method for BNInception. Could you please provide more complete code?
    • I notice there are several configurations for extracting optical flow with denseflow. Which denseflow configuration did you use? For example: denseflow test.avi -b=20 -a=tvl1 -s=1 -v?
    opened by sqiangcao99 7
  • Reproduced result slightly lower than the reported.

    Hi,

    I have tried the code and got 68.3% on the THUMOS dataset, which is slightly lower than the result in the paper (69.5%).

    Could you provide one or two feature and label files so that I can compare against them as a sanity check?

    opened by junwenchen 6
  • Some details about preparing dataset.

    Hi, thank you for sharing the code. I ran into some problems while preparing the dataset.

    1. How did you extract the features for TVSeries? Is the input image at its original size? Did you apply any cropping or resizing?

    2. I generate target_perframe.npy at the frame level. For example, if a video is 200 seconds long and the FPS is 4, the video has 800 frames. I read the annotations from the .txt file and convert the seconds in the annotations to frames (by multiplying by the FPS). Then, for each frame, if any action happens in that frame, I add one to the value for that action in the frame's vector. In the end I get an array of size 800 x (number of classes). Is there any mistake in this procedure?

    3. Would you please share the training log so that I can compare it with mine?

    opened by Echo0125 3
  • about custom dataset

    Thank you for sharing your great work!

    I have something to ask you.

    I have my own dataset that I want to apply LSTR to. For this, I extracted RGB and flow features separately, following https://github.com/open-mmlab/mmaction2/tree/master/tools/data/activitynet. However, the features came out in pkl format, while your dataloader requires npy format. How can I get the features in npy format like yours?

    opened by tghim 2
  • How to extract perframe features using mmaction2?

    Hi,

    I'm a beginner in video understanding and OAD. I have installed mmaction2 and denseflow, and they work well. But I'm not sure how to extract per-frame features with them.

    1. Are there any ready-made APIs for this? I've spent a lot of time searching for a solution but haven't found one.
    2. If there are no ready-made APIs, could you please share the related code with us? It's important for beginners.

    Thank you for your awesome repo. I hope to hear from you.

    opened by Prot-debug 2
  • How to do optical flow data preprocessing before sending to the bninception net?

    @xumingze0308 Hi Xu,

    I followed the URL https://github.com/yjxiong/action-detection/blob/master/transforms.py that you mentioned in other issues. But I found this data preprocessing very inefficient, since it processes the .jpg files one frame at a time. It also seems easy to cause memory leaks when using the function PIL.Image.open(). I don't know where I am going wrong; my code is as follows.

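    import gc

    import torch
    import torchvision
    from PIL import Image

    # Group transforms from the transforms.py linked above, assumed to be
    # importable from the working directory.
    from transforms import (GroupNormalize, GroupRandomCrop, GroupScale,
                            Stack, ToTorchFormatTensor)


    def transforms_img(img_list):
        # Scale, crop, stack, convert to a tensor, then normalize the single channel.
        trans = torchvision.transforms.Compose([
            GroupScale(256),
            GroupRandomCrop(224),
            Stack(),
            ToTorchFormatTensor(),
            GroupNormalize(
                mean=[128],
                std=[128]
            )]
        )

        for i, img_dir in enumerate(img_list):
            # Open each image as 8-bit grayscale ('L') and transform it on its own.
            with open(img_dir, 'rb') as open_file:
                img = Image.open(open_file).convert('L')
                color_group = [img]
                rst = trans(color_group)
                del color_group
                del img
            if i == 0:
                stack_img = rst
                del rst
            else:
                # Append the new frame to the running stack along dim 0.
                stack_img = torch.cat((stack_img, rst), dim=0)
                del rst
        gc.collect()
        # Return the stacked frames (the original snippet discarded them).
        return stack_img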
    
    
    opened by Prot-debug 1
  • How to test the speed of LSTR?

    Hi, many factors can affect the inference speed of the model, such as the evaluation method of LSTR (batch mode or stream mode). Could you please provide more details, such as the batch_size? It would be even better if you could provide the speed-testing code.

    opened by sqiangcao99 1
  • How to handle imbalance in dataset for TVSeries?

    Hi,

    Thanks for the awesome code repo. I observed that there is a large imbalance in the dataset. For TVSeries, >70% of the data is background class.

    1. I was wondering if you tried any technique to deal with the imbalance in the dataset? The LSTR paper does not mention any such technique.

    2. My validation loss does not converge and most predictions are of background class. My mcAP does not increase over epochs. What metric do you recommend to track/debug over epochs for performance of non-background classes on TVSeries?

    Thanks!

    opened by miteshksingh 1
  • Help wanted

    I would like some assistance with data preparation and execution. I have downloaded the required files but got some errors during execution. A rapid response would be greatly appreciated. Thanks

    opened by Quadwo 1
  • RN50 pretrained backbone

    Hi. Thanks for sharing the codebase of LSTR. I was unable to get hold of this Kinetics-pretrained RN50 checkpoint which you have mentioned in your readme. It would be quite helpful if you could share this pretrained backbone (and if possible, for ActivityNet too).

    Thanks.

    opened by priyamdey 1
  • about data analysis

    Sorry to bother you. I notice that the decrease in test loss and the mAP are not well synchronized as the epochs increase. So I'm not sure when to stop training, or how the mAP in Fig. 3 is calculated.

    opened by 007invictus 0