Official Pytorch Implementation of 'Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization' (ICCV-21 Oral)

Pilhyeon Lee

Last update: Jan 3, 2023

Related tags

Deep Learning deep-learning pytorch weakly-supervised-learning temporal-action-localization action-completeness point-level-supervision

Overview

Learning-Action-Completeness-from-Points

Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization
Pilhyeon Lee (Yonsei Univ.), Hyeran Byun (Yonsei Univ.)

Paper: https://arxiv.org/abs/2108.05029

Abstract: We tackle the problem of localizing temporal intervals of actions with only a single frame label for each action instance for training. Owing to label sparsity, existing work fails to learn action completeness, resulting in fragmentary action predictions. In this paper, we propose a novel framework, where dense pseudo-labels are generated to provide completeness guidance for the model. Concretely, we first select pseudo background points to supplement point-level action labels. Then, by taking the points as seeds, we search for the optimal sequence that is likely to contain complete action instances while agreeing with the seeds. To learn completeness from the obtained sequence, we introduce two novel losses that contrast action instances with background ones in terms of action score and feature similarity, respectively. Experimental results demonstrate that our completeness guidance indeed helps the model to locate complete action instances, leading to large performance gains especially under high IoU thresholds. Moreover, we demonstrate the superiority of our method over existing state-of-the-art methods on four benchmarks: THUMOS'14, GTEA, BEOID, and ActivityNet. Notably, our method even performs comparably to recent fully-supervised methods, at the 6 times cheaper annotation cost.

Prerequisites

Recommended Environment

Python 3.6
PyTorch 1.6
TensorFlow 1.15 (for TensorBoard)
CUDA 10.2

Dependencies

You can set up the environment by running the command below.

$ pip3 install -r requirements.txt

Data Preparation

1. Prepare the THUMOS'14 dataset.
   - We excluded three test videos (270, 1292, 1496), as previous work did.
2. Extract features with two-stream I3D networks.
   - We recommend extracting features using this repo.
   - For convenience, we provide the features we used. You can find them here.
3. Place the features inside the dataset folder.
   - Please ensure the data structure is as below.

├── dataset
   └── THUMOS14
       ├── gt.json
       ├── split_train.txt
       ├── split_test.txt
       ├── fps_dict.json
       ├── point_gaussian
           └── point_labels.csv
       └── features
           ├── train
               ├── rgb
                   ├── video_validation_0000051.npy
                   ├── video_validation_0000052.npy
                   └── ...
               └── flow
                   ├── video_validation_0000051.npy
                   ├── video_validation_0000052.npy
                   └── ...
           └── test
               ├── rgb
                   ├── video_test_0000004.npy
                   ├── video_test_0000006.npy
                   └── ...
               └── flow
                   ├── video_test_0000004.npy
                   ├── video_test_0000006.npy
                   └── ...
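
Before training, it can help to confirm that the folder matches the tree above. The following is a minimal Python sketch and is not part of the repository: the function name find_layout_problems and the constants REQUIRED_FILES and REQUIRED_DIRS are hypothetical, and the pairing check assumes that every video has a .npy file with the same name under both rgb and flow, as the tree suggests.

```python
from pathlib import Path

# Files and folders listed in the tree above, relative to dataset/THUMOS14.
REQUIRED_FILES = [
    "gt.json",
    "split_train.txt",
    "split_test.txt",
    "fps_dict.json",
    "point_gaussian/point_labels.csv",
]
REQUIRED_DIRS = [
    "features/train/rgb",
    "features/train/flow",
    "features/test/rgb",
    "features/test/flow",
]


def find_layout_problems(root):
    """Return a list of readable problems; an empty list means the layout matches."""
    root = Path(root)
    problems = []
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    for rel in REQUIRED_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    # Assumption: each video has one RGB and one flow feature file with the same name.
    for subset in ("train", "test"):
        rgb_dir = root / "features" / subset / "rgb"
        flow_dir = root / "features" / subset / "flow"
        if rgb_dir.is_dir() and flow_dir.is_dir():
            rgb = {p.name for p in rgb_dir.glob("*.npy")}
            flow = {p.name for p in flow_dir.glob("*.npy")}
            for name in sorted(rgb ^ flow):
                problems.append(f"{subset}: {name} is missing its rgb or flow counterpart")
    return problems


if __name__ == "__main__":
    for problem in find_layout_problems("dataset/THUMOS14"):
        print(problem)
```

Run from the repository root, the script prints one line per problem and prints nothing when the layout matches the tree.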

Usage

Running

You can easily train and evaluate the model by running the script below.

If you want to try other training options, please refer to options.py.

$ bash run.sh

Evaluation

The pre-trained model can be found here. You can evaluate the model by running the command below.

$ bash run_eval.sh

References

We note that this repo was built upon our previous models.

Background Suppression Network for Weakly-supervised Temporal Action Localization (AAAI 2020) [paper] [code]
Weakly-supervised Temporal Action Localization by Uncertainty Modeling (AAAI 2021) [paper] [code]

We referenced the repos below for the code.

In addition, we referenced a part of code in the following repo for the greedy algorithm implementation.

NeuralNetwork-Viterbi

Citation

If you find this code useful, please cite our paper.

@inproceedings{lee2021completeness,
  title={Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization},
  author={Pilhyeon Lee and Hyeran Byun},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2021},
}

Contact

If you have any questions or comments, please contact the first author of the paper, Pilhyeon Lee ([email protected]).

Comments

Query regarding transcript in optimal transport

Thanks for making this awesome work publicly available!

I would like to know the meaning of the term "transcript" in "search.py". I cannot understand why [0,1] is given in some cases and [1] in others. Could you kindly elaborate?

opened by sauradip 6

About video-level probability

Thanks for your excellent work! I am confused about why the video-level probability is expressed as follows:

vid_score = (torch.mean(topk_scores, dim=1) * vid_labels) + (torch.mean(cas_sigmoid[:,:,:-1], dim=1) * (1 - vid_labels))

This seems inconsistent between training and testing.

opened by yangjiangeyjg 2

About feature extractions

You mentioned that the feature extraction in your work followed https://github.com/piergiaj/pytorch-i3d, but when I tried to apply it to my own datasets, I found that the dimensions of the layer 'logits.conv3d' do not match.

Traceback (most recent call last):
  File "extract_features.py", line 88, in <module>
    run(mode=args.mode, load_model=args.load_model)
  File "extract_features.py", line 51, in run
    i3d.load_state_dict(torch.load(load_model))
  File "/home/pengfang/.conda/envs/mvit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for InceptionI3d:
    size mismatch for logits.conv3d.weight: copying a param with shape torch.Size([400, 1024, 1, 1, 1]) from checkpoint, the shape in current model is torch.Size([54, 1024, 1, 1, 1]).
    size mismatch for logits.conv3d.bias: copying a param with shape torch.Size([400]) from checkpoint, the shape in current model is torch.Size([54]).

Do I need to fine-tune I3D on my datasets? Could you tell me how you applied this code to THUMOS'14?

opened by Yuuuumie 1

about thumos14 label

Hello, in THUMOS'14, CliffDiving is a subclass of Diving, and the action instances of CliffDiving in the annotation file also belong to Diving. Why don't you use this prior knowledge to remove the CliffDiving action instances from the Diving class during training, and then add a Diving label to each predicted CliffDiving action instance during post-processing? I think an action instance belonging to two categories may make it difficult for training to converge.

opened by menghuaa 12

How to reproduce the GTEA

Is there any trick to reproducing the GTEA results? Could you please provide the config.txt for this dataset? It would be convenient for reproduction. Thank you. @Pilhyeon

opened by chenrxi 6

For GTEA and BEOID

Hello, thanks for your excellent work. I am very interested in it! May I ask whether the extracted features for ActivityNet, GTEA, and BEOID will be released?

opened by PHDJieFu 3
