[CVPR2022] Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

Overview

Created by Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, Jiwen Lu

This repository contains PyTorch implementation for Bridge-Prompt (CVPR 2022).

We propose a prompt-based framework, Bridge-Prompt (Br-Prompt), to model the semantics across multiple adjacent correlated actions, so that it simultaneously exploits both out-of-context and contextual information from a series of ordinal actions in instructional videos. More specifically, we reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics. The generated text prompts are paired with corresponding video clips, and together co-train the text encoder and the video encoder via a contrastive approach. The learned vision encoder has a stronger capability for ordinal-action-related downstream tasks, e.g. action segmentation and human activity recognition.

Our code is based on CLIP and ActionCLIP.
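
For intuition, each video clip is paired with its generated text prompt and the two encoders are co-trained with a CLIP-style symmetric contrastive objective. The snippet below is a minimal sketch of such a loss over a batch of clip/prompt embeddings; it is illustrative only (the function name, temperature handling, and the single-term form are assumptions, not the exact objective implemented in this repository).

    import torch
    import torch.nn.functional as F

    def clip_style_contrastive_loss(video_emb, text_emb, logit_scale=100.0):
        """Symmetric contrastive (InfoNCE) loss over paired video/text embeddings.

        video_emb, text_emb: (batch, dim) outputs of the video and text encoders,
        where row i of each tensor comes from the same clip/prompt pair.
        logit_scale: temperature factor (a learnable, exponentiated scalar in CLIP).
        """
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) cosine-similarity matrix; the diagonal holds the positive pairs.
        logits_per_video = logit_scale * video_emb @ text_emb.t()
        logits_per_text = logits_per_video.t()

        targets = torch.arange(video_emb.size(0), device=video_emb.device)
        return (F.cross_entropy(logits_per_video, targets) +
                F.cross_entropy(logits_per_text, targets)) / 2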

Prerequisites

Requirements

You may need ffmpeg for video data pre-processing.

The required environment is recorded in requirements.txt and can be reproduced with:

pip install -r requirements.txt

Pretrained models

We use the base model (ViT-B/16 for the image and text encoders) pre-trained by ActionCLIP on Kinetics-400. The model can be downloaded from the link (pwd: ilgw). The pre-trained model should be saved in ./models/.

Datasets

Raw video files are needed to train our framework. Please download the datasets with RGB videos from the official websites (Breakfast / GTEA / 50Salads) and save them under the folder ./data/(name_dataset). For convenience, we use extracted frames of the raw RGB videos as inputs. You can extract the frames from the raw RGB videos by running:

python preprocess/get_frames.py --dataset (name_dataset) --vpath (folder_to_your_videos) --fpath ./data/(name_dataset)/frames/

Note that ffmpeg is required here for frame extraction.

Please also extract the provided .zip files into ./data/(name_dataset).
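
For reference, frame extraction boils down to one ffmpeg call per video. The sketch below illustrates the idea; the helper, paths, and JPEG quality setting are illustrative assumptions, not the exact logic of preprocess/get_frames.py.

    import subprocess
    from pathlib import Path

    def extract_frames(video_path, out_dir, fps=None):
        """Dump a video's frames as JPEGs with ffmpeg (hypothetical helper)."""
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = ["ffmpeg", "-i", str(video_path)]
        if fps is not None:
            cmd += ["-vf", f"fps={fps}"]       # optionally resample the frame rate
        cmd += ["-q:v", "2", str(out_dir / "img_%05d.jpg")]
        subprocess.run(cmd, check=True)

    # Example: extract every video of one dataset into ./data/(name_dataset)/frames/
    for video in Path("path/to/your/videos").glob("*.mp4"):
        extract_frames(video, Path("./data/gtea/frames") / video.stem)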

Training

  • To train Bridge-Prompt on Breakfast from the Kinetics-400 pre-trained model, run:
bash scripts/run_train.sh  ./configs/breakfast/breakfast_ft.yaml
  • To train Bridge-Prompt on GTEA from the Kinetics-400 pre-trained model, run:
bash scripts/run_train.sh  ./configs/gtea/gtea_ft.yaml
  • To train Bridge-Prompt on 50Salads from the Kinetics-400 pre-trained model, run:
bash scripts/run_train.sh  ./configs/salads/salads_ft.yaml

Extracting frame features

We use the Bridge-Prompt pre-trained image encoders to extract frame-wise features for downstream tasks (e.g., action segmentation). Run the following command for each dataset:

python extract_frame_features.py --config ./configs/(dataset_name)/(dataset_name)_exfm.yaml --dataset (dataset_name)

Since 50Salads and Breakfast are large-scale datasets, we extract their frame features in window splits. To combine the splits, run:

python preprocess/combine_features.py

Please modify the variables dataset and feat_name in combine_features.py for each dataset.
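
Conceptually, combining the splits stitches each video's per-window feature arrays back together along the time axis. The sketch below shows the idea for splits saved as (n_frames, feat_dim) .npy files; the file naming and the plain concatenation (no overlap handling) are assumptions, and the authoritative logic lives in preprocess/combine_features.py.

    import glob
    import numpy as np

    def combine_video_splits(split_pattern, out_path):
        """Concatenate one video's per-window feature splits along the time axis."""
        split_files = sorted(glob.glob(split_pattern))   # assumes split indices sort correctly
        splits = [np.load(f) for f in split_files]       # each array: (n_frames_i, feat_dim)
        combined = np.concatenate(splits, axis=0)        # -> (sum of n_frames_i, feat_dim)
        np.save(out_path, combined)
        return combined.shape

    # Hypothetical usage for one Breakfast video:
    print(combine_video_splits("features/P47_webcam02_P47_cereals_split*.npy",
                               "features_combined/P47_webcam02_P47_cereals.npy"))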

Action segmentation

You can reproduce the action segmentation results by running ASFormer on the previously extracted frame features.
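
As users note in the comments below, the extracted features are stored as (n_frames, 768) arrays per video, while ASFormer expects inputs of shape (features_dim, n_frames). A minimal conversion sketch (with placeholder paths) is:

    from pathlib import Path
    import numpy as np

    out_dir = Path("asformer_features")                  # placeholder output folder
    out_dir.mkdir(exist_ok=True)

    # Bridge-Prompt features come out as one (n_frames, 768) array per video.
    feat = np.load("features_combined/P47_webcam02_P47_cereals.npy")   # placeholder input path

    # ASFormer expects (features_dim, n_frames), so transpose before saving,
    # and set features_dim = 768 on the ASFormer side.
    np.save(out_dir / "P47_webcam02_P47_cereals.npy", feat.T)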

Activity recognition

You can reproduce the activity recognition results using the command:

python ft_acti.py

based on the previously extracted frame features (Breakfast).

Ordinal action recognition

Run ordinal action inference using:

bash scripts/run_test.sh  ./configs/(dataset_name)/(dataset_name)_test.yaml

and check the accuracies using:

python preprocess/checknpy.py

Please modify the variable dataset in checknpy.py for each dataset.

Notes

Please modify pretrain in all config files according to your own working directories.

License

MIT License.

Comments
  • Having problem reproducing action segmentation result

    Hello,

    I've followed all of the steps mentioned in README.md with the GTEA dataset. Using the extracted frame features (gtea_vit_features_splt1), I followed ASFormer's steps (download the pretrained model, download the dataset, replace the features with the extracted ones, and run eval, i.e. steps 1, 2, 3, and 4 in ASFormer's README.md),

    but the following error occurs when I run python main.py --action=predict --dataset=gtea --split=1, and I could not handle it:

    UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.1)
      warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
    Model Size: 1130860
    Traceback (most recent call last):
      File "/home/jckim/ASFormer/main.py", line 97, in <module>
        trainer.predict(model_dir, results_dir, features_path, batch_gen_tst, num_epochs, actions_dict, sample_rate)
      File "/home/jckim/ASFormer/model.py", line 406, in predict
        batch_input, batch_target, mask, vids = batch_gen_tst.next_batch(1)
      File "/home/jckim/ASFormer/batch_gen.py", line 101, in next_batch
        classes[i] = self.actions_dict[content[i]]
    KeyError: '<take><bread> (24-76) [0]'

    I've also tried replacing all the data with your data/gtea.zip instead of the data.zip given by the ASFormer repository, but it still doesn't work.

    Is there anything I've done wrong? (If downloading the pretrained model you uploaded is the only solution, note that I can't download it via Baidu, because signing up for Baidu is currently not available for foreign users' accounts.)

    Thank you in advance!

    opened by TikaToka 6
  • Which epoch is the pretrain in *_test.yaml

    Hello, thank you very much for your detailed sharing. I have the following questions:

    1. Which epoch is the "pretrain" key in gtea_test.yaml and salads_test.yaml? In breakfast_test.yaml you give pretrain: "./exp/clip_ucf/ViT-B/16/breakfast/20211114_052417_splt1_adj/14_epoch.pt", i.e. you explicitly use epoch 14.
    2. Why are there * and *_wc variants in ordinal action recognition? What does "wc" mean, and what were your considerations here?

    Looking forward to your reply!

    opened by laohuijiadezhu 4
  • inferior reproduction performance

    Hi~ Thank you for your amazing work!

    When I conducted the action segmentation experiments on the GTEA dataset with your official public code, I found that the reproduced results are inferior to the ones reported in the paper. I am wondering what I missed in the reproduction. Allow me to describe several details of my trial.

    I believe each of the 4 splits should be trained with Bridge-Prompt separately, so that its test split does not participate in the training process; this might not be mentioned in the paper or in the code's README.md.

    So I adjusted n_split in gtea_ft.yaml to use train_split1/2/3/4_nf16_ol[2, 1, 0.5]_ds[1, 2, 4].npy (with everything else unchanged) and trained a model for each split separately. After that, each split's last model was taken as the pretrained model for that split (by updating pretrain in gtea_exfm.yaml and log_time in extract_frame_features.py) for extracting frame-wise features. (BTW, I attach these two config files in .txt format at the end of this issue.)

    In other words, each of the 4 splits had its own frame-wise features, which then served as the model inputs of ASFormer. That is, I obtained four folders gtea_vit_features_split1/2/3/4 holding the extracted features for the individual splits.

    To match ASFormer's input size (features_dim, n_frames), I transposed the final feature representations (n_frames, features_dim) and simply set features_dim=768 within main.py of the ASFormer source code. As proposed in their paper, I trained for 120 epochs and evaluated the performance with the 120-epoch model.

    The reproduced per-split and average results of the 4 splits on the GTEA dataset are shown below:

        python eval.py --dataset=gtea --split=1
        Acc: 81.8661    Edit: 88.776471    F1@10,25,50 [91.17647059 88.97058824 80.14705882]
        python eval.py --dataset=gtea --split=2
        Acc: 80.4668    Edit: 87.354703    F1@10,25,50 [87.94326241 85.81560284 77.30496454]
        python eval.py --dataset=gtea --split=3
        Acc: 81.1862    Edit: 92.991235    F1@10,25,50 [93.23308271 90.97744361 87.21804511]
        python eval.py --dataset=gtea --split=4
        Acc: 83.7258    Edit: 90.182133    F1@10,25,50 [94.73684211 93.98496241 83.45864662]
        python eval.py --dataset=gtea --split=0
        Acc: 81.8112    Edit: 89.826136    F1@10,25,50 [91.77241445290323, 89.93714927180278, 82.03217877296495]

    Here are the logs from training Bridge-Prompt on GTEA from the provided pretrained model vit-16-32f.pt:

    Name | State | Notes | User | Runtime | lr | train_loss_acts | train_loss_all | train_loss_cnt | train_total_loss | Created
    -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
    20220413_184847_clip_ucf_ViT-B/16_gtea | finished | gtea_split1 | yo3nglau | 15457 | 7.16E-12 | 0.56103516 | 0.10571289 | 0.59570313 | 1.26269531 | 2022-04-13T10:48:51.000Z
    20220413_122249_clip_ucf_ViT-B/16_gtea | finished | gtea_split2 | yo3nglau | 15340 | 2.91E-12 | 0.20373535 | 0.02615356 | 0.47509766 | 0.70507813 | 2022-04-13T04:22:56.000Z
    20220412_114851_clip_ucf_ViT-B/16_gtea | finished | gtea_split3 | yo3nglau | 22225 | 1.71E-11 | 0.25561523 | 0.07971191 | 0.06726074 | 0.40283203 | 2022-04-12T03:48:55.000Z
    20220411_215453_clip_ucf_ViT-B/16_gtea | finished | gtea_split4 | yo3nglau | 16942 | 4.05E-12 | 0.3918457 | 0.05957031 | 0.13122559 | 0.58251953 | 2022-04-11T13:54:57.000Z

    I am quite confused and look forward to your guidance!

    Best regards

    gtea_ft.txt gtea_exfm.txt

    opened by yo3nglau 4
  • pretrain model for feature extractor in 50salads

    Hi! Thanks for your great work. I tried to extract features from the 50Salads dataset using this command:

    python extract_frame_features.py --config ./configs/salads/salads_exfm.yaml --dataset 50salads
    

    I don't know why this yaml file references the breakfast pretrained model path: https://github.com/ttlmh/Bridge-Prompt/blob/master/configs/salads/salads_exfm.yaml#L1. If I want to reproduce your work on the 50Salads dataset, should I modify the salads configs (for example, to salads/20211111_121454_splt1/last_model.pt from your uploaded files)?

    Thanks.

    opened by habakan 3
  • Finetune features

    Hi, thanks for your amazing work. Here is a question: how do you fine-tune the feature extractor (ViT-B) on a specific dataset (like GTEA)? I suppose that a video with segmentation labels is split into multiple segments and each segment is treated as a trimmed action video. I want to know whether I am right :) By the way, would you please release the fine-tuned models? Thanks for your efforts.

    opened by hitcbw 3
  • Combine Features Dimension Mismatch

    The following error occurred when I executed the command python preprocess/combine_features.py.

    ......
    P47_webcam02_P47_cereals is combined.
    Traceback (most recent call last):
      File "preprocess/combine_features.py", line 31, in <module>
        result[row.ind:, :] = tfeat[:diff, :]
    ValueError: could not broadcast input array from shape (32,768) into shape (36,768)
    

    The previous features have been successfully combined, but there is a dimension problem when combining this feature. I can't figure out what caused it. How can I solve it? I really need your help and look forward to your answers. Thank you very much!

    opened by laohuijiadezhu 2
  • Which dataset version should be downloaded

    Hello, thank you very much for your detailed README and answers to the issues. Following the README, I clicked the links to the official websites of Breakfast and GTEA. However, I found several download options for the raw videos. Breakfast: [screenshot] GTEA: [screenshot]

    I don't know which to choose. Can you tell me your choice? I look forward to your reply!

    opened by laohuijiadezhu 2
  • Diagonal Matrix or Symmetric Matrix

    Thank you for sharing your excellent work. I have some questions about the pipeline.gif. In the loss functions, video features and text features form similarity matrices; some of them are drawn as diagonal matrices and some as symmetric matrices, and I don't understand the distinction. I think the symmetric matrix means videos and prompts are paired one by one, while the diagonal matrix means multiple videos correspond to one prompt text. For example, the statistics loss is drawn as a symmetric matrix in the pipeline, but I think z[CNT] has only one positive pair, so Lstat should be a diagonal matrix: the predicted statistics should correspond to the prompt statistics one by one. I look forward to your reply. Thanks.

    opened by laohuijiadezhu 0
  • How to reproduce action segmentation

    The paper mentions that the features extracted in the "Extracting frame features" section (in README.md) are used as the training input of ASFormer. I have read the source code and README of ASFormer, and I think there are two steps, prediction and evaluation, that I need to run myself. For prediction, I probably cannot directly copy the commands; the source code may need to be modified. Where should I modify it? Or is there some other operation required? I'm looking forward to your answer. Thanks!

    opened by laohuijiadezhu 10
  • encoder release

    Hello, would it be possible for you to release the final vision encoder weights (or the extracted features) for the three datasets? It would be really convenient to speed up the development of new models based on your visual features. Thank you!

    opened by FrCha 2
  • model select and sample strategy

    Thanks for your interesting work! When reproducing your work on my own dataset for the action segmentation task, I have two further questions:

    1. How do I select the best backbone model? I tried using the validation-set loss, but the embeddings extracted by the model with the minimal validation loss do not work well when I use ASFormer as the temporal classifier.
    2. Have you compared different downsampling and overlap strategies? I wonder whether the model works better on action clips that contain 4-5 actions than on clips that contain 2-3 actions.
    opened by Lycus99 1
  • Dataset preparation

    Hi, thank you very much for your work. I want to reproduce your network on my own dataset and ran into the following questions:

    1. Why does datasets.py define several classes for a single dataset (e.g. for Breakfast: Breakfast, Breakfast_feat, Breakfast_acti, Breakfast_FRAMES, ...)?
    2. In the __getitem__ of the GTEA and SALADS datasets, there is the following code:

       if self.pretrain:
           vid = torch.from_numpy(vid)
           vid = torch.unique_consecutive(vid)
           vid = vid.numpy()
           vid = np.ma.masked_equal(vid, 0)
           vid = vid.compressed()
           vid = np.pad(vid, (0, 10 - vid.shape[0]), 'constant', constant_values=(0, -1))

       I understand this code as collecting the actions that appear in a clip in order to generate the text prompt. But it changes the original labels, e.g. turning [0,0,0,0,0,1,1,2,2,2,2,2,3,3,3,4] into [1, 2, 3, 4, -1, -1, -1, -1, -1, -1]. How is the loss computed with these changed labels?
    opened by Lycus99 1