# Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos
Created by Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, Jiwen Lu
This repository contains the PyTorch implementation of Bridge-Prompt (CVPR 2022).
We propose a prompt-based framework, Bridge-Prompt (Br-Prompt), to model the semantics across multiple adjacent correlated actions, so that it simultaneously exploits both out-of-context and contextual information from a series of ordinal actions in instructional videos. More specifically, we reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics. The generated text prompts are paired with corresponding video clips, and together co-train the text encoder and the video encoder via a contrastive approach. The learned vision encoder has a stronger capability for ordinal-action-related downstream tasks, e.g. action segmentation and human activity recognition.
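For intuition, the contrastive co-training objective can be sketched as a symmetric video–text InfoNCE-style loss over paired clip and prompt embeddings. The snippet below is a minimal illustration, not the exact Bridge-Prompt loss; the encoder outputs and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss between paired video clips and text prompts.

    video_emb, text_emb: (batch, dim) outputs of the video and text encoders.
    Simplified sketch of the contrastive co-training described above, not the
    exact Bridge-Prompt objective.
    """
    # Normalize so the dot product is a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching clip/prompt pairs lie on the diagonal.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)

    # Contrast in both directions (video-to-text and text-to-video).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```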
Our code is based on CLIP and ActionCLIP.
## Prerequisites

### Requirements

You may need ffmpeg for video data pre-processing.

The environment is recorded in `requirements.txt` and can be reproduced with:

```bash
pip install -r requirements.txt
```
### Pretrained models

We use the base model (ViT-B/16 for both the image encoder and the text encoder) pre-trained by ActionCLIP on Kinetics-400. The model can be downloaded from the link (pwd: ilgw). The pre-trained model should be saved in `./models/`.
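For reference, a checkpoint placed under `./models/` can typically be loaded along the following lines. This is only a sketch: the file name shown is hypothetical, and the state-dict key depends on the ActionCLIP release.

```python
import torch

# Hypothetical file name -- use the actual checkpoint you downloaded.
checkpoint = torch.load("./models/vit-b-16-32f.pt", map_location="cpu")

# ActionCLIP-style checkpoints often store the weights under a nested key
# such as 'model_state_dict'; inspect the keys if loading fails.
state_dict = checkpoint.get("model_state_dict", checkpoint)
print(sorted(state_dict.keys())[:5])
```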
### Datasets

Raw video files are needed to train our framework. Please download the datasets with RGB videos from the official websites (Breakfast / GTEA / 50Salads) and save them under the folder `./data/(name_dataset)`. For convenience, we use frames extracted from the raw RGB videos as inputs. You can extract the frames from the raw RGB datasets by running:

```bash
python preprocess/get_frames.py --dataset (name_dataset) --vpath (folder_to_your_videos) --fpath ./data/(name_dataset)/frames/
```

Note that ffmpeg is required here for frame extraction.

In addition, please extract the provided .zip files to `./data/(name_dataset)` for each dataset.
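For orientation, the per-video extraction performed by `get_frames.py` is conceptually similar to the ffmpeg call sketched below; the actual script's frame naming and sampling rate may differ, and the example paths are hypothetical.

```python
import os
import subprocess

def extract_frames(video_path, out_dir, fps=None):
    """Dump a video into individual frame images using ffmpeg.

    Rough illustration of what preprocess/get_frames.py does; the real
    script's output naming and sampling rate may differ.
    """
    os.makedirs(out_dir, exist_ok=True)
    cmd = ["ffmpeg", "-i", video_path]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]  # optional re-sampling of the frame rate
    cmd += [os.path.join(out_dir, "img_%05d.jpg")]
    subprocess.run(cmd, check=True)

# Hypothetical usage:
# extract_frames("./data/gtea/videos/some_video.mp4",
#                "./data/gtea/frames/some_video")
```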
## Training

- To train Bridge-Prompt on Breakfast from the Kinetics-400 pre-trained model, run:

  ```bash
  bash scripts/run_train.sh ./configs/breakfast/breakfast_ft.yaml
  ```

- To train Bridge-Prompt on GTEA from the Kinetics-400 pre-trained model, run:

  ```bash
  bash scripts/run_train.sh ./configs/gtea/gtea_ft.yaml
  ```

- To train Bridge-Prompt on 50Salads from the Kinetics-400 pre-trained model, run:

  ```bash
  bash scripts/run_train.sh ./configs/salads/salads_ft.yaml
  ```
## Extracting frame features

We use the Bridge-Prompt pre-trained image encoders to extract frame-wise features for further downstream tasks (e.g., action segmentation). You can run the following command for each dataset respectively:

```bash
python extract_frame_features.py --config ./configs/(dataset_name)/(dataset_name)_exfm.yaml --dataset (dataset_name)
```
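Conceptually, the extraction loop encodes every frame of a video with the fine-tuned image encoder and stacks the results, as in the sketch below. It assumes a CLIP-style `image_encoder` and `preprocess` transform and is not the exact logic of `extract_frame_features.py`, which is driven by the `*_exfm.yaml` config.

```python
import glob
import numpy as np
import torch
from PIL import Image

@torch.no_grad()
def extract_video_features(image_encoder, preprocess, frame_dir, device="cuda"):
    """Encode every frame of one video into a (num_frames, dim) feature array.

    Simplified sketch: `image_encoder` stands for the Bridge-Prompt fine-tuned
    image encoder and `preprocess` for its input transform.
    """
    feats = []
    for path in sorted(glob.glob(f"{frame_dir}/*.jpg")):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        feats.append(image_encoder(img).squeeze(0).cpu().numpy())
    return np.stack(feats)  # typically saved per video as a .npy file
```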
Since 50Salads and Breakfast are large-scale datasets, we extract the frame features in window splits. To combine the splits, run:

```bash
python preprocess/combine_features.py
```

Please set the variables `dataset` and `feat_name` in `combine_features.py` accordingly for each dataset.
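The combination step essentially concatenates the per-window `.npy` feature files of each video, roughly as sketched below; the file naming pattern shown is hypothetical, and `combine_features.py` follows its own convention.

```python
import glob
import numpy as np

def combine_splits(video_name, feat_dir, out_path):
    """Concatenate window-split feature files of one video into a single array.

    Illustrative only: preprocess/combine_features.py uses its own
    `dataset`/`feat_name` variables and file naming scheme.
    """
    split_files = sorted(glob.glob(f"{feat_dir}/{video_name}_split*.npy"))
    features = np.concatenate([np.load(f) for f in split_files], axis=0)
    np.save(out_path, features)
```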
## Action segmentation

You can reproduce the action segmentation results with ASFormer using the previously extracted frame features.
## Activity recognition

You can reproduce the activity recognition results on Breakfast, based on the previously extracted frame features, with:

```bash
python ft_acti.py
```
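As a rough illustration of the setup, activity recognition on top of the extracted frame features can be viewed as training a lightweight classification head over temporally pooled features. The toy module below only conveys the idea; `ft_acti.py` defines the actual model, data handling, and training protocol, and the feature dimension and class count are placeholders.

```python
import torch.nn as nn

class ActivityClassifier(nn.Module):
    """Toy head for activity recognition over pre-extracted frame features.

    Illustrative only; ft_acti.py implements the real recipe. feat_dim and
    num_classes are placeholders, not values taken from the repo.
    """
    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):          # (batch, num_frames, feat_dim)
        clip_feat = frame_feats.mean(dim=1)  # temporal average pooling
        return self.fc(clip_feat)            # (batch, num_classes) logits
```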
## Ordinal action recognition

Run the ordinal action inference with:

```bash
bash scripts/run_test.sh ./configs/(dataset_name)/(dataset_name)_test.yaml
```

and check the accuracies with:

```bash
python preprocess/checknpy.py
```

Please set the variable `dataset` in `checknpy.py` for each dataset.
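The accuracy check in `checknpy.py` boils down to comparing saved prediction arrays against ground-truth labels, roughly as below; the file paths are hypothetical, and the real script resolves its inputs from the `dataset` variable.

```python
import numpy as np

# Hypothetical paths -- preprocess/checknpy.py uses its own locations.
preds = np.load("./results/gtea_preds.npy")
labels = np.load("./results/gtea_labels.npy")

accuracy = (preds == labels).mean()
print(f"ordinal action accuracy: {accuracy:.4f}")
```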
## Notes

Please modify the `pretrain` path in all config files according to your own working directories.
## License
MIT License.