Syntax-Aware Action Targeting for Video Captioning

Code for SAAT from "Syntax-Aware Action Targeting for Video Captioning" (Accepted to CVPR 2020). The implementation is based on "Consensus-based Sequence Training for Video Captioning".

Dependencies

Check out the coco-caption and cider projects into your working directory; they provide the caption evaluation code.
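
A tiny sanity check, assuming the two repositories are cloned directly under the working directory as described above (a sketch, not part of the repo):

import os

# The evaluation code expects coco-caption and cider to be checked out
# next to the SAAT code; adjust the names if your checkout differs.
for repo in ('coco-caption', 'cider'):
    if not os.path.isdir(repo):
        raise RuntimeError(repo + ' not found in the working directory; please clone it first')
print('coco-caption and cider found')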

Data

Data can be downloaded here (1.6GB). This folder contains:

  • input/msrvtt: annotated captions (note that val_videodatainfo.json is a symbolic link to train_videodatainfo.json)
  • output/feature: extracted features of IRv2, C3D and category embeddings (a loading sketch follows this list)
  • output/metadata: preprocessed annotations
  • output/model_svo/xe: model file and generated captions on test videos; the reported result (CIDEr 49.1 for XE training) can be reproduced with the model provided in this folder
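
The feature files are HDF5 archives. The sketch below shows one way to inspect their layout with h5py; the file name msrvtt_roi_feat.h5 is taken from the comments further down and should be adjusted to whatever is actually inside output/feature:

import h5py

# Hypothetical file name; pick any .h5 file that is actually in output/feature.
feat_path = 'output/feature/msrvtt_roi_feat.h5'

with h5py.File(feat_path, 'r') as f:
    keys = list(f.keys())
    print('number of datasets:', len(keys))
    # Print the shape and dtype of the first few datasets to see the feature layout.
    for k in keys[:5]:
        print(k, f[k].shape, f[k].dtype)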

Test

make -f SpecifiedMakefile test [options]

Please refer to the Makefile (and the opts_svo.py file) for the set of available train/test options. For example, to reproduce the reported result:

make -f Makefile_msrvtt_svo test GID=0 EXP_NAME=xe FEATS="irv2 c3d category" BFEATS="roi_feat roi_box" USE_RL=0 CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG LAMBDA=20

Train

To train the model using XE loss:

make -f Makefile_msrvtt_svo train GID=0 EXP_NAME=xe FEATS="irv2 c3d category" BFEATS="roi_feat roi_box" USE_RL=0 CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCH=100 LAMBDA=20

To use different input features, modify the FEATS variable in the commands above (e.g. FEATS="irv2 category" to drop the C3D motion features) and make sure the corresponding feature files exist under output/feature.

Citation

@InProceedings{Zheng_2020_CVPR,
author = {Zheng, Qi and Wang, Chaoyue and Tao, Dacheng},
title = {Syntax-Aware Action Targeting for Video Captioning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

Acknowledgements

  • PyTorch implementation of CST
  • PyTorch implementation of SCST
Comments
  • How do I get the roi_feat for custom video data?

    In your misc/extract_feats_roi.py code, coco_demo.run_on_opencv_image() is expected to return three values: result, top_preds and top_roi_feats.

    But in the maskrcnn code you linked, only one value (result) is returned, as shown in the screenshot below.

    [screenshot of the linked maskrcnn code omitted]

    opened by dcahn12 10
  • How many GPU memory did you use?

    Hi, I used a Titan XP to train and test your code, but I get a 'CUDA out of memory' error. How can I use multiple GPUs to train your code, or how can I switch the dataset to MSVD?

    opened by tuyunbin 6
  • Why using the ground truth to calculate the test loss?

    Hi, when you calculate the test loss, pred is used to compute it, like this:

    pred, gt_seq, gt_logseq, _, _, _ = model(feats, bfeats, labels, labels_svo)

    But in the forward function of the model, it uses:

    lan_cont = self.embed(torch.cat((svo_it[:,1:2], it.unsqueeze(1)), 1))

    So why is the ground truth, i.e. the variable it (which comes from the label), used here to get the test loss? Why not use the last predicted word to generate the whole predicted sequence and then compute the loss, since in the test phase we should not see the labels until the loss is calculated?

    Thank you very much!

    opened by RyanLiut 6
  • Is it right to use 3D-ResNets-Pytorch for 3D motion features extraction?

    Hello amigo! I have a question about the extraction of the 3D motion features. The paper Syntax-Aware Action Targeting for Video Captioning states that C3D was used for motion feature extraction. However, I am a little confused about the difference between '3D-ResNets' and 'C3D'. Are these two the same thing? Looking forward to your reply!

    opened by MarcusNerva 5
  • About regional Dataset

    1. msrvtt_roi_feat.h5
    2. msrvtt_roi_box.h5

    These two h5 files contain 10000 datasets each. There are only 6513 videos, so why does each file have 10000 datasets? What is the meaning of each dataset? Is it per frame?

    opened by Dorothylyly 5
  • About extraction of 3D feature

    Hello! I'm trying to use your project code to test my own videos, and thanks for providing the feature extraction code. We imported the models from the 3D-ResNets-Pytorch demo you linked into the misc folder, but we run into the following problem when extracting 3D features:

    Traceback (most recent call last):
      File "./misc/extract_feats_motion.py", line 82, in <module>
        model, _ = generate_model(opt)
      File "/media/louyu/DATA/SAAT/misc/model.py", line 51, in generate_model
        n_classes=opt.n_classes)
      File "/media/louyu/DATA/SAAT/misc/models/resnext.py", line 66, in generate_model
        **kwargs)
      File "/media/louyu/DATA/SAAT/misc/models/resnext.py", line 53, in __init__
        shortcut_type, n_classes)
      File "/media/louyu/DATA/SAAT/misc/models/resnet.py", line 132, in __init__
        shortcut_type)
      File "/media/louyu/DATA/SAAT/misc/models/resnet.py", line 189, in _make_layer
        downsample=downsample))
    TypeError: __init__() got an unexpected keyword argument 'in_planes'

    I wonder how to solve this problem. Thanks!

    opened by BBBoundary 5
  • About the frame features

    Hi,

    Thanks for the paper and the code. I am wondering: when you used IRv2 to extract 2D frame features and selected 28 uniformly spaced frames per video, how did you aggregate these 28 feature vectors into one vector per video? Concatenate and project? It does not seem to be clear in the paper.

    Thank you!

    opened by RyanLiut 4
  • About the best checkpoint

    Hi, when I run this code a new checkpoint is generated after each epoch, and some epochs give a new best score. Do I need to early-stop, or just run 200 epochs and take the final checkpoint? Is the final checkpoint the best-performing one?

    opened by AcodeC 4
  • How to use the 2d and 3d features extracted by ./misc code in SAAT?

    I notice that the paper uses the original features provided by CST: the C3D features are in h5 format and mean pooling is used to get a video-level feature. But the scripts in your misc folder extract features in .npy format. So how can the 2D and 3D features extracted by the ./misc code be used in SAAT? I wonder whether much code has to be changed to use the new features, and how to use them (a conversion sketch follows this comment).

    opened by AcodeC 4
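
    A minimal sketch of the conversion asked about above, assuming per-frame features saved as one .npy array per video by the ./misc scripts and the mean-pooled, one-dataset-per-video h5 layout described in this comment; the directory and output file names are hypothetical (the output name follows the pattern seen in the Makefile error below), and the exact keys SAAT expects are discussed in a later comment:

    import glob
    import os
    import h5py
    import numpy as np

    npy_dir = 'output/feature/c3d_npy'                       # hypothetical: one .npy of per-frame features per video
    out_path = 'output/feature/msrvtt_train_c3d_mp1.h5'      # assumed naming, check the Makefile

    with h5py.File(out_path, 'w') as out:
        for path in sorted(glob.glob(os.path.join(npy_dir, '*.npy'))):
            vid = os.path.splitext(os.path.basename(path))[0]  # e.g. 'video0'
            frame_feats = np.load(path)                        # assumed shape (num_frames, feat_dim)
            out[vid] = frame_feats.mean(axis=0)                # mean-pool to a single video-level vector
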
  • Bugs report

    When I run make -f Makefile_msrvtt_svo train GID=0 EXP_NAME=xe FEATS="irv2 c3d category" BFEATS="roi_feat roi_box" USE_RL=0 CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCH=100 LAMBDA=20

    I got: make: Nothing to be done for 'train'. What happened?

    When I want to train on MSVD, I got: make: *** No rule to make target 'output/feature/msrvtt_train_resnet_mp1.h5', needed by 'output/model_svo/xe/resnetc3dcategory_msrvtt_concat_CIDEr_64_0.0001_20.pth'. Stop. Can you upload Makefile_msvd_svo?

    opened by MarcusNerva 3
  • Key mapping of videos in h5 files

    The code generates h5 files in which each video has a weird key (corresponding to its URL, like "HzYtvOYOEoU_21_32"), but SAAT expects integer keys like 0, 1, 2 and so on. The easiest mapping would be obtained by replacing vid with cnt here, i.e. the key expected by SAAT is the index of the video while enumerating (a sketch of this mapping follows this comment).

    Could you please confirm if this is the actual mapping that is supposed to be used? It is a bit risky to assume anything on my own for key mappings.

    opened by ashryaagr 3
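
    A minimal sketch of the enumeration-based mapping proposed above, keying each dataset by the running index (cnt) instead of the URL-derived video id (vid); the extractor function and input list are hypothetical, and whether this ordering matches what SAAT's metadata assumes still needs to be confirmed:

    import h5py
    import numpy as np

    def write_features(out_path, videos, extract_fn):
        # videos: iterable of (vid, frames) pairs; extract_fn: hypothetical feature extractor.
        with h5py.File(out_path, 'w') as out:
            for cnt, (vid, frames) in enumerate(videos):
                feat = np.asarray(extract_fn(frames))
                out[str(cnt)] = feat   # integer-style key '0', '1', ... as SAAT expects, instead of vid
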
  • ModuleNotFoundError: No module named 'resnet'

    When I run ./misc/extract_feats_2D.py, the following error appears:

    Traceback (most recent call last):
      File "E:/code/video_captioning/SAAT/misc/extract_feats_2D.py", line 7, in <module>
        from resnet import resnet101
    ModuleNotFoundError: No module named 'resnet'

    opened by ydwl-lynn 0
  • Captions on single raw video

    Hey @SydCaption, many thanks for open-sourcing such awesome work! 👍

    I am unable to work out the steps for running inference with your pre-trained models on my own dataset. Could you please provide some help here? Many thanks!

    opened by amil-rp-work 0