PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Overview

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

PyTorch code for our ACL 2020 paper "MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning" by Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, and Mohit Bansal

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history to help better predict the next sentence (w.r.t. coreference and repetition), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events.
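For intuition, here is a minimal, illustrative sketch of a MART-style gated memory update. This is a simplification, not the repository's exact implementation: the class, layer names, and default sizes are assumptions, and it targets a recent PyTorch (nn.MultiheadAttention with batch_first needs PyTorch >= 1.9, newer than the version listed below).

import torch
import torch.nn as nn

class GatedMemoryUpdater(nn.Module):
    # Illustrative sketch only: summarize [old memory; current hidden states]
    # with attention, then gate between the old memory and a candidate update.
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.proj_c = nn.Linear(hidden_size, hidden_size)  # candidate memory
        self.proj_z = nn.Linear(hidden_size, hidden_size)  # update gate

    def forward(self, memory, hidden_states):
        # memory: (B, M, D) summary of past segments/sentences
        # hidden_states: (B, L, D) states for the current video segment + sentence
        kv = torch.cat([memory, hidden_states], dim=1)
        summary, _ = self.attn(memory, kv, kv)
        candidate = torch.tanh(self.proj_c(summary))
        gate = torch.sigmoid(self.proj_z(summary))
        # Gated interpolation keeps a compact history across sentence steps.
        return gate * memory + (1.0 - gate) * candidate

The gating is the key design choice: the memory is updated, not recomputed, at each sentence step, which is what lets the model track coreference and avoid repeating itself.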


Getting started

Prerequisites

  1. Clone this repository
# no need to add --recursive as all dependencies are copied into this repo.
git clone https://github.com/jayleicn/recurrent-transformer.git
cd recurrent-transformer
  2. Prepare feature files

Download features from Google Drive: rt_anet_feat.tar.gz (39GB) and rt_yc2_feat.tar.gz (12GB). These features are repacked from features provided by densecap.

mkdir video_feature && cd video_feature
tar -xf path/to/rt_anet_feat.tar.gz 
tar -xf path/to/rt_yc2_feat.tar.gz 
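After unpacking, it is worth sanity-checking one of the feature files. A minimal check, assuming per-video .npy files (the directory layout and filename below are hypothetical; inspect the unpacked archives for the real ones):

import numpy as np

# Hypothetical path; replace with a file that actually exists after unpacking.
feat = np.load("video_feature/anet_trainval/v_example_resnet.npy")
print(feat.shape, feat.dtype)  # expect something like (num_clips, feature_dim)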
  3. Install dependencies
  • Python 2.7
  • PyTorch 1.1.0
  • nltk
  • easydict
  • tqdm
  • tensorboardX
  4. Add project root to PYTHONPATH
source setup.sh

Note that you need to do this each time you start a new session.
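If you prefer not to source the script, the equivalent effect from inside Python is simply putting the repo root on the import path (a sketch; setup.sh itself just extends PYTHONPATH):

import sys

# Adjust to wherever you cloned recurrent-transformer.
sys.path.insert(0, "/path/to/recurrent-transformer")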

Training and Inference

We give examples of how to perform training and inference with MART.

  1. Build Vocabulary
bash scripts/build_vocab.sh DATASET_NAME

DATASET_NAME can be anet for ActivityNet Captions or yc2 for YouCookII.
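Conceptually, vocabulary building just counts caption tokens and keeps the frequent ones plus special tokens. A minimal sketch (the function, threshold, and token names are assumptions, not the script's exact behavior):

from collections import Counter

def build_vocab(sentences, min_count=3):
    # Count whitespace tokens across all training captions.
    counter = Counter(w for s in sentences for w in s.lower().split())
    # Special tokens first, then words above the frequency threshold.
    vocab = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]
    vocab += [w for w, c in counter.most_common() if c >= min_count]
    return {w: i for i, w in enumerate(vocab)}

word2idx = build_vocab(["a man is cooking", "the man cooks pasta"], min_count=1)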

  2. MART training

The general training command is:

bash scripts/train.sh DATASET_NAME MODEL_TYPE

MODEL_TYPE can be one of [mart, xl, xlrg, mtrans, mart_no_recurrence], see details below.

MODEL_TYPE          Description
mart                Memory-Augmented Recurrent Transformer
xl                  Transformer-XL
xlrg                Transformer-XL with recurrent gradient
mtrans              Vanilla Transformer
mart_no_recurrence  MART with recurrence disabled

To train our MART model on ActivityNet Captions:

bash scripts/train.sh anet mart
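What distinguishes the recurrent variants (mart, xl, xlrg) from mtrans and mart_no_recurrence is segment-level recurrence: a video's sentences are handled in order, with state carried across segments. A minimal sketch of that loop, with a hypothetical model API (the real loop lives in src/train.py):

import torch

def paragraph_loss(model, criterion, segment_feats, segment_caps):
    # Process one video's segments in order, threading the memory state.
    memory = None
    loss = torch.zeros(())
    for feats, caps in zip(segment_feats, segment_caps):
        logits, memory = model(feats, caps, memory)  # memory links sentences
        loss = loss + criterion(logits.flatten(0, 1), caps.flatten())
    return loss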

Training log and model will be saved at results/anet_re_*.
Once you have a trained model, you can follow the instructions below to generate captions.

  3. Generate captions
bash scripts/translate_greedy.sh anet_re_* val

Replace anet_re_* with your own model directory name. The generated captions are saved at results/anet_re_*/greedy_pred_val.json.
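Greedy decoding here means picking the argmax next token at each step until the end token. A minimal sketch (the model call signature is a hypothetical stand-in for the repo's actual interface):

import torch

@torch.no_grad()
def greedy_decode(model, video_feats, bos_id, eos_id, max_len=30):
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(video_feats, torch.tensor([tokens]))  # (1, t, vocab_size)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens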

  4. Evaluate generated captions
bash scripts/eval.sh anet val results/anet_re_*/greedy_pred_val.json

The results should be comparable to those we report in Table 2 of the paper, e.g., B@4 10.33; R@4 5.18.
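Here B@4 is BLEU@4, and R@4 measures 4-gram repetition within a generated paragraph (lower is better). A sketch of one common way to compute such a repetition score; consult the paper and the evaluation scripts for the exact definition used:

from collections import Counter

def rep_at_4(paragraph):
    # Fraction of 4-grams that are repeats of an earlier 4-gram.
    words = paragraph.lower().split()
    grams = [tuple(words[i:i + 4]) for i in range(len(words) - 3)]
    if not grams:
        return 0.0
    repeated = sum(c - 1 for c in Counter(grams).values())
    return repeated / len(grams)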

Citations

If you find this code useful for your research, please cite our paper:

@inproceedings{lei2020mart,
  title={MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning},
  author={Lei, Jie and Wang, Liwei and Shen, Yelong and Yu, Dong and Berg, Tamara L and Bansal, Mohit},
  booktitle={ACL},
  year={2020}
}

Others

This code uses resources from the following projects: transformers, transformer-xl, densecap, OpenNMT-py.

Contact

jielei [at] cs.unc.edu

Comments
  • empty dataset and video&description to .npy&csv&json

    When I execute scripts/train.sh yc2 mart, I get this error (error screenshots omitted). len(train_dataset) gives me 0, yet all the steps prior to the training stage have been completed successfully. How can I solve this problem, please? @jayleicn I work on Google Colab.

    opened by Tikquuss 7
  • Inference on single video

    Hi,

    Thank you for your work. I'm writing code to allow easy testing of caption generation on user videos. Can you tell me how I can generate captions for my own videos?

    What features are required? Is there feature-extraction code I can use? What is the command to run inference on my own videos?

    Thanks

    opened by nikky4D 7
  • Function 'LogSoftmaxBackward' returned nan values in its 0th output.

    First, the train.sh file has a problem: ${dset_name} should be ${model_type} on lines 39, 42, 48, and 50. But when the model_type is 'xl' or 'xlrg', I get this error:

    File "src/train.py", line 639, in <module>
      main()
    File "src/train.py", line 635, in main
      train(model, train_loader, val_loader, device, opt)
    File "src/train.py", line 330, in train
      model, training_data, optimizer, ema, device, opt, writer, epoch_i)
    File "src/train.py", line 78, in train_epoch
      input_masks_list, token_type_ids_list, input_labels_list)
    File "src/rtransformer/model.py", line 40, in forward
      output = self.log_softmax(output[valid_indices])

    Training =>: 0%| | 0/313 [00:05<?, ?it/s]
    Traceback (most recent call last):
      File "src/train.py", line 639, in <module>
        main()
      File "src/train.py", line 635, in main
        train(model, train_loader, val_loader, device, opt)
      File "src/train.py", line 330, in train
        model, training_data, optimizer, ema, device, opt, writer, epoch_i)
      File "src/train.py", line 131, in train_epoch
        loss.backward()
    RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

    Please help me, thanks!

    opened by lingshi0606 4
  • Consultation on the experimental results in the paper

    Thank you for your code! I have a question: are the experimental results of the other methods (e.g., VTransformer, GVD, AdvInf) in Tables 1 and 2 obtained from your own experiments or taken from the corresponding papers?

    opened by lszp 3
  • I cannot reproduce the results on yc2. Can you share some parameters?

    Hello, it is me again. I can reproduce the results of your MART paper on anet, but on yc2 I get CIDEr 23.46 (not 35.74), METEOR 15.1 (not 15.9), and BLEU@4 6.88 (not 8). Can you share any parameters that differ from anet and would help yc2 reach a 30+ CIDEr? Thanks very much.

    opened by lingshi0606 1
  • no such file or directory: model_tmp_greedy_val_lang.json

    Hi, thanks for your good work and for sharing your code.

    When I ran the command "bash scripts/train.sh yc2 mart" today, I got an error during training like the following:

    File "/home/chenj0g/Desktop/mart/recurrent-transformer/src/utils.py", line 18, in load_json
      with open(file_path, "r") as f:
    IOError: [Errno 2] No such file or directory: '/home/chenj0g/Desktop/mart/recurrent-transformer/results/yc2_re_init_2021_01_03_15_09_03/model_tmp_greedy_pred_val_lang.json'

    Do you know how I could solve this problem?

    opened by junchen14 1
  • Wrong shape in memory initializer & updater

    Hi,

    Thanks for sharing the code. I found a small error in the memory initializer & updater when using MART. The memory has shape [N, L, intermediate_size], but the projection layers map from hidden_size to hidden_size. If intermediate_size and hidden_size are not equal, this raises a shape-mismatch error.

    opened by YoPatapon 1
  • Pre-trained model?

    Hello! Thank you for your interesting work. I was trying to generate captions, but I can't find a pretrained model. Is no pretrained model provided, or did I miss something? Do I have to train the whole network just to evaluate it? Thank you.

    opened by DesaleF 0
Owner
Jie Lei 雷杰
UNC CS PhD student, vision+language.