# Towards Diverse Paragraph Captioning for Untrimmed Videos
This repository contains the PyTorch implementation of our paper Towards Diverse Paragraph Captioning for Untrimmed Videos (CVPR 2021).
## Requirements
- Python 3.6
- Java 15.0.2
- PyTorch 1.2
- numpy, tqdm, h5py, scipy, six
## Training & Inference
### Data preparation
- Download the pre-extracted video features of the ActivityNet Captions or Charades Captions datasets from BaiduNetdisk (code: he21).
- Decompress the downloaded files into the corresponding dataset folder under the `ordered_feature/` directory (a sketch for inspecting the features follows this list).
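If you want to sanity-check the downloaded features before training, a minimal sketch along the following lines may help. It assumes the features are stored as HDF5 files (h5py is listed in the requirements); the file path and the one-dataset-per-video layout are placeholders, not necessarily the repository's actual format.

```python
# Hedged sketch: peek at a downloaded feature file with h5py.
# The path and key layout below are hypothetical placeholders.
import h5py

feature_path = 'ordered_feature/activitynet/features.h5'  # hypothetical path

with h5py.File(feature_path, 'r') as f:
    for video_id in list(f.keys())[:3]:      # inspect the first few entries
        feats = f[video_id][()]              # e.g. (num_frames, feature_dim)
        print(video_id, feats.shape, feats.dtype)
```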
### Start training
- Train our model without reinforcement learning. `*` can be `activitynet` or `charades`:

```bash
$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/dm.token/model.json ../results/*/dm.token/path.json --is_train
```
- Fine-tune the pretrained model with self-critical training, using both accuracy and diversity rewards (a sketch of the self-critical update follows this command):

```bash
$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/dm.token.rl/model.json ../results/*/dm.token.rl/path.json --is_train --resume_file ../results/*/dm.token/model/epoch.*.th
```
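For reference, self-critical fine-tuning follows the SCST pattern: the reward of a sampled caption is baselined by the reward of the greedily decoded caption, and the resulting advantage weights the sampled sequence's log-probability. The sketch below is a generic illustration with a combined accuracy/diversity reward, not the repository's exact implementation; `accuracy_reward`, `diversity_reward`, and `alpha` are hypothetical placeholders.

```python
# Generic self-critical (SCST-style) update with a combined reward.
# This is an illustration, not the repository's actual training code.
import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """sample_logprobs: (batch,) summed log-probs of the sampled captions.
    sample_reward / greedy_reward: (batch,) scalar rewards per caption."""
    advantage = sample_reward - greedy_reward      # greedy result as baseline
    return -(advantage.detach() * sample_logprobs).mean()

# The combined reward might look like (alpha is a hypothetical weight):
# reward = alpha * accuracy_reward + (1 - alpha) * diversity_reward
```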
- Train our model with key-frame selection:

```bash
$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/key_frames/model.json ../results/*/key_frames/path.json --is_train --resume_file ../results/*/key_frames/pretrained.th
```
This variant achieves slightly worse results while using only half of the video features at inference time, which speeds up decoding. You need to download the `pretrained.th` model first for key-frame selection.
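As a rough illustration of why key-frame selection halves the inference cost: frames can be scored for importance and only the top half kept before decoding. The module below is a generic top-k gating sketch, not the paper's actual selection mechanism.

```python
# Illustrative top-k key-frame gating: score frames, keep the top half.
# A generic sketch, not the paper's selection module.
import torch
import torch.nn as nn

class KeyFrameSelector(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)      # per-frame importance score

    def forward(self, feats):                     # feats: (num_frames, feat_dim)
        scores = self.scorer(feats).squeeze(-1)
        k = feats.size(0) // 2                    # keep half of the frames
        idx = scores.topk(k).indices.sort().values  # restore temporal order
        return feats[idx]
```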
## Evaluation
The trained checkpoints are saved in the `results/*/folder/model/` directory. After evaluation, the generated captions (corresponding to the name file in `public_split`) and the evaluation scores are saved in `results/*/folder/pred/tst/`.
```bash
$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/folder/model.json ../results/*/folder/path.json --eval_set tst --resume_file ../results/*/folder/model/epoch.*.th
```
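To inspect the generated paragraphs after evaluation, something like the following may work; it assumes the predictions are saved as a JSON file mapping video ids to paragraphs, and the file name is a hypothetical placeholder (check the actual contents of `pred/tst/`).

```python
# Hedged sketch: load and print a few generated captions.
# The file name and JSON structure are assumptions, not the repo's spec.
import json

pred_file = '../results/activitynet/folder/pred/tst/predictions.json'  # hypothetical
with open(pred_file) as f:
    captions = json.load(f)

for vid, paragraph in list(captions.items())[:3]:
    print(vid, '->', paragraph)
```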
We also provide pretrained models for the ActivityNet dataset here and the Charades dataset here; they were re-trained and achieve results similar to those reported in the paper.
## Reference
If you find this repo helpful, please consider citing:
```bibtex
@inproceedings{song2021paragraph,
  title={Towards Diverse Paragraph Captioning for Untrimmed Videos},
  author={Song, Yuqing and Chen, Shizhe and Jin, Qin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}
```