End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

Overview

Official implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)

[paper] [VALSE paper digest (in Chinese)]

This repo supports:

  • two video captioning tasks: dense video captioning and video paragraph captioning
  • two datasets: ActivityNet Captions and YouCook2
  • video features: C3D, TSN, and TSP
  • visualization of the generated captions of your own videos


Updates

  • (2021.11.19) add code for running PDVC on raw videos and visualizing the generated captions (supports Chinese and other non-English languages)
  • (2021.11.19) add pretrained models with TSP features. They achieve 9.03 METEOR (2021) and 6.05 SODA_c, very competitive results on ActivityNet Captions without self-critical sequence training.
  • (2021.08.29) add TSN pretrained models and support for YouCook2

Introduction

PDVC is a simple yet effective framework for end-to-end dense video captioning with parallel decoding: it formulates dense caption generation as a set prediction task. Without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results, surpassing state-of-the-art methods when its localization accuracy is on par with them.

(Framework overview: pdvc.jpg)

Preparation

Environment: Linux, GCC>=5.4, CUDA >= 9.2, Python>=3.7, PyTorch>=1.5.1

  1. Clone the repo
git clone --recursive https://github.com/ttengwang/PDVC.git
  2. Create a virtual environment with conda
conda create -n PDVC python=3.7
source activate PDVC
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirement.txt
  3. Compile the deformable attention layer (requires GCC >= 5.4); a quick sanity check follows this list.
cd pdvc/ops
sh make.sh
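
Before moving on, it can help to verify that PyTorch sees the GPU and that the deformable attention op compiled. This is only a rough check; the extension's module name below is assumed from Deformable DETR (on which the op is based), so adjust it if your local build installs under a different name.

# Check that PyTorch and CUDA are visible inside the PDVC environment
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Optionally, try importing the compiled deformable attention extension
# (module name assumed from Deformable DETR; adjust if your build differs)
python -c "import MultiScaleDeformableAttention" && echo "deformable attention op OK"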

Running PDVC on Your Own Videos

Download a pretrained model (GoogleDrive) with TSP features and put it into ./save. Then run:

video_folder=visualization/videos
output_folder=visualization/output
pdvc_model_path=save/anet_tsp_pdvc/model-best.pth
output_language=en
bash test_and_visualize.sh $video_folder $output_folder $pdvc_model_path $output_language

Check the $output_folder: you will see a new video with embedded captions. Note that non-English captions are produced by translating the English captions with Google Translate. To produce Chinese captions, set output_language=zh-cn. For other languages, find the abbreviation of your language at this url; you may also need to download a font supporting your language and put it into ./visualization.
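
Besides the rendered video, the pipeline also writes the raw predictions as JSON (visualization/output/generated_captions/dvc_results.json, the file referenced by visualization/visualization.py and in the comments below). A minimal sketch for inspecting it is shown here; the top-level results key follows the visualization script, while the per-event fields (timestamp, sentence) are assumptions about the schema, so adjust them after a first look at the file.

# Minimal sketch: print the predicted events per video from the output JSON
python - <<'EOF'
import json
results = json.load(open('visualization/output/generated_captions/dvc_results.json'))['results']
for vid, events in results.items():
    print(vid)
    for ev in events:
        # 'timestamp' and 'sentence' are assumed field names
        print('  ', ev.get('timestamp'), ev.get('sentence'))
EOF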

(Demo: demo.gif)

Training and Validation

Download Video Features

cd data/anet/features
bash download_anet_c3d.sh
# bash download_anet_tsn.sh
# bash download_i3d_vggish_features.sh
# bash download_tsp_features.sh
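
Before launching training, it is worth checking that the features actually landed on disk. The exact subdirectory names depend on which download script was run, so treat the paths below as placeholders.

# Rough sanity check that the downloaded features exist and are non-empty
du -sh data/anet/features/*
ls data/anet/features | head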

Dense Video Captioning

  1. PDVC with learnt proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}
# The script will evaluate the model for every epoch. The results and logs are saved in `./save`.

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
  2. PDVC with ground-truth proposals
# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}

Video Paragraph Captioning

  1. PDVC with learnt proposals
# Training
config_path=cfgs/anet_c3d_pdvc.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID} 

# Evaluation
eval_folder=anet_c3d_pdvc # specify the folder to be evaluated
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
  2. PDVC with ground-truth proposals
# Training
config_path=cfgs/anet_c3d_pdvc_gt.yml
python train.py --cfg_path ${config_path} --criteria_for_best_ckpt pc --gpu_id ${GPU_ID}

# Evaluation
eval_folder=anet_c3d_pdvc_gt
python eval.py --eval_folder ${eval_folder} --eval_transformer_input_type gt_proposals --gpu_id ${GPU_ID}

Performance

Dense video captioning

| Model | Features | config_path | Url | Recall | Precision | BLEU4 | METEOR (2018) | METEOR (2021) | CIDEr | SODA_c |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PDVC_light | C3D | cfgs/anet_c3d_pdvcl.yml | Google Drive | 55.30 | 58.42 | 1.55 | 7.13 | 7.66 | 24.80 | 5.23 |
| PDVC | C3D | cfgs/anet_c3d_pdvc.yml | Google Drive | 55.20 | 57.36 | 1.82 | 7.48 | 8.09 | 28.16 | 5.47 |
| PDVC_light | TSN | cfgs/anet_tsn_pdvcl.yml | Google Drive | 55.34 | 57.97 | 1.66 | 7.41 | 7.97 | 27.23 | 5.51 |
| PDVC | TSN | cfgs/anet_tsn_pdvc.yml | Google Drive | 56.21 | 57.46 | 1.92 | 8.00 | 8.63 | 29.00 | 5.68 |
| PDVC_light | TSP | cfgs/anet_tsp_pdvcl.yml | Google Drive | 55.24 | 57.78 | 1.77 | 7.94 | 8.55 | 28.25 | 5.95 |
| PDVC | TSP | cfgs/anet_tsp_pdvc.yml | Google Drive | 55.79 | 57.39 | 2.17 | 8.37 | 9.03 | 31.14 | 6.05 |
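
To reproduce the PDVC rows above from scratch, one straightforward (untested) pattern is to loop the training and evaluation commands over the config files listed in the table. This sketch assumes each run saves its checkpoints under ./save in a folder named after the config basename, as the default commands above suggest.

# Hedged sketch: train and evaluate PDVC for each feature set in the table
GPU_ID=0
for cfg in anet_c3d_pdvc anet_tsn_pdvc anet_tsp_pdvc; do
    python train.py --cfg_path cfgs/${cfg}.yml --gpu_id ${GPU_ID}
    python eval.py --eval_folder ${cfg} --eval_transformer_input_type queries --gpu_id ${GPU_ID}
done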


Video paragraph captioning

| Model | Features | config_path | BLEU4 | METEOR | CIDEr |
| --- | --- | --- | --- | --- | --- |
| PDVC | C3D | cfgs/anet_c3d_pdvc.yml | 9.67 | 14.74 | 16.43 |
| PDVC | TSN | cfgs/anet_tsn_pdvc.yml | 10.18 | 15.96 | 20.66 |
| PDVC | TSP | cfgs/anet_tsp_pdvc.yml | 10.46 | 16.42 | 20.91 |

Notes:

  • Paragraph-level scores are evaluated on the ActivityNet Entity ae-val set.

Citation

If you find this repo helpful, please consider citing:

@inproceedings{wang2021end,
  title={End-to-End Dense Video Captioning with Parallel Decoding},
  author={Wang, Teng and Zhang, Ruimao and Lu, Zhichao and Zheng, Feng and Cheng, Ran and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6847--6857},
  year={2021}
}
@article{wang2021echr,
  author={Wang, Teng and Zheng, Huicheng and Yu, Mingjing and Tian, Qian and Hu, Haifeng},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={Event-Centric Hierarchical Representation for Dense Video Captioning},
  year={2021},
  volume={31},
  number={5},
  pages={1890--1900},
  doi={10.1109/TCSVT.2020.3014606}
}

Acknowledgement

The implementation of the Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. We thank the authors for their efforts.

Comments
  • No such file or directory: 'visualization/output/generated_captions/dvc_results.json'

    Hi, when I run the code below:

    video_folder=visualization/videos
    output_folder=visualization/output
    pdvc_model_path=save/anet_tsp_pdvc/model-best.pth
    output_language=en
    bash test_and_visualize.sh $video_folder $output_folder $pdvc_model_path $output_language
    

    and the error is generated:

    from densevid_eval3.SODA.soda import SODA
    ModuleNotFoundError: No module named 'densevid_eval3.SODA.soda'

    START VISUALIZATION
    Traceback (most recent call last):
      File "visualization/visualization.py", line 154, in
        d = json.load(open(opt.dvc_file))['results']
    FileNotFoundError: [Errno 2] No such file or directory: 'visualization/output/generated_captions/dvc_results.json'

    So where is dvc_results.json, and how can I get it?

    thanks

    opened by yumulinfeng1 12
  • Question about the result difference of video paragraph captioning

    Thanks for the great work! I notice that in Table 4 of your paper, PDVC achieves "B@4 11.80 | M 15.93 | C 27.27" on the ActivityNet Captions ae-val set, but the README reports "B@4 10.18 | M 15.96 | C 20.66" for PDVC with TSN features. I wonder whether the difference between the two datasets (ActivityNet Captions vs. ActivityNet Entity) is what leads to such different results? Looking forward to your reply.

    opened by wanghao14 7
  • Few questions about training

    Hello @ttengwang, I am trying to train your model from scratch (just for learning purposes). However, I am facing a few issues:

    1. The train_caption_file / val_caption_file does not contain labels, which are used in video_dataset.py (and also in the class loss). Am I using the wrong file?
    2. I tried labels from the action_proposal dataset (with the captioning-related part removed), but loss_ce does not decrease at all, in either train or val (did you face any issues like this?). Also, loss_ce stays in the range of 300-400.
    3. How many epochs did you train before getting decent captions?
    opened by saharshleo 7
  • "Running PDVC on Your Own Videos": Did i miss something?

    Hi

    Thank you for your great work

    I loaded your pretrained model and ran your code on my own video dataset (SumMe, a video summarization benchmark), but the results are really weird: most captions do not reflect the visual content.

    https://user-images.githubusercontent.com/56618962/167997537-5b21d8bc-a9b2-4e97-b735-93dfe36189e3.mp4

    I just loaded your models and ran them on these videos, and most of the generated captions are very strange. Did I miss something?

    thank you

    opened by Jeiyoon 6
  • About the experimental results

    Hello, I would like to ask: when I train PDVC with learnt proposals, should I compare the results against the "Dense video captioning (with learnt proposals)" numbers in the README, or against the "Predicted proposals" numbers in the paper? And what is the difference between predicted proposals and learnt proposals? I would be grateful if you could clear up this confusion.

    opened by llljjj88 5
  • About evaluation indicators

    I would like to ask whether the evaluation metrics reported for the reproduced model are only for video paragraph captioning. If so, how can I obtain the evaluation metrics for the dense video captioning task?

    opened by llljjj88 5
  • i3d+vggish results

    Hello professor,

    When I reproduced your 'i3d+vggish' model, I found that I cannot achieve the same results as the original paper. I don't know if there is something wrong with my settings.

    Thanks

    opened by upccpu 4
  • BLEU4/CIDEr

    I'm sorry to bother you again, professor. Regarding the BLEU4 and CIDEr metrics for the dense video captioning task in the paper, how can I obtain these two results?

    opened by cyy-1102 4
  • Paragraph captioning results measured with GT proposals are lower than expected

    Hello! I used the TSP features and the pretrained model. When testing with predicted proposals I get results close to those in the README (BLEU4: 10.46, METEOR: 16.43, CIDEr: 20.92), which are better than Table 4 of the paper, probably because of the stronger features.

    But when I test with GT proposals using the same features and model, I get (BLEU4: 11.17, METEOR: 15.58, CIDEr: 22.70), which is clearly worse than Table 4 of the paper. Why is that? Is the model used for testing with GT proposals different from the one used for testing with predicted proposals?

    If convenient, could you send me the model's predictions under both the predicted-proposal and GT-proposal settings? We plan to collect the outputs of several models for a human evaluation. My email is [email protected]. Thanks!

    opened by PKULiuHui 4
  • caption my custom video

    Hi @ttengwang, thanks for sharing your wonderful work! I want to caption my own custom videos, but unfortunately most captioning code starts from extracted features, and few instructions are provided for the extraction process. This is very inconvenient for me, because I'm not so familiar with the captioning task and just want to use the tool for some applications. Could you please give me some detailed instructions on how to get captions from a raw video? I would appreciate it a lot!

    Thanks, Zhihong

    opened by dawnlh 4
  • A question about object detection

    Thank you so much for this wonderful project. When I tried to run your code on my validation set, I ran into some problems. For example, in one video a cat runs out of a Christmas gift box, but the prediction is "a woman runs out of the Christmas gift box". Another video shows some sheep walking, and the prediction is that some horses are walking. It seems the model can recognize the action, but not the type of object. I think this may be a problem with ActivityNet, because the animal categories in the dataset only contain dogs and horses. Could you please provide pretrained weights obtained from pre-training on ImageNet-22K? I think this may really help the model when it comes to object detection. Finally, thank you for your contribution.

    opened by qt2139 3