Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.

Overview

WSDEC

This is the official repository for our NeurIPS 2018 paper Weakly Supervised Dense Event Captioning in Videos.

Description

Repo directories

  • ./: global config files, training, and evaluation scripts;
  • ./data: data directory;
  • ./model: our final models used to reproduce the results;
  • ./runs: the default output directory used to store trained models and result files;
  • ./scripts: helper scripts;
  • ./third_party: third-party dependencies, including the official evaluation scripts;
  • ./utils: helper functions;
  • ./train_script: all training scripts;
  • ./eval_script: all evaluation scripts.

Dependency

  • Python 2.7
  • CUDA 9.0 (note: you will encounter a bug saying Segmentation fault (core dumped) if you run our code with CUDA 8.0)
    • But it seems that the bug may still exist even with CUDA 9.0; see the issue
  • PyTorch 0.3.1 (note: the code is not compatible with newer versions)
  • numpy, hdf5, and other necessary packages (no special requirements)

Usage for reproduction

Before we start

Before training and testing, we should make sure the data and third-party dependencies are prepared. Here are the step-by-step instructions to get everything ready.

1. Clone our repo and submodules

git clone --recursive https://github.com/XgDuan/WSDEC
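
If you cloned the repository without --recursive, the submodules (e.g. the evaluation scripts under third_party) can still be fetched afterwards:

    cd WSDEC
    git submodule update --init --recursive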

2. Download all the data

  • Download the official C3D features. You can either download the data from the official website or from our OneDrive cloud.

    • Download from the official website; (note: after you download the C3D features, you can either place the file in the data folder and rename it anet_v1.3.c3d.hdf5, or create a soft link in the data directory with ln -s YOURC3DFeature data/anet_v1.3.c3d.hdf5)
  • Download the dense video captioning data from the official website; (similar to the C3D features, you are supposed to place the downloaded data in the data folder and rename it densecap)

  • Download the data for the official evaluation scripts densevid_eval;

    • run the command sh download.sh in the folder PREFIX/WSDEC/third_party/densevid_eval;
  • [Good news]: we provide a shell script to download all the data for you; just run the following commands (a quick sanity check of the downloaded C3D file is sketched below):

    cd data
    sh download.sh
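
After the downloads finish, you can quickly check that the C3D feature file is readable. This is only a minimal sketch: it assumes h5py is installed and does not rely on any particular internal layout of anet_v1.3.c3d.hdf5.

    # minimal sanity check for the downloaded C3D features (assumes h5py is installed)
    from __future__ import print_function
    import h5py

    with h5py.File('data/anet_v1.3.c3d.hdf5', 'r') as f:
        keys = list(f.keys())
        print('number of entries:', len(keys))
        print('first entry:', keys[0], f[keys[0]])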
    

3. Generate the dictionary for the caption model

python scripts/caption_preprocess.py
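
This writes the caption dictionary (the word/index mappings used by the caption model) into the data folder. The snippet below only sketches how to inspect it; the output file name (assumed here to be data/translator.pkl) and its exact structure depend on the preprocessing script, so treat both as assumptions.

    # inspect the generated caption dictionary
    # (the file name data/translator.pkl and its structure are assumptions)
    from __future__ import print_function
    import pickle

    with open('data/translator.pkl', 'rb') as f:
        translator = pickle.load(f)
    print(type(translator))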

Training

There are two steps for model training: first, pretrain a reasonably good caption model; second, train the final/baseline model.

pretrain the caption model

python train_script/train_cg_pretrain.py

train our final model

python train_script/train_final.py --checkpoint_cg YOUR_PRETRAINED_CAPTION_MODEL.ckp --alias MODEL_NAME

train baselines

  1. train the baseline model without classification loss:
python train_script/train_baseline_regressor.py --checkpoint_cg YOUR_PRETRAINED_CAPTION_MODEL.ckp --alias MODEL_NAME
  2. train the baseline model without regression branch:
python train_script/train_final.py --checkpoint_cg YOUR_PRETRAINED_CAPTION_MODEL.ckp --regressor_scale 0 --alias MODEL_NAME
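
Putting the training steps together, a typical reproduction run could look like the following. The pretrained checkpoint path and the --alias values are placeholders; replace them with your own.

    # 1. pretrain the caption model
    python train_script/train_cg_pretrain.py
    # 2. train the final model from the pretrained caption checkpoint
    python train_script/train_final.py --checkpoint_cg YOUR_PRETRAINED_CAPTION_MODEL.ckp --alias final_model
    # 3. train the two baselines
    python train_script/train_baseline_regressor.py --checkpoint_cg YOUR_PRETRAINED_CAPTION_MODEL.ckp --alias baseline_noclass
    python train_script/train_final.py --checkpoint_cg YOUR_PRETRAINED_CAPTION_MODEL.ckp --regressor_scale 0 --alias baseline_noregress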

About the arguments

All the arguments we use can be found in the corresponding training scripts. You can also pass your own arguments if you like, but please note that some arguments are deprecated (this is our own reimplementation of the paper; the first version of the code was too messy for anyone to use).
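
If you want to see which arguments a script accepts before overriding them, one quick way (assuming the scripts define their options with argparse-style add_argument calls, which is an assumption about the code) is to grep for them:

    grep -n "add_argument" train_script/train_final.py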

Testing

Testing is easier than training. First, during training, our scripts call densevid_eval in a subprocess every time the eval function runs, so you can get a general sense of the final performance by looking at the eval_results.txt file. Second, after some epochs, you can run the evaluation scripts:

  1. evaluate the full model or the no_regression model:
python eval_script/evaluate.py --checkpoint YOUR_TRAINED_MODEL.ckp
  2. evaluate the no_classification model:
python eval_script/evaluate_baseline_regressor.py --checkpoint YOUR_TRAINED_MODEL.ckp
  3. evaluate the pretrained model with random temporal segments:
python eval_script/evaluate_pretrain.py --checkpoint YOUR_PRETRAIN_CAPTION_MODEL.ckp
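
If you have saved several checkpoints under the default output directory runs/, you can evaluate them in one go. The glob pattern below is only an assumption about how the training scripts name their checkpoint files; adjust it to match yours:

    # evaluate every checkpoint of a full/no_regression model
    # (the runs/MODEL_NAME/*.ckp pattern is an assumption)
    for ckp in runs/MODEL_NAME/*.ckp; do
        python eval_script/evaluate.py --checkpoint "$ckp"
    done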

Other usages

Besides reproducing our work, there are at least two other interesting things you can do with our code.

Train a supervised sentence localization model

To learn what sentence localization is, you can have a look at our paper ABLR. Note that our work in fact provides an unsupervised solution to sentence localization; here we introduce the usage of the supervised model. We have written the trainer, so you can just run the following command and have a cup of coffee:

python train_script/train_sl.py

Train a supervised video event caption generation model

If you have read our paper, you will have found that event captioning is the dual task of the aforementioned sentence localization task. To train such a model, just run the following command:

python train_script/train_cg.py

BUGS

You may encounter a CUDA internal bug that says Segmentation fault (core dumped) during training if you are using CUDA 8.0. If this happens, try upgrading your CUDA to 9.0.

Other

We will add more descriptions of how to use our code. Please feel free to contact us if you have any questions or suggestions.

Trained model and results

Links for our trained model

You can download our pretrained models for evaluation or further usage from our OneDrive, which includes a pretrained caption generator (cg_pretrain.ckp), a baseline model without classification loss (baseline_noclass.ckp), a baseline model without regression branch (baseline_noregress.ckp), and our final model (final_model.ckp).
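
For example, after downloading the checkpoints (the paths below assume the files sit in the repository root; adjust them to wherever you saved the files), they can be evaluated with the scripts described in the Testing section:

    python eval_script/evaluate.py --checkpoint final_model.ckp
    python eval_script/evaluate.py --checkpoint baseline_noregress.ckp
    python eval_script/evaluate_baseline_regressor.py --checkpoint baseline_noclass.ckp
    python eval_script/evaluate_pretrain.py --checkpoint cg_pretrain.ckp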

Cite the paper and give us a star ⭐️

If you find our paper or code useful, please cite our paper using the following bibtex:

@incollection{NIPS2018_7569,
title = {Weakly Supervised Dense Event Captioning in Videos},
author = {Duan, Xuguang and Huang, Wenbing and Gan, Chuang and Wang, Jingdong and Zhu, Wenwu and Huang, Junzhou},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {3062--3072},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7569-weakly-supervised-dense-event-captioning-in-videos.pdf}
}