# Semantic Grouping Network for Video Captioning

Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo. AAAI 2021. [arxiv]
## Environment
- Ubuntu 16.04
- CUDA 9.2
- cuDNN 7.4.2
- Java 8
- Python 2.7.12
- PyTorch 1.1.0
- Other python packages specified in requirements.txt
## Usage

### 1. Setup
$ pip install -r requirements.txt
### 2. Prepare Data

- Download the GloVe embedding from here and place it at data/Embeddings/GloVe/GloVe_300.json.
- Extract features from the datasets and place them at data/&lt;dataset&gt;/features/&lt;network&gt;.hdf5, e.g. the ResNet101 features of the MSVD dataset go to data/MSVD/features/ResNet101.hdf5 (see the sketch after this list). I refer to this repo for extracting the ResNet101 features, and this repo for extracting the 3D-ResNext101 features.
- Split the features into train, val, and test sets by running the following commands.

  $ python -m split.MSVD
  $ python -m split.MSR-VTT

You can skip the feature extraction and splitting steps above and download the prepared files instead:

- MSVD
- MSR-VTT
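Whether you extract the features yourself or download the prepared files, each feature file is an HDF5 file that, judging from the layout above, maps video IDs to per-frame feature matrices. Below is a minimal sketch for sanity-checking such a file with h5py; the layout assumed here is an educated guess, so adapt it to what the repo's data loaders actually expect.

```python
from __future__ import print_function

import h5py

# Assumed layout: one entry per video id, each holding a
# (num_frames, feature_dim) float array, e.g. 2048-d ResNet101 features.
feature_fpath = "data/MSVD/features/ResNet101.hdf5"

with h5py.File(feature_fpath, "r") as f:
    video_ids = list(f.keys())
    print("number of videos:", len(video_ids))

    example_id = video_ids[0]
    features = f[example_id][()]  # load one feature matrix into memory
    print("example video id:", example_id)
    print("feature shape:", features.shape)  # expected (num_frames, feature_dim)
```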
### 3. Prepare the Code for Evaluation

Clone the evaluation code from the official coco-caption repo.
$ git clone https://github.com/tylin/coco-caption.git
$ mv coco-caption/pycocoevalcap .
$ rm -rf coco-caption
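The cloned pycocoevalcap package provides the standard captioning metrics (BLEU, METEOR, ROUGE-L, CIDEr); the METEOR scorer and the PTB tokenizer are Java-based, which is why Java 8 appears in the environment list. evaluate.py drives these scorers for you, but as a rough illustration of how they are called on reference and generated captions (the toy captions below are made up):

```python
from __future__ import print_function

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Each scorer expects two dicts keyed by the same ids: gts maps an id to a
# list of reference captions, res maps the same id to the generated caption.
gts = {"vid1": ["a man is playing a guitar", "a person plays the guitar"]}
res = {"vid1": ["a man plays a guitar"]}

bleu, _ = Bleu(4).compute_score(gts, res)   # returns BLEU-1..4
cider, _ = Cider().compute_score(gts, res)
print("BLEU-4:", bleu[3])
print("CIDEr:", cider)
```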
### 4. Extract Negative Videos

$ python extract_negative_videos.py

Alternatively, you can skip this step, as the output files are already provided under data/.
### 5. Train

$ python train.py

You can change some hyperparameters by modifying config.py (see the sketch below).
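The exact hyperparameter names in config.py are not reproduced here; the excerpt below is a purely hypothetical illustration of the kind of fields you would edit, so check the real file for the actual attribute names and the settings used in the paper.

```python
# config.py (hypothetical excerpt -- the real attribute names and values
# in this repository may differ; edit the actual fields in config.py).
class TrainConfig(object):
    batch_size = 64   # reduce if you run out of GPU memory
    lr = 1e-4         # initial learning rate
    epochs = 30       # number of training epochs
```

After editing, simply rerun `$ python train.py`.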
### Pretrained Models: SGN(R101+RN)

- MSVD: https://drive.google.com/file/d/12Xjd8VdDiyvBxM9sPnnXz87Wa_eVv0ii/view?usp=sharing
- MSR-VTT: https://drive.google.com/file/d/1kx7FBi2UBCgIP7R9ideMpwXY0Gnqn7Yx/view?usp=sharing

*Disclaimer: the models above do not have the same weights as the models used in the paper (I lost the original checkpoints and trained these again).
### 6. Evaluate

$ python evaluate.py --ckpt_fpath &lt;checkpoint_fpath&gt;
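If you want to check a downloaded or freshly trained checkpoint before running the full evaluation, you can inspect it with torch.load; the file name below is a placeholder, and the stored keys depend on how the checkpoint was saved.

```python
from __future__ import print_function

import torch

# Placeholder path -- point this at the checkpoint you downloaded or trained.
ckpt_fpath = "SGN_MSVD.ckpt"

# map_location="cpu" lets you inspect the file on a machine without a GPU.
checkpoint = torch.load(ckpt_fpath, map_location="cpu")

# A checkpoint is typically a state_dict (a dict of tensors) or a dict that
# wraps one; printing the top-level keys shows what was actually saved.
if isinstance(checkpoint, dict):
    print("top-level keys:", list(checkpoint.keys())[:10])
```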
## License

The source code in this repository is released under the MIT License.