VLG-Net: Video-Language Graph Matching Networks for Video Grounding
Introduction
Official repository for VLG-Net: Video-Language Graph Matching Networks for Video Grounding. [ArXiv Preprint]
The paper was accepted to the first edition of the ICCV workshop: AI for Creative Video Editing and Understanding (CVEU).
Installation
Clone the repository and move into the folder:
git clone https://github.com/Soldelli/VLG-Net.git
cd VLG-Net
Install the environment:
conda env create -f environment.yml
If the installation fails, please follow the instructions in the file doc/environment.md (link).
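If you prefer to set the environment up by hand, a minimal sketch is shown below. The Python version and package list here are assumptions for illustration only; doc/environment.md remains the authoritative reference.

# Manual setup sketch; versions and packages are illustrative, see doc/environment.md
conda create -n vlg python=3.7 -y
conda activate vlg
conda install -y pytorch torchvision cudatoolkit=10.1 -c pytorch
pip install yacs h5py tqdm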
Data
Download the following resources and extract the content into the appropriate destination folder (see the table below).
| Resource | Download Link | File Size | Destination Folder |
|---|---|---|---|
| StandfordCoreNLP-4.0.0 | link | ~0.5 GB | ./datasets/ |
| TACoS | link | ~0.5 GB | ./datasets/ |
| ActivityNet-Captions | link | ~29 GB | ./datasets/ |
| DiDeMo | link | ~13 GB | ./datasets/ |
| GCNeXt warmup | link | ~0.1 GB | ./datasets/ |
| Pretrained Models | link | ~0.1 GB | ./models/ |
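As an illustration, the commands below create the destination folders and extract the archives into them. The archive names are placeholders; substitute the actual file names obtained from the download links.

mkdir -p datasets models
# Archive names are placeholders for the files downloaded from the links above
unzip standford-corenlp-4.0.0.zip -d datasets/
tar -xzf tacos.tar.gz -C datasets/
tar -xzf activitynet1.3.tar.gz -C datasets/
tar -xzf didemo.tar.gz -C datasets/
tar -xzf gcnext_warmup.tar.gz -C datasets/
tar -xzf pretrained_models.tar.gz -C models/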
The folder structure should be as follows:
.
├── configs
│
├── datasets
│ ├── activitynet1.3
│ │ ├── annotations
│ │ └── features
│ ├── didemo
│ │ ├── annotations
│ │ └── features
│ ├── tacos
│ │ ├── annotations
│ │ └── features
│ ├── gcnext_warmup
│ └── standford-corenlp-4.0.0
│
├── doc
│
├── lib
│ ├── config
│ ├── data
│ ├── engine
│ ├── modeling
│ ├── structures
│ └── utils
│
├── models
│ ├── activitynet
│ └── tacos
│
├── outputs
│
└── scripts
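To quickly check that the layout matches the tree above, you can list the directories (names must match exactly):

find datasets models -maxdepth 2 -type d | sort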
Training
Copy and paste the following commands into the terminal. An example of overriding config options on the command line is shown after the training commands.
Load environment:
conda activate vlg
- For ActivityNet-Captions dataset, run:
python train_net.py --config-file configs/activitynet.yml OUTPUT_DIR outputs/activitynet
- For TACoS dataset, run:
python train_net.py --config-file configs/tacos.yml OUTPUT_DIR outputs/tacos
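The trailing KEY VALUE pair (OUTPUT_DIR above) overrides the corresponding option from the YAML config on the command line. As an illustration of the syntax, the run below uses a hypothetical SOLVER.MAX_EPOCH key; check lib/config for the actual option names.

# SOLVER.MAX_EPOCH is a hypothetical key shown only to illustrate the override syntax
python train_net.py --config-file configs/tacos.yml \
    OUTPUT_DIR outputs/tacos_run2 \
    SOLVER.MAX_EPOCH 20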
Evaluation
For simplicity, we provide scripts that automatically run inference with the pretrained models. See the script details if you want to run inference on a different model; a sketch of such a call is shown after the commands below.
Load environment:
conda activate vlg
Then run one of the following scripts to launch the evaluation.
- For ActivityNet-Captions dataset, run:
bash scripts/activitynet.sh
- For TACoS dataset, run:
bash scripts/tacos.sh
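The scripts wrap the inference call, so evaluating a different checkpoint amounts to editing the command inside them. A sketch of what such a call might look like is below; the test_net.py entry point, the MODEL.WEIGHT key, and the checkpoint path are assumptions, so open scripts/tacos.sh for the actual invocation.

# Hypothetical invocation; verify the entry point and key names in scripts/*.sh
python test_net.py --config-file configs/tacos.yml \
    MODEL.WEIGHT models/tacos/best_model.pth \
    OUTPUT_DIR outputs/tacos_eval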
Expected results:
After cleaning the code and fixing a couple of minor bugs, performance changed slightly with respect to the numbers reported in the paper. See the tables below.
| ActivityNet | [email protected] | [email protected] | [email protected] | [email protected] |
|---|---|---|---|---|
| Paper | 46.32 | 29.82 | 77.15 | 63.33 |
| Current | 46.32 | 29.79 | 77.19 | 63.36 |
| TACoS | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] |
|---|---|---|---|---|---|---|
| Paper | 57.21 | 45.46 | 34.19 | 81.80 | 70.38 | 56.56 |
| Current | 57.16 | 45.56 | 34.14 | 81.48 | 70.13 | 56.34 |
Citation
If any part of our paper or code is helpful to your work, please cite it with:
@inproceedings{soldan2021vlg,
title={VLG-Net: Video-Language Graph Matching Network for Video Grounding},
author={Soldan, Mattia and Xu, Mengmeng and Qu, Sisi and Tegner, Jesper and Ghanem, Bernard},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={3224--3234},
year={2021}
}