# ALPRO

Align and Prompt: Video-and-Language Pre-training with Entity Prompts [Paper]

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi
Official PyTorch code for ALPRO. This repository supports pre-training as well as finetuning on

- Text-Video Retrieval on MSRVTT and DiDeMo.
- Video Question Answering on MSRVTT and MSVD.
## Requirements

Our implementation is tested on Ubuntu 20.04.1 with NVIDIA A100 GPUs. Support for other platforms and hardware may work but is not guaranteed. To install the required packages:

```bash
cd env && bash install_pkg.sh
```
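If the installation succeeds, a quick sanity check is possible; a minimal sketch that only assumes the PyTorch and Horovod packages used by the scripts below:

```bash
# Confirm PyTorch sees the GPUs and Horovod imports cleanly.
python -c "import torch; print(torch.__version__, 'GPUs:', torch.cuda.device_count())"
python -c "import horovod.torch as hvd; hvd.init(); print('horovod size:', hvd.size())"
```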
## Data Preparation

1. Download Annotations and Pre-trained Checkpoints

    - Text annotations
    - Checkpoints of the pre-trained and finetuned models
    - External resources
    - Unzip `data.zip`, `output.zip`, `ext.zip` under `ALPRO/`.
2. Download the raw videos of the downstream datasets.

    - MSRVTT:
        - Download train_val_videos.zip and test_videos.zip from e.g. here.
        - Check the md5sums (see the checksum sketch after this list):

            ```
            51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip
            0af68454cec9d586e92805739f3911d0  test_videos.zip
            ```

        - Unzip all the videos into `data/msrvtt_ret/videos` (10k videos in total).
        - Create the following soft link:

            ```bash
            ln -s data/msrvtt_ret/videos data/msrvtt_qa/videos
            ```
    - MSVD:
        - Download from the official release:

            ```bash
            wget -nc https://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar
            ```

        - Check the md5sum:

            ```
            9bdb20fcf14d59524a6febca9f6a8d89  YouTubeClips.tar
            ```

        - Unzip all the videos into `data/msvd_qa/videos` (1,970 videos in total):

            ```bash
            mkdir data/msvd_qa/videos/
            tar xvf YouTubeClips.tar -C data/msvd_qa/videos --strip-components=1
            ```
    - DiDeMo:
        - Follow the instructions and download from the official release here.
        - Unzip all the videos into `data/didemo_ret/videos`.
        - Note that a couple of videos may be missing; see here to download them. As they account for only a small portion of the training set, it is safe to ignore them.
        - Convert all the DiDeMo videos into `*.mp4` format using e.g. ffmpeg (see the conversion sketch after this list).
        - We obtained 10,463 videos following these steps (with one video, `77807177@N00_5753455690_1e04ccb364`, missing).
3. The directory is expected to have the structure below:

    ```
    .
    |-config_release # configuration files
    |-data # text annotations and raw videos
    |---didemo_ret
    |-----txt
    |-----videos
    |---msrvtt_qa/...
    |---msrvtt_ret/...
    |---msvd_qa/...
    |-env # scripts to install packages
    |-ext # external resources, e.g. bert tokenizer
    |-output # checkpoints for pre-trained/finetuned models
    |---downstreams
    |-----didemo_ret
    |-------public
    |---------ckpt # official finetuned checkpoints
    |---------log # inference log
    |---------results_test
    |-----------step_best_1_mean
    |-----msrvtt_qa/...
    |-----msrvtt_ret/...
    |-----msvd_qa/...
    |-run_scripts # bash scripts to launch experiments
    |-src # source code
    ```
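As referenced in the MSRVTT and MSVD steps above, the downloaded archives can be verified with `md5sum` before unzipping; a minimal sketch, run from wherever the files were downloaded (the expected values are the ones listed in the steps above):

```bash
# Verify the downloaded archives against the checksums listed in the steps above.
md5sum train_val_videos.zip test_videos.zip YouTubeClips.tar
# expected:
# 51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip
# 0af68454cec9d586e92805739f3911d0  test_videos.zip
# 9bdb20fcf14d59524a6febca9f6a8d89  YouTubeClips.tar
```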
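As referenced in the DiDeMo step above, a minimal ffmpeg conversion sketch; the `videos_raw` staging directory and the codec choices are assumptions, not part of the official pipeline:

```bash
# Convert raw DiDeMo downloads to .mp4 under data/didemo_ret/videos.
# The input directory (videos_raw) and codecs are assumptions; adjust to your setup.
mkdir -p data/didemo_ret/videos
for f in data/didemo_ret/videos_raw/*; do
    out="data/didemo_ret/videos/$(basename "${f%.*}").mp4"
    [ -f "$out" ] && continue    # skip files already converted
    ffmpeg -n -i "$f" -c:v libx264 -c:a aac "$out"
done
```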
## Inference with Official Checkpoints

```bash
cd run_scripts

bash inf_msrvtt_ret.sh
# {'text2video': {'r1': 33.9, 'r5': 60.7, 'r10': 73.2, 'medianR': 3.0, 'meanR': 27.404}}

bash inf_didemo_ret.sh
# {'text2video': {'r1': 35.9, 'r5': 67.5, 'r10': 78.8, 'medianR': 3.0, 'meanR': 19.125}}

bash inf_msrvtt_qa.sh
# {'ratios': {'what_ratio': [68.48, 49872], 'who_ratio': [27.99, 20385], 'how_ratio': [2.25, 1640], 'where_ratio': [0.34, 250], 'when_ratio': [0.93, 677]}, 'overall_acc': 42.12, 'what_acc': 36.05, 'who_acc': 52.24, 'how_acc': 85.67, 'where_acc': 42.8, 'when_acc': 78.88}

bash inf_msvd_qa.sh
# {'ratios': {'what_ratio': [61.93, 8150], 'who_ratio': [34.6, 4554], 'how_ratio': [2.81, 370], 'where_ratio': [0.21, 28], 'when_ratio': [0.44, 58]}, 'overall_acc': 45.91, 'what_acc': 37.02, 'who_acc': 58.59, 'how_acc': 81.62, 'where_acc': 46.43, 'when_acc': 72.41}
```
## Downstream Task Finetuning

- To finetune on downstream tasks with the pre-trained checkpoint `output/pretrain/alpro_pretrained_ckpt.pt`:

    ```bash
    cd run_scripts

    bash ft_msrvtt_ret.sh
    bash ft_didemo_ret.sh
    bash ft_msrvtt_qa.sh
    bash ft_msvd_qa.sh
    ```

    For example, with MSRVTT retrieval:

    ```bash
    cd ALPRO/

    export PYTHONPATH="$PYTHONPATH:$PWD"
    echo $PYTHONPATH

    CONFIG_PATH='config_release/msrvtt_ret.json'

    # Change -np to the number of GPUs to use; change --output_dir to a local path
    # where the finetuning checkpoints and logs will be stored.
    horovodrun -np 8 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir /export/home/workspace/experiments/alpro/finetune/msrvtt_ret/$(date '+%Y%m%d%H%M%S')
    ```
- Run inference with locally-finetuned checkpoints.

    ```bash
    cd ALPRO/

    export PYTHONPATH="$PYTHONPATH:$PWD"
    echo $PYTHONPATH

    STEP='best'

    CONFIG_PATH='config_release/msrvtt_ret.json'
    OUTPUT_DIR='[INPUT_YOUR_OUTPUT_PATH_HERE]'

    TXT_DB='data/msrvtt_ret/txt/test.jsonl'
    IMG_DB='data/msrvtt_ret/videos'

    horovodrun -np 8 python src/tasks/run_video_retrieval.py \
        --do_inference 1 \
        --inference_split test \
        --inference_model_step $STEP \
        --inference_txt_db $TXT_DB \
        --inference_img_db $IMG_DB \
        --inference_batch_size 64 \
        --output_dir $OUTPUT_DIR \
        --config $CONFIG_PATH
    ```

    `OUTPUT_DIR` is the path passed to the `--output_dir` option in the finetuning script. `$STEP` is a string that tells the script to use the checkpoint `$OUTPUT_DIR/ckpt/model_step_$STEP.pt` for inference.
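For instance, with the MSRVTT finetuning command above (the timestamped directory below is only an illustration of a path that command would create):

```bash
# Hypothetical finetuning output directory; with STEP='best' the inference script
# will load $OUTPUT_DIR/ckpt/model_step_best.pt.
OUTPUT_DIR='/export/home/workspace/experiments/alpro/finetune/msrvtt_ret/20220101120000'
STEP='best'
ls "$OUTPUT_DIR/ckpt/model_step_${STEP}.pt"
```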
## Pretraining

- Prepare the pre-training data:

    - Put WebVid2M videos under `data/webvid2m`;
    - 💡 we downsample WebVid2M videos to 10% of the original FPS to speed up video loading (see the downsampling sketch after this list);
    - Update `data/cc3m/txt/cc3m.json` with local image paths.

- Training the prompter:

    ```bash
    cd run_scripts && bash pt_prompter.sh
    ```

- Training the video-language model:

    ```bash
    cd run_scripts && bash pt_alpro.sh
    ```

    If you would like to use a custom prompter weight, please change `teacher_weights_path` in `config_release/pretrain_alpro.json`.

- To finetune with your own pre-trained checkpoints, please change `e2e_weights_path` in the finetuning config files, e.g. `config_release/msrvtt_ret.json` (see the config sketch after this list).
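A minimal sketch of the FPS downsampling mentioned above, using ffprobe/ffmpeg; the paths are placeholders and the exact pipeline the authors used is not specified in this repository:

```bash
# Reduce a WebVid2M clip to 10% of its original frame rate (paths are placeholders).
in='data/webvid2m/raw/example.mp4'
out='data/webvid2m/example.mp4'
# Probe the source frame rate, e.g. "30000/1001".
fps=$(ffprobe -v error -select_streams v:0 -show_entries stream=avg_frame_rate \
      -of default=noprint_wrappers=1:nokey=1 "$in")
target=$(python -c "from fractions import Fraction; print(float(Fraction('$fps')) * 0.1)")
ffmpeg -n -i "$in" -vf "fps=$target" -c:a copy "$out"
```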
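The `teacher_weights_path` and `e2e_weights_path` edits above are plain JSON key updates; a minimal sketch, assuming the keys sit at the top level of the config files (the checkpoint path below is a placeholder):

```bash
# Point e2e_weights_path at a custom checkpoint; the same pattern applies to
# teacher_weights_path in config_release/pretrain_alpro.json.
python - <<'EOF'
import json

cfg_path = 'config_release/msrvtt_ret.json'
with open(cfg_path) as f:
    cfg = json.load(f)
cfg['e2e_weights_path'] = 'output/pretrain/my_alpro_pretrained_ckpt.pt'  # placeholder path
with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=4)
EOF
```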
## Citation

If you find ALPRO useful for your research, please consider citing:

```
@inproceedings{li2021align,
  title={Align and Prompt: Video-and-Language Pre-training with Entity Prompts},
  author={Dongxu Li and Junnan Li and Hongdong Li and Juan Carlos Niebles and Steven C.H. Hoi},
  booktitle={arxiv},
  year={2021}
}
```
## Acknowledgement

We thank members of Salesforce Research for their helpful discussions.

The implementation of ALPRO relies on resources from ClipBERT, transformers, and TimeSformer. The code is implemented using PyTorch, with multi-GPU support from Horovod and gradient-checkpoint. We thank the original authors for open-sourcing their work and encourage ALPRO users to cite their works when applicable.