
Overview

Frozen in Time ❄️

A Joint Video and Image Encoder for End-to-End Retrieval

project page | arXiv | WebVid data

Repository containing the code, models, and data for end-to-end retrieval. The WebVid data can be found here.


📝 Preparation

  1. Create the conda env: conda env create -f requirements/frozen.yml

  2. Create data / experiment folders: mkdir data; mkdir exps. Note these can just be symlinks to wherever you want to store big data.

🔧 Finetuning (benchmarks: MSR-VTT)

  1. Download the data: wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P data; unzip data/MSRVTT.zip -d data

  2. Change num_gpus in the config file accordingly.

  3. Train: python train.py --config configs/msrvtt_4f_i21k.json

  4. Test: python test.py --resume exps/models/{EXP_NAME}/{EXP_TIMESTAMP}/model_best.pth

To finetune a pretrained model, set "load_checkpoint": "PATH_TO_MODEL" in the config file.
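
For example, a minimal sketch of deriving such a finetuning config from the shipped MSR-VTT one (the checkpoint path and new config name here are hypothetical):

    import json

    # Hypothetical path -- point this at the pretrained weights you downloaded.
    PRETRAINED = "exps/pretrained/cc-webvid2m-4f_stformer_b_16_224.pth.tar"

    with open("configs/msrvtt_4f_i21k.json") as f:
        config = json.load(f)

    config["load_checkpoint"] = PRETRAINED

    # Write a new config and pass it to train.py with --config.
    with open("configs/msrvtt_4f_i21k_ft.json", "w") as f:
        json.dump(config, f, indent=4)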

🏋️ Pretraining

  1. Download WebVid-2M (see https://github.com/m-bain/webvid)

  2. Download CC-3M (see https://ai.google.com/research/ConceptualCaptions/download)

  3. Train: python train.py --config CONFIG_PATH. Here are the different options:

    a. Dataset combinations

     i. CC-3M + WebVid2M: configs/cc-webvid2m-pt-i2k.json
     ii. WebVid2M : configs/webvid2m-pt-i2k.json
    

    You can add an arbitrary number of image/video datasets for pre-training by adding as many dataloaders to the config file's dataloader list as your heart desires (see the config sketch after this list). Adding more datasets will likely lead to higher downstream performance.

    b. Number of frames

    For image datasets, this should always be set to "video_params": {"num_frames": 1, ...}.

    For video datasets, set this to what you want. N.B. more frames requires more GPU memory.

    If, like us, you are not a big company and have limited compute, then you will benefit from training via a curriculum on the number of frames. A lot of the knowledge can be learned in the 1-frame setting, as we show in the paper. You can then finetune with more frames. See the curriculum learning section below.

    c. Finetuning

    Set "load_checkpoint": "FULL_MODEL_PATH" in the config file. You can now use different experiment params, such as num_frames, to do curriculum learning for example.

🗄 Pretrained Weights

📚 Curriculum Learning on #frames

Curriculum learning on the number of frames in pretraining achieves similar performance with a significant reduction in compute (both memory and training time). This is because the model has higher throughput with fewer frames, and fewer frames allow a bigger batch size for the same GPU memory.

Our best model was trained on 1 frame and then finetuned on 4 frames on CC+WebVid2M.

Train on 1 frame until the training loss converges, then finetune on 4 frames with the same config, starting from the 1-frame checkpoint by setting load_checkpoint in the config file. The 4-frame finetuning needs far fewer iterations (~10% of the 1-frame setting is sufficient), since most of the knowledge is learned in the 1-frame setting.
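
In config terms, the curriculum amounts to roughly the following sketch (field placement follows the shipped configs; the experiment names and paths here are hypothetical):

    # Stage 1 -- 1-frame pretraining: train until the training loss converges.
    stage1_overrides = {
        "video_params": {"num_frames": 1},
    }

    # Stage 2 -- 4-frame finetuning from the best 1-frame checkpoint; roughly
    # 10% of the stage-1 iterations is usually enough.
    stage2_overrides = {
        "video_params": {"num_frames": 4},
        "load_checkpoint": "exps/models/STAGE1_EXP_NAME/TIMESTAMP/model_best.pth",
    }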

📈 Experiment Logging and Visualising

This repository uses a sacred backbone for logging and tracking experiments, with a neptune front end. It makes life a lot easier. If you want to activate this:

  1. Create a neptune.ai account.
  2. Create a project, copy your credentials into train.py, and remove the ValueError.
  3. Set "neptune": true in your config files.

🎓 Cite

If you use this code in your research, please cite:

@misc{bain2021frozen,
      title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval}, 
      author={Max Bain and Arsha Nagrani and Gül Varol and Andrew Zisserman},
      year={2021},
      eprint={2104.00650},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

This code is based on the pytorch-template: https://github.com/victoresque/pytorch-template

It also adopts many good practices from Samuel Albanie's https://github.com/albanie/collaborative-experts

Comments
  • Which results in paper correspond to the finetune command?


    I ran the finetuning procedure with the command python train.py --config configs/msrvtt_4f_i21k.json.

    I got:

    [v2t_metrics]MSRVTT epoch 27, R@1: 16.1, R@5: 40.5, R@10 55.0, R@50 81.9MedR: 8, MeanR: 40.6
        epoch          : 27
        loss_0         : 0.7913076955540566
        val_loss_0     : 1.5775871678950295
        val_0_t2v_metrics_R1: 17.8
        val_0_t2v_metrics_R5: 40.6
        val_0_t2v_metrics_R10: 55.1
        val_0_t2v_metrics_R50: 81.5
        val_0_t2v_metrics_MedR: 8.0
        val_0_t2v_metrics_MeanR: 39.94
        val_0_t2v_metrics_geometric_mean_R1-R5-R10: 34.14804760940716
        val_0_v2t_metrics_R1: 16.1
        val_0_v2t_metrics_R5: 40.5
        val_0_v2t_metrics_R10: 55.0
        val_0_v2t_metrics_R50: 81.9
        val_0_v2t_metrics_MedR: 8.0
        val_0_v2t_metrics_MeanR: 40.5555
        val_0_v2t_metrics_geometric_mean_R1-R5-R10: 32.9772570568898
    Validation performance didn't improve for 10 epochs. Training stops.
    

    There are two R@1 results. Which one corresponds to the results in the paper? I found that the R@1 in Table 5 is 31.0, which seems far from this implementation.
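
    For reference, t2v and v2t are the two retrieval directions (text-to-video and video-to-text), each with its own R@K. A minimal sketch (not the repo's exact code) of how these recall metrics are typically computed from a similarity matrix:

        import numpy as np

        def recall_metrics(sims):
            """sims[i, j]: similarity between query i and gallery item j,
            where (i, i) is the ground-truth pair. Pass a texts-x-videos
            matrix for t2v and its transpose for v2t."""
            order = np.argsort(-sims, axis=1)                              # best match first
            ranks = np.argmax(order == np.arange(len(sims))[:, None], axis=1) + 1
            metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10, 50)}
            metrics["MedR"] = float(np.median(ranks))
            metrics["MeanR"] = float(np.mean(ranks))
            return metrics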

    opened by akira-l 5
  • Test set of MSR-VTT for downstream evaluation


    Hi,

    In the paper, it is described that 'Following other works [35], we train on 9K train+val videos and report results on the 1K-A test set'.

    However, in your provided code for text-to-video retrieval on MSR-VTT, it seems that the validation set and the test set are the same, namely 'val_list_jsfusion.txt' with 1K data.

    The results of your released model on the MSR-VTT test set (val_list_jsfusion.txt) are higher than those reported in the paper.

    Is 'val_list_jsfusion.txt' the test set for MSR-VTT evaluation?

    Looking forward to your reply.

    opened by geyuying 5
  • Curriculum Learning and Video-Image Joint Training


    Hi,

    I have a question about the curriculum learning. For the 1-frame pretraining, both the CC3M and WebVid-2M datasets are used. But in the 4-frame finetuning stage, did you use both video and image data for joint pretraining (4 frames for WebVid-2M and 1 frame for CC3M)? I cannot find any experimental details for "joint image-video training" in the paper.

    Thanks in advance.

    opened by vateye 4
  • The provided Pillow package doesn't support WEBP images


    A warning is shown during training with CC3M data saying so. See: https://github.com/ContinuumIO/anaconda-issues/issues/10737 So I guess these images are going to be skipped during training?
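
    For anyone hitting this, you can check whether your Pillow build has WebP support with the snippet below; if it reports False, reinstalling Pillow from pip (rather than the conda default) usually brings WebP support in.

        from PIL import features

        # True if this Pillow build was compiled with libwebp support.
        print(features.check("webp"))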

    opened by bryant1410 4
  • Code/template for the demo?


    Awesome project and great work! I was wondering if the code for the video search demo is available or could be made available? It would be very nice to have, even just for debugging the process of fine-tuning your model on a different dataset.

    opened by scottfleming 3
  • About Curriculum Learning


    Thanks for your great work! Here are some questions about curriculum learning.

    When fine-tuning from 1 frame to 4 frames,

    • do we need to interpolate the temporal position embedding ([1, dim] => [4, dim])? In my opinion, an image is seen as a 1-frame video, so if the temporal position embedding is interpolated, how can we add it to the image?
    • should we use the same hyperparameters (e.g., learning rate, epoch, warmup)?
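
    On the first question, one common option (not necessarily what this repo does) is to interpolate or simply repeat the 1-frame temporal embedding along the time axis; a rough sketch:

        import torch.nn.functional as F

        def expand_temporal_embed(temporal_embed, num_frames):
            """temporal_embed: [1, old_frames, dim] -> [1, num_frames, dim].
            With old_frames == 1 this just repeats the single embedding; images
            can still be treated as 1-frame videos by adding only the first
            temporal position to them."""
            embed = temporal_embed.permute(0, 2, 1)               # [1, dim, old_frames]
            embed = F.interpolate(embed, size=num_frames, mode="linear", align_corners=False)
            return embed.permute(0, 2, 1)                         # [1, num_frames, dim]
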
    opened by Andy1621 3
  • CC3M data error


    The files downloaded from the given CC3M link do not match this code. Reading them with pandas raises an error: pandas.errors.ParserError: Error tokenizing data. C error: Expected 43 fields in line 23, saw 45

    Can you provide the correct version of the CC3M data?
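
    For context, the raw Conceptual Captions annotation files from Google are tab-separated (caption, URL). If the parser error comes from reading them with default pandas settings, something like the snippet below usually reads them cleanly (adjust the file name, and check what format this repo's dataloader actually expects):

        import csv
        import pandas as pd

        df = pd.read_csv(
            "Train_GCC-training.tsv",        # adjust to the file you downloaded
            sep="\t",
            names=["caption", "url"],
            quoting=csv.QUOTE_NONE,          # captions can contain stray quote characters
        )
        print(len(df))
        print(df.iloc[0])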

    opened by akira-l 3
  • How can I use the pretrain results?


    https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/models/cc-webvid2m-4f_stformer_b_16_224.pth.tar

    This is the pretrained weights file you listed on the website. But the checkpoint is a .tar, and when I extract it I just get a single file; I do not know how to use it as a normal .pth checkpoint.
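
    Note that .pth.tar is typically just a naming convention for files written by torch.save, so there is usually no need to extract it; something like the following should work (key names may differ, so inspect the loaded dict first):

        import torch

        ckpt = torch.load("cc-webvid2m-4f_stformer_b_16_224.pth.tar", map_location="cpu")
        print(ckpt.keys())                          # inspect what the file contains
        state_dict = ckpt.get("state_dict", ckpt)   # fall back to the dict itself
        # model.load_state_dict(state_dict)         # once the matching model is built,
        # or point "load_checkpoint" in a config at this file and let the code handle it.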

    opened by Qi-Zhangyang 3
  • Finetuning the pretrained model on MSR-VTT


    Hi,

    Thanks for your excellent work!

    When I finetune the pretrained model that you provide on MSR-VTT, there is a warning shown below:

    "Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']"

    Is it expected?

    Thanks! Yuying

    opened by geyuying 3
  • Config for the method trained on CC3M


    I was wondering if you could provide the config used in the paper to train only on CC3M. Is it exactly like the CC3M+WebVid one but with the WebVid part removed?

    opened by bryant1410 3
  • Result about MSVD


    Hi Bain, I found that the experimental setting for the MSVD result is not clear, and MCQ reports a different version, so I want to ask about the MSVD result and setting.

    opened by shufangxun 2
  • About the effects of sliding_window_stride


    Hi Bain, I see you mentioned in other issues that setting sliding_window_stride=12 when evaluating retrieval on MSR-VTT (finetuned) helps improve the performance. I tried this but didn't get the improvement.

    After finetuning with msrvtt_4f_i21k.json, the model is tested with the command presented in the README. The results are:

    [t2v_metrics] epoch 0, R@1: 28.9, R@5: 55.6, R@10 66.2, R@50 86.8MedR: 4, MeanR: 29.9                                                                                                                                                                  
    [v2t_metrics] epoch 0, R@1: 28.4, R@5: 56.5, R@10 66.2, R@50 88.1MedR: 4, MeanR: 25.6  
    

    After setting --sliding_window_stride=12 for test.py, the results are:

    [t2v_metrics] epoch 0, R@1: 28.8, R@5: 57.7, R@10 68.5, R@50 88.0MedR: 4, MeanR: 27.3
    [v2t_metrics] epoch 0, R@1: 30.0, R@5: 58.8, R@10 68.8, R@50 89.7MedR: 4, MeanR: 22.5
    

    It shows no obvious improvement in my test.

    In #41, sliding_window_stride indeed helps improve the evaluation performance. I don't know why it doesn't work here. I keep the code in test.py unchanged and only modify some code in base_dataset.py to fit my environment (i.e., lower versions of PyTorch and TorchVision due to limitations of the computing cluster). Besides, the version of ffmpeg on my cluster is old and hard to update. Could the difference in environments be the reason for the poor results?

    I just want to use one trained model with sound performance for some test-phase experiments (e.g., adversarial attacks), so I would like a finetuned Frozen-in-Time with R@1 over 30%, as the results in your paper show. However, I have failed to get such a model. :(

    The phenomenon seems weird and I will keep checking to try to reproduce the higher results. Besides, if possible, would you mind sharing a finetuned model?

    opened by xiangyh9988 0
  • Can you share some recordings of your experiments


    Can you share some records of your experiments, such as graphs in neptune.ai or other logs tracking the performance/loss changes over the training steps?

    I would like to compare the effects of some configurations (e.g., batch size) on training convergence in depth. Since this uses a contrastive loss over a similarity matrix, it may be affected by batch size and converge more slowly with a smaller batch size. Your experiments did not use very large batch sizes and may not have reached the best performance yet. I want to try something, haha~
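
    To make the batch-size argument concrete, here is a minimal sketch of an in-batch contrastive loss over a similarity matrix (illustrative, not the repo's exact loss): each sample is contrasted against the other B - 1 items in the batch, so a larger batch gives more negatives per step.

        import torch
        import torch.nn.functional as F

        def in_batch_contrastive_loss(video_embeds, text_embeds, temperature=0.05):
            """video_embeds, text_embeds: [B, dim] for B matching pairs."""
            v = F.normalize(video_embeds, dim=-1)
            t = F.normalize(text_embeds, dim=-1)
            sims = v @ t.t() / temperature                     # [B, B] similarity matrix
            labels = torch.arange(sims.size(0), device=sims.device)
            # symmetric cross-entropy: video->text and text->video directions
            return 0.5 * (F.cross_entropy(sims, labels) + F.cross_entropy(sims.t(), labels))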

    opened by KAndHisC 2
  • Off-by-one issues with the frame sampling


    I think there may be two off-by-one issues with the frame sampling. I'm not completely sure about them and would prefer to discuss first, which is why I'm not sending a patch.

    For the first one, this is the part of the code:

    https://github.com/m-bain/frozen-in-time/blob/542164b80eb339e4dba520daf1182460d2a3d5a9/base/base_dataset.py#L152-L155

    I think it should be:

    np.linspace(start=0, stop=vlen - 1, ...)
    

    (with a - 1)

    and:

    ranges.append((interv, intervals[idx + 1]))
    

    (without the - 1).

    Otherwise, the right part of each bucket is going to be ignored. For the uniform case, instead of taking the interval centroid ((a+b)/2), it takes (a+b-1)/2. This isn't a big deal though.

    For the second one, I think the random-choice interval end should be + 1. When it does random.choice(range(...)) (which, by the way, could be a random.randrange), the range excludes the stop value, so there's another - 1 hidden there.

    For example, in the training video "1013731484", which has only one frame according to Decord, the random case would give:

    intervals == [0, 1]
    ranges = [(0, 0)]
    random.choice(range(0, 0)) == random.choice([]) <- exception
    

    And it fails silently, assigning all frames to black. Note this one also isn't a big deal, as it'd only fail for a few videos; for the rest, all the intervals would just be shifted or something like that.
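
    To make the suggestion concrete, here is a rough sketch of the sampler with both proposed fixes applied (illustrative code, not the repo's exact implementation):

        import random
        import numpy as np

        def sample_frames(num_frames, vlen, sample="rand"):
            acc_samples = min(num_frames, vlen)
            # span the full [0, vlen - 1] index range and keep bucket right edges
            intervals = np.linspace(start=0, stop=vlen - 1, num=acc_samples + 1).astype(int)
            ranges = [(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
            if sample == "rand":
                # hi + 1 so the right edge can be picked; also avoids an empty
                # range (silent failure) for 1-frame videos
                frame_idxs = [random.randrange(lo, hi + 1) for lo, hi in ranges]
            else:  # uniform: take each bucket's centroid
                frame_idxs = [(lo + hi) // 2 for lo, hi in ranges]
            return frame_idxs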

    opened by bryant1410 0