
Overview

Frozen in Time ❄️

A Joint Video and Image Encoder for End-to-End Retrieval

project page | arXiv | WebVid data

Repository containing the code, models, and data for end-to-end retrieval. The WebVid data can be found here.


📝 Preparation

  1. Create the conda env: conda env create -f requirements/frozen.yml

  2. Create data / experiment folders: mkdir data; mkdir exps. Note these can just be symlinks to wherever you want to store big data.

🔧 Finetuning (benchmarks: MSR-VTT)

  1. Download the data: wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P data; unzip data/MSRVTT.zip -d data

  2. Change num_gpus in the config file accordingly.

  3. Train: python train.py --config configs/msrvtt_4f_i21k.json

  4. Test: python test.py --resume exps/models/{EXP_NAME}/{EXP_TIMESTAMP}/model_best.pth

To finetune a pretrained model, set "load_checkpoint": "PATH_TO_MODEL" in the config file.
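
For example, a minimal sketch of deriving such a finetuning config from the shipped MSR-VTT one (the checkpoint path and new config name here are hypothetical):

    import json

    # Hypothetical path -- point this at the pretrained weights you downloaded.
    PRETRAINED = "exps/pretrained/cc-webvid2m-4f_stformer_b_16_224.pth.tar"

    with open("configs/msrvtt_4f_i21k.json") as f:
        config = json.load(f)

    config["load_checkpoint"] = PRETRAINED

    # Write a new config and pass it to train.py with --config.
    with open("configs/msrvtt_4f_i21k_ft.json", "w") as f:
        json.dump(config, f, indent=4)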

🏋️ Pretraining

  1. Download WebVid-2M (see https://github.com/m-bain/webvid)

  2. Download CC-3M (see https://ai.google.com/research/ConceptualCaptions/download)

  3. Train: python train.py --config CONFIG_PATH. Here are the different options:

    a. Dataset combinations

     i. CC-3M + WebVid2M: configs/cc-webvid2m-pt-i2k.json
     ii. WebVid2M : configs/webvid2m-pt-i2k.json
    

    You can add an arbitrary number of image/video datasets for pre-training by adding as many dataloaders to the config file's dataloader list as your heart desires (see the config sketch after this list). Adding more datasets will likely lead to higher downstream performance.

    b. Number of frames

    For image datasets, this should always be set to "video_params": {"num_frames": 1, ...}.

    For video datasets, set this to what you want. N.B. more frames requires more GPU memory.

    If, like us, you are not a big company and have limited compute, then you will benefit from training via a curriculum on the number of frames. A lot of the knowledge can be learned in the 1-frame setting, as we show in the paper. You can then finetune with more frames. See the curriculum learning section below.

    c. Finetuning

    Set "load_checkpoint": "FULL_MODEL_PATH" in the config file. You can now use different experiment params, such as num_frames, to do curriculum learning for example.

🗄 Pretrained Weights

📚 Curriculum Learning on #frames

Curriculum learning on the number of frames in pretraining achieves similar performance with a significant reduction in compute (both memory and training time). This is because the model has higher throughput with fewer frames, and fewer frames allow a bigger batch size for the same GPU memory.

Our best model was trained on 1 frame and then finetuned on 4 frames on CC+WebVid2M.

Train on 1 frame until the training loss converges, then finetune on 4 frames with the same config, starting from the 1-frame checkpoint by setting load_checkpoint in the config file. The 4-frame finetuning needs far fewer iterations (~10% of the 1-frame setting is sufficient), since most of the knowledge is learned in the 1-frame setting.
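
In config terms, the curriculum amounts to roughly the following sketch (field placement follows the shipped configs; the experiment names and paths here are hypothetical):

    # Stage 1 -- 1-frame pretraining: train until the training loss converges.
    stage1_overrides = {
        "video_params": {"num_frames": 1},
    }

    # Stage 2 -- 4-frame finetuning from the best 1-frame checkpoint; roughly
    # 10% of the stage-1 iterations is usually enough.
    stage2_overrides = {
        "video_params": {"num_frames": 4},
        "load_checkpoint": "exps/models/STAGE1_EXP_NAME/TIMESTAMP/model_best.pth",
    }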

📈 Experiment Logging and Visualising

This repository uses a sacred backbone for logging and tracking experiments, with a neptune front end. It makes life a lot easier. If you want to activate this:

  1. Create a neptune.ai account.
  2. Create a project, copy your credentials into train.py, and remove the ValueError.
  3. Set "neptune": true in your config files.

🎓 Cite

If you use this code in your research, please cite:

@misc{bain2021frozen,
      title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval}, 
      author={Max Bain and Arsha Nagrani and Gül Varol and Andrew Zisserman},
      year={2021},
      eprint={2104.00650},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

This code is based on the pytorch-template: https://github.com/victoresque/pytorch-template

It also adopts many good practices from Samuel Albanie's https://github.com/albanie/collaborative-experts

Comments
  • Which results in paper correspond to the finetune command?


    I ran the finetuning procedure with the command python train.py --config configs/msrvtt_4f_i21k.json.

    I got:

    [v2t_metrics]MSRVTT epoch 27, R@1: 16.1, R@5: 40.5, R@10 55.0, R@50 81.9MedR: 8, MeanR: 40.6
        epoch          : 27
        loss_0         : 0.7913076955540566
        val_loss_0     : 1.5775871678950295
        val_0_t2v_metrics_R1: 17.8
        val_0_t2v_metrics_R5: 40.6
        val_0_t2v_metrics_R10: 55.1
        val_0_t2v_metrics_R50: 81.5
        val_0_t2v_metrics_MedR: 8.0
        val_0_t2v_metrics_MeanR: 39.94
        val_0_t2v_metrics_geometric_mean_R1-R5-R10: 34.14804760940716
        val_0_v2t_metrics_R1: 16.1
        val_0_v2t_metrics_R5: 40.5
        val_0_v2t_metrics_R10: 55.0
        val_0_v2t_metrics_R50: 81.9
        val_0_v2t_metrics_MedR: 8.0
        val_0_v2t_metrics_MeanR: 40.5555
        val_0_v2t_metrics_geometric_mean_R1-R5-R10: 32.9772570568898
    Validation performance didn't improve for 10 epochs. Training stops.
    

    There are two R@1 results. Which one corresponds to the results in the paper? I found that the R@1 in Table 5 is 31.0, which seems far from this implementation.
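
    For reference, t2v and v2t are the two retrieval directions (text-to-video and video-to-text), each with its own R@K. A minimal sketch (not the repo's exact code) of how these recall metrics are typically computed from a similarity matrix:

        import numpy as np

        def recall_metrics(sims):
            """sims[i, j]: similarity between query i and gallery item j,
            where (i, i) is the ground-truth pair. Pass a texts-x-videos
            matrix for t2v and its transpose for v2t."""
            order = np.argsort(-sims, axis=1)                              # best match first
            ranks = np.argmax(order == np.arange(len(sims))[:, None], axis=1) + 1
            metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10, 50)}
            metrics["MedR"] = float(np.median(ranks))
            metrics["MeanR"] = float(np.mean(ranks))
            return metrics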

    opened by akira-l 5
  • Test set of MSR-VTT for downstream evaluation


    Hi,

    In the paper, it is described that 'Following other works [35], we train on 9K train+val videos and report results on the 1K-A test set'.

    However, in your provided code for text-to-video retrieval on MSR-VTT, it seems that the validation set and the test set are the same, namely 'val_list_jsfusion.txt' with 1K data.

    The results of your released model on the MSR-VTT test set (val_list_jsfusion.txt) are higher than those reported in the paper.

    Is 'val_list_jsfusion.txt' the test set for MSR-VTT evaluation?

    Looking forward to your reply.

    opened by geyuying 5
  • Curriculum Learning and Video-Image Joint Training


    Hi,

    I have a question about the curriculum learning. For the 1-frame pretraining, both the CC3M and WebVid-2M datasets are used. But in the 4-frame finetuning stage, did you use both video and image data for joint pretraining (4 frames for WebVid-2M and 1 frame for CC3M)? I cannot find any experimental details for "joint image-video training" in the paper.

    Thanks in advance.

    opened by vateye 4
  • The provided Pillow package doesn't support WEBP images


    A warning is shown during training with CC3M data saying so. See: https://github.com/ContinuumIO/anaconda-issues/issues/10737 So I guess these images are going to be skipped during training?
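
    For anyone hitting this, you can check whether your Pillow build has WebP support with the snippet below; if it reports False, reinstalling Pillow from pip (rather than the conda default) usually brings WebP support in.

        from PIL import features

        # True if this Pillow build was compiled with libwebp support.
        print(features.check("webp"))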

    opened by bryant1410 4
  • Code/template for the demo?


    Awesome project and great work! I was wondering if the code for the video search demo is available or could be made available? It would be very nice to have, even just for debugging the process of fine-tuning your model on a different dataset.

    opened by scottfleming 3
  • About Curriculum Learning


    Thanks for your great work! Here are some questions about curriculum learning.

    When fine-tuning from 1 frame to 4 frames,

    • do we need to interpolate the temporal position embedding ([1, dim] => [4, dim])? In my opinion, an image is seen as a 1-frame video, so if the temporal position embedding is interpolated, how can we add it to the image?
    • should we use the same hyperparameters (e.g., learning rate, epoch, warmup)?
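
    On the first question, one common option (not necessarily what this repo does) is to interpolate or simply repeat the 1-frame temporal embedding along the time axis; a rough sketch:

        import torch.nn.functional as F

        def expand_temporal_embed(temporal_embed, num_frames):
            """temporal_embed: [1, old_frames, dim] -> [1, num_frames, dim].
            With old_frames == 1 this just repeats the single embedding; images
            can still be treated as 1-frame videos by adding only the first
            temporal position to them."""
            embed = temporal_embed.permute(0, 2, 1)               # [1, dim, old_frames]
            embed = F.interpolate(embed, size=num_frames, mode="linear", align_corners=False)
            return embed.permute(0, 2, 1)                         # [1, num_frames, dim]
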
    opened by Andy1621 3
  • CC3M data error


    The files downloaded from the given CC3M link do not match this code. Reading them with pandas raises an error: pandas.errors.ParserError: Error tokenizing data. C error: Expected 43 fields in line 23, saw 45

    Can you provide the correct version of the CC3M data?
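
    For context, the raw Conceptual Captions annotation files from Google are tab-separated (caption, URL). If the parser error comes from reading them with default pandas settings, something like the snippet below usually reads them cleanly (adjust the file name, and check what format this repo's dataloader actually expects):

        import csv
        import pandas as pd

        df = pd.read_csv(
            "Train_GCC-training.tsv",        # adjust to the file you downloaded
            sep="\t",
            names=["caption", "url"],
            quoting=csv.QUOTE_NONE,          # captions can contain stray quote characters
        )
        print(len(df))
        print(df.iloc[0])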

    opened by akira-l 3
  • How can I use the pretrain results?


    https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/models/cc-webvid2m-4f_stformer_b_16_224.pth.tar

    This is the pretrained weights file you listed on the website. But the checkpoint is a .tar, and when I extract it I just get a single file; I do not know how to use it as a normal .pth checkpoint.
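
    Note that .pth.tar is typically just a naming convention for files written by torch.save, so there is usually no need to extract it; something like the following should work (key names may differ, so inspect the loaded dict first):

        import torch

        ckpt = torch.load("cc-webvid2m-4f_stformer_b_16_224.pth.tar", map_location="cpu")
        print(ckpt.keys())                          # inspect what the file contains
        state_dict = ckpt.get("state_dict", ckpt)   # fall back to the dict itself
        # model.load_state_dict(state_dict)         # once the matching model is built,
        # or point "load_checkpoint" in a config at this file and let the code handle it.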

    opened by Qi-Zhangyang 3
  • Finetuning the pretrained model on MSR-VTT


    Hi,

    Thanks for your excellent work!

    When I finetune the pretrained model that you provide on MSR-VTT, there is a warning shown below:

    "Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']"

    Is it expected?

    Thanks! Yuying

    opened by geyuying 3
  • Config for the method trained on CC3M


    I was wondering if you could provide the config used in the paper to train only on CC3M. Is it exactly like the CC3M+WebVid one but with the WebVid part removed?

    opened by bryant1410 3
  • Result about MSVD


    Hi Bain, I found that the experimental setting for the MSVD result is not clear, and MCQ reports a different version, so I want to ask about the MSVD result and setting.

    opened by shufangxun 2
  • About the effects of sliding_window_stride


    Hi Bain, I see you mentioned in other issues that setting sliding_window_stride=12 when evaluating retrieval on MSR-VTT (finetuned) helps improve the performance. I tried this but didn't get the improvement.

    After finetuning with msrvtt_4f_i21k.json, the model is tested with the command presented in the README. The results are:

    [t2v_metrics] epoch 0, R@1: 28.9, R@5: 55.6, R@10 66.2, R@50 86.8MedR: 4, MeanR: 29.9                                                                                                                                                                  
    [v2t_metrics] epoch 0, R@1: 28.4, R@5: 56.5, R@10 66.2, R@50 88.1MedR: 4, MeanR: 25.6  
    

    After setting --sliding_window_stride=12 for test.py, the results are:

    [t2v_metrics] epoch 0, R@1: 28.8, R@5: 57.7, R@10 68.5, R@50 88.0MedR: 4, MeanR: 27.3
    [v2t_metrics] epoch 0, R@1: 30.0, R@5: 58.8, R@10 68.8, R@50 89.7MedR: 4, MeanR: 22.5
    

    It shows no obvious improvement in my test.

    In #41, sliding_window_stride indeed helps improve the evaluation performance. I don't know why it doesn't work here. I keep the code in test.py unchanged and only modify some code in base_dataset.py to fit my environment (i.e., lower versions of PyTorch and TorchVision due to limitations of the computing cluster). Besides, the version of ffmpeg on my cluster is old and hard to update. Could the difference in environments be the reason for the poor results?

    I just want to use one trained model with sound performance for some test-phase experiments (e.g., adversarial attacks), so I would like a finetuned Frozen-in-Time with R@1 over 30%, as the results in your paper show. However, I have failed to get such a model. :(

    The phenomenon seems weird and I will keep checking to try to reproduce the higher results. Besides, if possible, would you mind sharing a finetuned model?

    opened by xiangyh9988 0
  • Can you share some recordings of your experiments


    Can you share some records of your experiments, such as graphs in neptune.ai or other logs tracking the performance/loss changes over the training steps?

    I would like to compare the effects of some configurations (e.g., batch size) on training convergence in depth. Since this uses a contrastive loss over a similarity matrix, it may be affected by batch size and converge more slowly with a smaller batch size. Your experiments did not use very large batch sizes and may not have reached the best performance yet. I want to try something, haha~
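
    To make the batch-size argument concrete, here is a minimal sketch of an in-batch contrastive loss over a similarity matrix (illustrative, not the repo's exact loss): each sample is contrasted against the other B - 1 items in the batch, so a larger batch gives more negatives per step.

        import torch
        import torch.nn.functional as F

        def in_batch_contrastive_loss(video_embeds, text_embeds, temperature=0.05):
            """video_embeds, text_embeds: [B, dim] for B matching pairs."""
            v = F.normalize(video_embeds, dim=-1)
            t = F.normalize(text_embeds, dim=-1)
            sims = v @ t.t() / temperature                     # [B, B] similarity matrix
            labels = torch.arange(sims.size(0), device=sims.device)
            # symmetric cross-entropy: video->text and text->video directions
            return 0.5 * (F.cross_entropy(sims, labels) + F.cross_entropy(sims.t(), labels))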

    opened by KAndHisC 2
  • Off-by-one issues with the frame sampling


    I think there may be two off-by-one issues with the frame sampling. I'm not completely sure about them and would prefer to discuss first, which is why I'm not sending a patch.

    For the first one, this is the part of the code:

    https://github.com/m-bain/frozen-in-time/blob/542164b80eb339e4dba520daf1182460d2a3d5a9/base/base_dataset.py#L152-L155

    I think it should be:

    np.linspace(start=0, stop=vlen - 1, ...)
    

    (with a - 1)

    and:

    ranges.append((interv, intervals[idx + 1]))
    

    (without the - 1).

    Otherwise, the right part of each bucket is going to be ignored. For the uniform case, instead of taking the interval centroid ((a+b)/2), it takes (a+b-1)/2. This isn't a big deal though.

    For the second one, I think the random-choice interval end should be + 1. When it does random.choice(range(...)) (which, by the way, could be a random.randrange), the range excludes the stop value, so there's another - 1 hidden there.

    For example, in the training video "1013731484", which has only one frame according to Decord, the random case would give:

    intervals == [0, 1]
    ranges = [(0, 0)]
    random.choice(range(0, 0)) == random.choice([]) <- exception
    

    And it fails silently, assigning all frames to black. Note this one also isn't a big deal, as it'd only fail for a few videos; for the rest, all the intervals would just be shifted or something like that.
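
    To make the suggestion concrete, here is a rough sketch of the sampler with both proposed fixes applied (illustrative code, not the repo's exact implementation):

        import random
        import numpy as np

        def sample_frames(num_frames, vlen, sample="rand"):
            acc_samples = min(num_frames, vlen)
            # span the full [0, vlen - 1] index range and keep bucket right edges
            intervals = np.linspace(start=0, stop=vlen - 1, num=acc_samples + 1).astype(int)
            ranges = [(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
            if sample == "rand":
                # hi + 1 so the right edge can be picked; also avoids an empty
                # range (silent failure) for 1-frame videos
                frame_idxs = [random.randrange(lo, hi + 1) for lo, hi in ranges]
            else:  # uniform: take each bucket's centroid
                frame_idxs = [(lo + hi) // 2 for lo, hi in ranges]
            return frame_idxs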

    opened by bryant1410 0