A Joint Video and Image Encoder for End-to-End Retrieval

Overview

Frozen in Time ❄️

A Joint Video and Image Encoder for End-to-End Retrieval

(arXiv)

Repository containing the code, models, and data for end-to-end video-text retrieval.

Work in progress

Code is provided to train the end-to-end model on MSRVTT.

Set the path locations in configs/msrvtt_4f_i21k.json.

conda env create -f requirements/frozen.yml

python train.py --config configs/msrvtt_4f_i21k.json

TODO:

[x] conda env

[ ] msrvtt data zip

[ ] pretrained models

[ ] webvid data

[ ] Other benchmarks


Comments
  • Which results in paper correspond to the finetune command?

    Which results in paper correspond to the finetune command?

    I experimented with the finetuning procedure and ran the command python train.py --config configs/msrvtt_4f_i21k.json.

    I got:

    [v2t_metrics]MSRVTT epoch 27, R@1: 16.1, R@5: 40.5, R@10 55.0, R@50 81.9MedR: 8, MeanR: 40.6
        epoch          : 27
        loss_0         : 0.7913076955540566
        val_loss_0     : 1.5775871678950295
        val_0_t2v_metrics_R1: 17.8
        val_0_t2v_metrics_R5: 40.6
        val_0_t2v_metrics_R10: 55.1
        val_0_t2v_metrics_R50: 81.5
        val_0_t2v_metrics_MedR: 8.0
        val_0_t2v_metrics_MeanR: 39.94
        val_0_t2v_metrics_geometric_mean_R1-R5-R10: 34.14804760940716
        val_0_v2t_metrics_R1: 16.1
        val_0_v2t_metrics_R5: 40.5
        val_0_v2t_metrics_R10: 55.0
        val_0_v2t_metrics_R50: 81.9
        val_0_v2t_metrics_MedR: 8.0
        val_0_v2t_metrics_MeanR: 40.5555
        val_0_v2t_metrics_geometric_mean_R1-R5-R10: 32.9772570568898
    Validation performance didn't improve for 10 epochs. Training stops.
    

    There are two R@1 results. Which one corresponds to the results in the paper? I found the R@1 in Table 5 is 31.0, which seems far from this implementation.

    opened by akira-l 5
  • Test set of MSR-VTT for downstream evaluation

    Test set of MSR-VTT for downstream evaluation

    Hi,

    In the paper, it is described that 'Following other works [35], we train on 9K train+val videos and report results on the 1K-A test set'.

    However, in your provided code for text-to-video retrieval on MSR-VTT, it seems that the validation set and the test set are the same, named 'val_list_jsfusion.txt' with 1K data.

    The results of your released model on MSR-VTT test set (val_list_jsfusion.txt) are higher than that reported in the paper.

    Is 'val_list_jsfusion.txt' the test set for MSR-VTT evaluation?

    Looking forward to your reply.

    opened by geyuying 5
  • Curriculum Learning and Video-Image Joint Training

    Curriculum Learning and Video-Image Joint Training

    Hi,

    I have a question about the curriculum learning. For the 1-frame pretraining, both the CC3M and WebVid 2M datasets are used. But in the 4-frame finetuning stage, did you use both video and image data for joint training (4 frames for WebVid 2M and 1 frame for CC3M)? I cannot find any experimental details for "joint image-video training" in the paper.

    Thanks in advance.

    opened by vateye 4
  • The provided Pillow package doesn't support WEBP images

    The provided Pillow package doesn't support WEBP images

    A warning is shown during training with CC3M data saying so. See: https://github.com/ContinuumIO/anaconda-issues/issues/10737 So I guess these images are going to be skipped during training?
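
    If it helps, a quick way to check whether the installed Pillow build can decode WEBP is Pillow's feature-check API (a small sketch, not part of this repo's code):

    from PIL import features

    # Prints True if this Pillow build was compiled with WEBP support,
    # False if WEBP images will fail to open (and likely be skipped).
    print(features.check("webp"))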

    opened by bryant1410 4
  • Code/template for the demo?

    Code/template for the demo?

    Awesome project and great work! I was wondering if the code for the video search demo is available or could be made available? Would be very nice to have even just for debugging the process of fine tuning your model on a different dataset.

    opened by scottfleming 3
  • About Curriculum Learning

    About Curriculum Learning

    Thanks for your great work! Here are some questions about curriculum learning.

    When fine-tuning from 1 frame to 4 frames,

    • do we need to interpolate the temporal position embedding ([1, dim] => [4, dim])? In my opinion, an image is treated as a 1-frame video; if the temporal position embedding is interpolated, how can we add it to an image? (See the sketch after this list.)
    • should we use the same hyperparameters (e.g., learning rate, epochs, warmup)?
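
    For context, one common way to resize a learned temporal position embedding is linear interpolation along the time axis. A hedged sketch of that idea (an illustration only, not necessarily what the paper does; going from a single frame there is nothing to interpolate, so the 1-frame embedding is simply repeated or kept as the first position):

    import torch
    import torch.nn.functional as F

    def expand_temporal_pos_embed(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
        """Stretch a temporal position embedding [old_len, dim] -> [new_len, dim]."""
        old_len, dim = pos_embed.shape
        if old_len == 1:
            # A 1-frame embedding carries no temporal structure: repeat it,
            # so images can keep using the first (unchanged) position.
            return pos_embed.repeat(new_len, 1)
        pe = pos_embed.t().unsqueeze(0)                                  # [1, dim, old_len]
        pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=True)
        return pe.squeeze(0).t()                                         # [new_len, dim]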
    opened by Andy1621 3
  • CC3M data error

    CC3M data error

    The files downloaded from the link given for CC3M do not match this code. Reading them with pandas raises an error: pandas.errors.ParserError: Error tokenizing data. C error: Expected 43 fields in line 23, saw 45

    Can you provide the correct version of the CC3M data?
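
    One thing worth ruling out (a hedged guess, not a confirmed fix): the CC3M annotation file from Google is tab-separated and the captions contain commas, so reading it with pandas' default comma separator produces exactly this kind of "Expected N fields" error. A minimal sketch (the filename below is only an example):

    import pandas as pd

    # CC3M ships as a TSV of (caption, url); force the tab separator so commas
    # inside captions aren't treated as field delimiters.
    df = pd.read_csv("Train_GCC-training.tsv", sep="\t", header=None,
                     names=["caption", "url"])
    print(len(df), df.iloc[0]["caption"])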

    opened by akira-l 3
  • How can I use the pretrain results?

    How can I use the pretrain results?

    https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/models/cc-webvid2m-4f_stformer_b_16_224.pth.tar

    These are the pretrained weights listed on the website, but the checkpoint is a .tar file. When I unzip it, it is a single file, and I do not know how to use it as a normal .pth checkpoint.
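
    For what it's worth, .pth.tar is usually just a naming convention for a file written by torch.save, so it can be loaded directly without unpacking. A minimal sketch, assuming the file is a standard checkpoint dict with a 'state_dict' entry (an assumption, not confirmed from the repo):

    import torch

    ckpt = torch.load("cc-webvid2m-4f_stformer_b_16_224.pth.tar", map_location="cpu")
    print(ckpt.keys())                      # inspect what the checkpoint actually contains
    state_dict = ckpt["state_dict"]         # assumed key; adjust to whatever keys() shows
    # model.load_state_dict(state_dict)     # then load into the instantiated model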

    opened by Qi-Zhangyang 3
  • Finetuning the pretrained model on MSR-VTT

    Finetuning the pretrained model on MSR-VTT

    Hi,

    Thanks for your excellent work!

    When I finetune the pretrained model that you provide on MSR-VTT, there is a warning shown as below:

    "Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']"

    Is it expected?

    Thanks! Yuying

    opened by geyuying 3
  • Config for the method trained on CC3M

    Config for the method trained on CC3M

    I was wondering if you could provide the config used in the paper to train only on CC3M. Is it exactly like the CC3M+WebVid one, but with the WebVid part removed?

    opened by bryant1410 3
  • Result about MSVD

    Result about MSVD

    Hi Bain, I found that the experimental setting for the MSVD result is not clear, and MCQ reports a different version, so I want to ask about the result and setting for MSVD.

    opened by shufangxun 2
  • About the effects of sliding_window_stride

    About the effects of sliding_window_stride

    Hi Bain, I saw you mention in other issues that setting sliding_window_stride=12 when evaluating retrieval on MSR-VTT (finetuned) helps improve performance. I tried this but didn't get the improvement.

    After finetuning with msrvtt_4f_i21k.json, the model was tested with the command presented in the README. Results are:

    [t2v_metrics] epoch 0, R@1: 28.9, R@5: 55.6, R@10 66.2, R@50 86.8MedR: 4, MeanR: 29.9                                                                                                                                                                  
    [v2t_metrics] epoch 0, R@1: 28.4, R@5: 56.5, R@10 66.2, R@50 88.1MedR: 4, MeanR: 25.6  
    

    After setting --sliding_window_stride=12 for test.py, the results are:

    [t2v_metrics] epoch 0, R@1: 28.8, R@5: 57.7, R@10 68.5, R@50 88.0MedR: 4, MeanR: 27.3
    [v2t_metrics] epoch 0, R@1: 30.0, R@5: 58.8, R@10 68.8, R@50 89.7MedR: 4, MeanR: 22.5
    

    It shows no obvious improvement in my test.

    In #41, sliding_window_stride indeed helps improve the evaluation performance. I don't know why it doesn't work here. I kept the code in test.py unchanged and only modified some code in base_dataset.py to fit my environment (i.e., lower versions of PyTorch and TorchVision due to limitations of the computing cluster). Besides, the version of ffmpeg on my cluster is old and hard to update. Could the difference in environments be the reason for the poor results?

    I just want one trained model with sound performance for some test-time experiments (e.g., adversarial attacks), so I would like a finetuned Frozen-in-Time model with R@1 over 30%, as the results in your paper show. However, I failed to get such a model. :(

    The phenomenon seems weird and I will check further to try to reproduce the higher results. Besides, if possible, would you mind sharing a finetuned model?

    opened by xiangyh9988 0
  • Can you share some recordings of your experiments

    Can you share some recordings of your experiments

    Can you share some records of your experiments, like graphs in neptune.ai or other logs tracking the performance/loss changes over the training steps?

    I would like to compare the effects of some configurations (e.g., batch size) on training convergence in depth. I think this work uses a contrastive loss that depends on a similarity matrix, which may be affected by batch size and converge more slowly with smaller batches. Your experiments did not use large batch sizes, so the best performance may not have been reached yet. I think I want to try something, haha~
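
    For reference, the batch-size sensitivity presumably comes from the in-batch negatives: with a contrastive loss over a B x B video-text similarity matrix, each positive pair is contrasted against B - 1 negatives, so smaller batches give a weaker signal. A minimal sketch of such a loss (an illustration, not this repo's exact implementation):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(video_emb, text_emb, temperature=0.05):
        """Symmetric cross-entropy over the in-batch similarity matrix."""
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        sim = v @ t.t() / temperature                       # [B, B] similarity matrix
        targets = torch.arange(sim.size(0), device=sim.device)
        # Matched video-text pairs lie on the diagonal; score both retrieval directions.
        return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))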

    opened by KAndHisC 2
  • Off-by-one issues with the frame sampling

    Off-by-one issues with the frame sampling

    I think there may be two off-by-one issues with the frame sampling. I'm not so sure about it and prefer to discuss it first, which is why I'm not sending a patch.

    For the first one, this is the part of the code:

    https://github.com/m-bain/frozen-in-time/blob/542164b80eb339e4dba520daf1182460d2a3d5a9/base/base_dataset.py#L152-L155

    I think it should be:

    np.linspace(start=0, stop=vlen - 1, ...)
    

    (with a - 1)

    and:

    ranges.append((interv, intervals[idx + 1]))
    

    (without the - 1).

    Otherwise, the right part of each bucket is going to be ignored. For the uniform case, instead of taking the interval centroid ((a+b)/2), it takes (a+b-1)/2. This isn't a big deal though.

    For the second one, I think the random choice interval end should be + 1. When it does random.choice(range(...)) (which btw could be a random.randrange), the range excludes the stop value, so there's another - 1 hidden there.

    For example, in the training video "1013731484", which has only one frame according to Decord, for random it'd be:

    intervals == [0, 1]
    ranges = [(0, 0)]
    random.choice(range(0, 0)) == random.choice([]) <- exception
    

    And it fails silently, assigning all frames to black. Note this one also isn't a big deal as it'd fail with few videos, and with the rest, it'd have all the intervals shifted or something like that.
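
    Putting the two suggested fixes together, the sampling could look roughly like this (a sketch of the proposal above, not the repo's current code):

    import random
    import numpy as np

    def sample_frame_idxs(vlen, num_frames, sample="rand"):
        # Use vlen - 1 as the stop so the last frame index stays reachable.
        intervals = np.linspace(start=0, stop=vlen - 1, num=num_frames + 1).astype(int)
        ranges = [(intervals[i], intervals[i + 1]) for i in range(num_frames)]
        if sample == "rand":
            # randrange's stop is exclusive, so add 1 to include each bucket's right edge;
            # for a 1-frame video this gives randrange(0, 1) == 0 instead of an exception.
            return [random.randrange(a, b + 1) for a, b in ranges]
        # "uniform": take the centre of each bucket, (a + b) / 2 rather than (a + b - 1) / 2.
        return [(a + b) // 2 for a, b in ranges]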

    opened by bryant1410 0
Owner
PhD Student, VGG, Oxford