Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

Overview

merlot_reserve

MERLOT Reserve (in submission) is a model for learning joint representations of vision, language, and sound from YouTube. The learned model can be used in a zero-shot or finetuned setting, where it does well on tasks like VCR and TVQA.

Visit our project page at rowanzellers.com/merlotreserve or read the full paper to learn more.

What's here

We are releasing the following:

  • JAX code and model checkpoints for the MERLOT Reserve model
  • Code for pretraining the model
  • Code for finetuning the model on VCR and TVQA
  • Code for doing zero-shot inference with the model

Environment and setup

There are three different ways to run MERLOT Reserve:

  • Pretraining on videos: you'll need a TPU Pod VM for this. This step shouldn't be necessary for most people, as we have released model checkpoints.
  • Finetuning on VCR or TVQA: I've done this on a TPU v3-8 VM. This should be possible on GPU(s), but I haven't tested it on such hardware.
  • Zero-shot inference: I've run this on a GPU (even an older Titan X from 2016 works).

Installation on a GPU Machine

Install CUDA 11.4 (I used this link) and cuDNN 8.2. You might have to add something like this to your environment:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Create the environment:

conda create --name mreserve python=3.8 && conda activate mreserve
conda install -y python=3.8 tqdm numpy pyyaml scipy ipython cython typing h5py pandas matplotlib

# Install jax
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
# If doing this on TPUs instead of locally...
# pip install "jax[tpu]>=0.2.18" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# This is needed sometimes https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
pip uninstall -y numpy
pip install numpy==1.19.5

pip install -r requirements.txt
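
Before moving on, it can help to confirm that JAX actually sees your accelerator. This quick check is not part of the repo, just a suggestion:

# Sanity check (not part of the repo): confirm JAX can see the GPU/TPU.
import jax
print(jax.default_backend())  # expect 'gpu' (or 'tpu' on a TPU VM), not 'cpu'
print(jax.devices())          # list of visible accelerator devices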

You can then try out the interactive script at demo/demo_video.py. It will handle downloading the model checkpoint for you.
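
If you want to load a pretrained checkpoint in your own script instead, the snippet below is a minimal sketch of what demo/demo_video.py does; the grid_size value here is only an illustrative assumption, so check the demo for the setting that matches your checkpoint.

# Minimal sketch (see demo/demo_video.py for the real thing): load a pretrained
# checkpoint via PretrainedMerlotReserve, which fetches it from the public bucket.
from mreserve.modeling import PretrainedMerlotReserve

grid_size = (18, 32)  # assumed image grid size; use whatever demo/demo_video.py sets
model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)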

Installation on a Cloud TPU VM

See the instructions in pretrain/ to set up your environment on a TPU v3-8 VM.

Checkpoints

These should get auto-downloaded if you use PretrainedMerlotReserve in mreserve/modeling.py. All are flax checkpoint files:

# pretrained checkpoints
gs://merlotreserve/ckpts/base
gs://merlotreserve/ckpts/base_resadapt
gs://merlotreserve/ckpts/large
gs://merlotreserve/ckpts/large_resadapt

# finetuned checkpoints
gs://merlotreserve/vcr_ckpts/vcr_finetune_base
gs://merlotreserve/vcr_ckpts/vcr_finetune_large

gs://merlotreserve/tvqa_ckpts/tvqa_finetune_base
gs://merlotreserve/tvqa_ckpts/tvqa_finetune_large

# TVQA Data
gs://merlotreserve/finetune_data/tvqa/

# VCR data
gs://merlotreserve/finetune_data/vcr/
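
If you prefer to fetch a checkpoint manually (for example, to sidestep the Google credentials lookup that from_pretrained triggers), the rough sketch below uses an anonymous client against the public bucket; the blob path is taken from the list above, and anonymous read access is an assumption on my part.

# Rough sketch, not from the repo: download one flax checkpoint without
# application-default credentials, assuming the bucket allows anonymous reads.
import os
from google.cloud import storage

client = storage.Client.create_anonymous_client()
bucket = client.bucket('merlotreserve')

os.makedirs('ckpts', exist_ok=True)
bucket.blob('ckpts/base').download_to_filename('ckpts/base')  # path from the list above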

Comments
  • How could I get reference "txt.jsonl.zst", and what role does the "random text" play in the pretraining steps?

    Hi,

    Thanks for releasing your work.

    I'm currently trying to run your data/process.py code with my own crawled videos.

    Everything works well except text_iterator().

    I think it is because I couldn't make "txt.jsonl.zst", which is used as the random_text for the pretraining batch.

    So I was wondering if there is any reference code or sample data for making "txt.jsonl.zst" on my own?

    If that isn't possible, could you explain the role of "random_text" in the pretraining step? (I couldn't figure out how the "random text" aligns with the MERLOT Reserve pretraining objectives.)

    Thank you, Haena

    opened by Haena0320 3
  • Zero-shot classification without audio

    Thank you for the great repository.

    How can we run the model in a zero-shot setup without audio, though? Concretely, the function model.embed_video in demo_video.py requires the argument audio_clips.

    What can we do in order to not use the audio? Thank you!

    Best, Tomas

    opened by soCzech 2
  • Saving intermediate tensors in a jitted function

    Hi Rowan,

    I intend to save intermediate tensors (e.g. the embedding of layer 11 of the joint transformer) when fine-tuning on the TVQA dataset, so I can understand how the internal representations change over time. However, I cannot save the concrete values of the layers' representations because they are traced inside the jitted function (I got an error like "The numpy.ndarray conversion method __array__() was called on the JAX Tracer object").

    I was wondering if you already found a good solution for saving them when you designed your code. Thank you!

    Best, Dota

    opened by tianaidong 2
  • Still cannot get TVQA data

    Hi Rowan,

    I also could not open this link for the TVQA data: https://storage.googleapis.com/merlotreserve/finetune_data/tvqa. Could you provide more details on how to access the TVQA data used in the paper? Thank you!

    Best, Dota

    opened by tianaidong 2
  • Video Segment Count in Demo

    Hi Rowan,

    In the code, you state that it only supports at most 8 segments. I would like to learn how you handle these segment counts in your demo at https://merlot.apps.allenai.org/.

    Thank you for your time and attention,

    Mustafa

    opened by mustafaadogan 2
  • TVQA Dataset

    Hi Rowan,

    Thank you for this great resource! I'm trying to reproduce the finetuning results on TVQA. I can't seem to access the Google Storage link, though, and it looks like the TVQA dataset only gives access to video frames. Would you mind letting me know where you got the audio, or whether there's anything not included in this link (once I get access)? https://tvqa.cs.unc.edu/download_tvqa.html

    Best, Alex

    opened by abwilf 2
  • Download dataset

    Hi Rowan,

    Really nice work and thanks for sharing the code!

    In case I missed it, may I ask where the script to download all the YouTube videos is? I only found the processing script in the data/ folder.

    opened by xvjiarui 2
  • How can I download TVQA audio?

    Thank you for sharing this code.

    I am trying to finetune on TVQA.

    It seems that the audio is not available on the TVQA homepage.

    How can I download TVQA audio?

    opened by prote376 2
  • Could not automatically determine credentials.

    Hi,

    I tried to run the demo script but encountered the following error; it cannot download the model checkpoints.

    (mreserve) yueyang1@nlpgpu01:/nlp/data/yueyang/merlot_reserve/demo> CUDA_VISIBLE_DEVICES=1 python demo_video.py
    Traceback (most recent call last):
      File "demo_video.py", line 14, in <module>
        model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)
      File "/mnt/nlpgridio3/data/yueyang/merlot_reserve/demo/../mreserve/modeling.py", line 968, in from_pretrained
        storage_client = storage.Client()
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/storage/client.py", line 123, in __init__
        super(Client, self).__init__(
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 318, in __init__
        _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 266, in __init__
        project = self._determine_default(project)
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 285, in _determine_default
        return _determine_default_project(project)
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/_helpers.py", line 186, in _determine_default_project
        _, project = google.auth.default()
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/auth/_default.py", line 488, in default
        raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
    google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started 
    

    Hope to get a solution, thank you!

    Yue

    opened by YueYANG1996 2
  • Question about frozen embedding

    Hi,

    Thank you for your excellent work! I really want to ask you a question: I want to use your model only to encode video frames and the corresponding dialogue segments, and then design my own model on top. Do I just copy the SPAN_encr and vision_enc model code and download the checkpoint? :)

    Best, Jun

    opened by jun0wanan 1
  • IndexError: list index out of range

    Getting this error with demo_video.py when trying to read in the video with ID "pmjPjZZRhNQ.mp4", downloaded with youtube-dl. Using CUDA 11.6 with Python 3.8 in the mreserve conda environment.

    opened by devksingh4 1
  • Can MERLOT Reserve be applied to short videos?

    Hi,

    Thank you for your excellent work!

    I have noticed that you mention the limitations of the model in your paper: “Our model only learns from 40-second long videos”. So I wonder if this model can be applied to short video clips (like 5 seconds)? Is it feasible to reduce the time interval (5s) and number of video segments (16)?

    Best, Fan

    opened by ffcarina 0
  • Pretraining loss being negative

    Is it possible to get a negative loss for each task during pretraining? Also, can you share the pretraining log file (mainly the loss of each task, i.e. audio2text, audio_text_matching, etc.)?

    opened by snat1505027 1
  • Release infill templates

    Hello dear author,

    Could you please release the infilled questions, i.e. the questions transformed to statements with <|MASK|> using GPT-3? I would be especially interested in the statements for MSRVTT-QA and TVQA.

    It would be very helpful to release them, so other researchers don't have to run and pay for GPT-3 for the same task again.

    Thanks for your consideration,

    Simon

    opened by gingsi 1
  • request for training data examples

    Hello,

    I am trying to process a dataset for training using data/process.py. Can you please share some example inputs? For example, what is the format of the youtube_dump/{video_id}/{video_id}.v2.info.json.gz file (in function load_video(), line 212)?

    Thank you!

    opened by sherylm77 0
  • Relative Location of input for TVQA

    Hi, I have a question about the relative location for TVQA:

    t_start = midpoint - segment_size * 0.5
    t_end = midpoint + segment_size * 0.5

    # Try to extend by 3 segments in either direction of the middle
    times_used0 = [{'start_time': t_start, 'end_time': t_end}]
    for i in range(6):
        for delta in [-segment_size, segment_size]:
            t0 = t_start + delta * (i+1)
            t1 = t_end + delta * (i+1)
    
            t0 = round(t0 * 3) / 3
            t1 = round(t1 * 3) / 3
    
            if t1 < 0:
                continue
            if t0 > max_time:
                continue
            if len(times_used0) < 7:
                times_used0.append({'start_time': t0, 'end_time': t1})
    times_used0 = sorted(times_used0, key=lambda x: x['start_time'])
    
    # Figure out the relative position of the annotation
    my_duration = times_used0[-1]['end_time'] - times_used[0]['start_time']
    rel_localized_tstart = (ts0 - times_used[0]['start_time']) / my_duration
    rel_localized_tend = (ts1 - times_used[0]['start_time']) / my_duration
    qa_item['rel_localization'] = (rel_localized_tstart, rel_localized_tend)
    

    For the above code, I suspect that rel_localized_tstart could be greater than rel_localized_tend, since "midpoint - segment_size * 0.5" could be less than zero?

    Also, can rel_localized_tstart or rel_localized_tend be a negative number?

    opened by vateye 1
Owner
Rowan Zellers