Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

Overview

merlot_reserve

MERLOT Reserve (in submission) is a model for learning joint representations of vision, language, and sound from YouTube. The learned model can be used in a zero-shot or finetuned setting, where it does well on tasks like VCR and TVQA.

Visit our project page at rowanzellers.com/merlotreserve or read the full paper to learn more.

What's here

We are releasing the following:

  • JAX code and model checkpoints for the MERLOT Reserve model
  • Code for pretraining the model
  • Code for finetuning the model on VCR and TVQA
  • Code for doing zero-shot inference with the model

Environment and setup

There are three different ways to run MERLOT Reserve:

  • Pretraining on videos: you'll need a TPU Pod VM for this. This step shouldn't be necessary for most people, as we have released model checkpoints.
  • Finetuning on VCR or TVQA: I've done this on a TPU v3-8 VM. This should be possible on GPU(s), but I haven't tested it on such hardware.
  • Zero-shot inference: I've run this on a GPU (even an older Titan X from 2016 works).

Installation on a GPU Machine

Install CUDA 11.4 (I used this link) and cuDNN 8.2. You might have to add something like this to your environment:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Create the environment:

conda create --name mreserve python=3.8 && conda activate mreserve
conda install -y python=3.8 tqdm numpy pyyaml scipy ipython cython typing h5py pandas matplotlib

# Install jax
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
# If doing this on TPUs instead of locally...
# pip install "jax[tpu]>=0.2.18" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# This is needed sometimes https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
pip uninstall -y numpy
pip install numpy==1.19.5

pip install -r requirements.txt
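
Before moving on, it can help to confirm that JAX actually sees your accelerator. This quick check is not part of the repo, just a suggestion:

# Sanity check (not part of the repo): confirm JAX can see the GPU/TPU.
import jax
print(jax.default_backend())  # expect 'gpu' (or 'tpu' on a TPU VM), not 'cpu'
print(jax.devices())          # list of visible accelerator devices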

You can then try out the interactive script at demo/demo_video.py. It will handle downloading the model checkpoint for you.
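
If you want to load a pretrained checkpoint in your own script instead, the snippet below is a minimal sketch of what demo/demo_video.py does; the grid_size value here is only an illustrative assumption, so check the demo for the setting that matches your checkpoint.

# Minimal sketch (see demo/demo_video.py for the real thing): load a pretrained
# checkpoint via PretrainedMerlotReserve, which fetches it from the public bucket.
from mreserve.modeling import PretrainedMerlotReserve

grid_size = (18, 32)  # assumed image grid size; use whatever demo/demo_video.py sets
model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)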

Installation on a Cloud TPU VM

See the instructions in pretrain/ to set up your environment on a TPU v3-8 VM.

Checkpoints

These should get auto-downloaded if you use PretrainedMerlotReserve in mreserve/modeling.py. All are flax checkpoint files:

# pretrained checkpoints
gs://merlotreserve/ckpts/base
gs://merlotreserve/ckpts/base_resadapt
gs://merlotreserve/ckpts/large
gs://merlotreserve/ckpts/large_resadapt

# finetuned checkpoints
gs://merlotreserve/vcr_ckpts/vcr_finetune_base
gs://merlotreserve/vcr_ckpts/vcr_finetune_large

gs://merlotreserve/tvqa_ckpts/tvqa_finetune_base
gs://merlotreserve/tvqa_ckpts/tvqa_finetune_large

# TVQA Data
gs://merlotreserve/finetune_data/tvqa/

# VCR data
gs://merlotreserve/finetune_data/vcr/
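
If you prefer to fetch a checkpoint manually (for example, to sidestep the Google credentials lookup that from_pretrained triggers), the rough sketch below uses an anonymous client against the public bucket; the blob path is taken from the list above, and anonymous read access is an assumption on my part.

# Rough sketch, not from the repo: download one flax checkpoint without
# application-default credentials, assuming the bucket allows anonymous reads.
import os
from google.cloud import storage

client = storage.Client.create_anonymous_client()
bucket = client.bucket('merlotreserve')

os.makedirs('ckpts', exist_ok=True)
bucket.blob('ckpts/base').download_to_filename('ckpts/base')  # path from the list above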

Comments
  • How could I get reference "txt.jsonl.zst", and what role does the "random text" play in the pretraining steps?

    Hi,

    Thanks for releasing your work.

    I'm currently trying to run your data/process.py code with my own crawled videos.

    Everything works well except text_iterator().

    I think it is because I couldn't make "txt.jsonl.zst", which is used as the random_text for the pretraining batch.

    So I was wondering if there is any reference code or sample data for making "txt.jsonl.zst" on my own?

    If that isn't possible, could you explain the role of "random_text" in the pretraining step? (I couldn't figure out how the "random text" aligns with the MERLOT Reserve pretraining objectives.)

    Thank you, Haena

    opened by Haena0320 3
  • Zero-shot classification without audio

    Thank you for the great repository.

    How can we run the model in a zero-shot setup without audio, though? Concretely, the function model.embed_video in demo_video.py requires the argument audio_clips.

    What can we do in order to not use the audio? Thank you!

    Best, Tomas

    opened by soCzech 2
  • Saving intermediate tensors in a jitted function

    Hi Rowan,

    I intend to save intermediate tensors (e.g. the embedding of layer 11 of the joint transformer) when fine-tuning on the TVQA dataset, so I can understand how the internal representations change over time. However, I cannot save the concrete values of the layers' representations because they are traced inside the jitted function (I got an error like "The numpy.ndarray conversion method __array__() was called on the JAX Tracer object").

    I was wondering if you already found a good solution for saving them when you designed your code. Thank you!

    Best, Dota

    opened by tianaidong 2
  • Still cannot get TVQA data

    Hi Rowan,

    I also could not open this link for the TVQA data: https://storage.googleapis.com/merlotreserve/finetune_data/tvqa. Could you provide more details on how to access the TVQA data used in the paper? Thank you!

    Best, Dota

    opened by tianaidong 2
  • Video Segment Count in Demo

    Hi Rowan,

    In the code, you state that it only supports at most 8 segments. I would like to learn how you handle these segment counts in your demo at https://merlot.apps.allenai.org/.

    Thank you for your time and attention,

    Mustafa

    opened by mustafaadogan 2
  • TVQA Dataset

    Hi Rowan,

    Thank you for this great resource! I'm trying to reproduce the finetuning results on TVQA. I can't seem to access the Google Storage link, though, and it looks like the TVQA dataset only gives access to video frames. Would you mind letting me know where you got the audio, or whether there's anything not included in this link (once I get access)? https://tvqa.cs.unc.edu/download_tvqa.html

    Best, Alex

    opened by abwilf 2
  • Download dataset

    Hi Rowan,

    Really nice work and thanks for sharing the code!

    In case I missed it, may I ask where the script to download all the YouTube videos is? I only found the processing script in the data/ folder.

    opened by xvjiarui 2
  • How can I download TVQA audio?

    Thank you for sharing this code.

    I am trying to finetune on TVQA.

    It seems that the audio is not available on the TVQA homepage.

    How can I download TVQA audio?

    opened by prote376 2
  • Could not automatically determine credentials.

    Hi,

    I tried to run the demo script but encountered the following error; it cannot download the model checkpoints.

    (mreserve) yueyang1@nlpgpu01:/nlp/data/yueyang/merlot_reserve/demo> CUDA_VISIBLE_DEVICES=1 python demo_video.py
    Traceback (most recent call last):
      File "demo_video.py", line 14, in <module>
        model = PretrainedMerlotReserve.from_pretrained(model_name='large', image_grid_size=grid_size)
      File "/mnt/nlpgridio3/data/yueyang/merlot_reserve/demo/../mreserve/modeling.py", line 968, in from_pretrained
        storage_client = storage.Client()
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/storage/client.py", line 123, in __init__
        super(Client, self).__init__(
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 318, in __init__
        _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 266, in __init__
        project = self._determine_default(project)
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/client.py", line 285, in _determine_default
        return _determine_default_project(project)
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/cloud/_helpers.py", line 186, in _determine_default_project
        _, project = google.auth.default()
      File "/nlp/data/yueyang/miniconda3/miniconda3/envs/mreserve/lib/python3.8/site-packages/google/auth/_default.py", line 488, in default
        raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
    google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started 
    

    Hope to get a solution, thank you!

    Yue

    opened by YueYANG1996 2
  • Question about frozen embedding

    Hi,

    Thank you for your excellent work! I really want to ask you a question: I want to use your model only to encode video frames and the corresponding dialogue segments, and then design my own model on top. Do I just copy the SPAN_encr and vision_enc model code and download the checkpoint? :)

    Best, Jun

    opened by jun0wanan 1
  • IndexError: list index out of range

    Getting this error with demo_video.py when trying to read in the video with ID "pmjPjZZRhNQ.mp4", downloaded with youtube-dl. Using CUDA 11.6 with Python 3.8 in the mreserve conda environment.

    opened by devksingh4 1
  • Can MERLOT Reserve be applied to short videos?

    Hi,

    Thank you for your excellent work!

    I have noticed that you mention the limitations of the model in your paper: “Our model only learns from 40-second long videos”. So I wonder if this model can be applied to short video clips (like 5 seconds)? Is it feasible to reduce the time interval (5s) and number of video segments (16)?

    Best, Fan

    opened by ffcarina 0
  • Pretraining loss being negative

    Is it possible to get a negative loss for each task during pretraining? Also, can you share the pretraining log file (mainly the loss of each task, i.e. audio2text, audio_text_matching, etc.)?

    opened by snat1505027 1
  • Release infill templates

    Hello dear author,

    Could you please release the infilled questions, i.e. the questions transformed to statements with <|MASK|> using GPT-3? I would be especially interested in the statements for MSRVTT-QA and TVQA.

    It would be very helpful to release them, so other researchers don't have to run and pay for GPT-3 for the same task again.

    Thanks for your consideration,

    Simon

    opened by gingsi 1
  • request for training data examples

    Hello,

    I am trying to process a dataset for training using data/process.py. Can you please share some example inputs? For example, what is the format of the youtube_dump/{video_id}/{video_id}.v2.info.json.gz file (in function load_video(), line 212)?

    Thank you!

    opened by sherylm77 0
  • Relative Location of input for TVQA

    Hi, I have a question about the relative location for TVQA:

    t_start = midpoint - segment_size * 0.5
    t_end = midpoint + segment_size * 0.5

    # Try to extend by 3 segments in either direction of the middle
    times_used0 = [{'start_time': t_start, 'end_time': t_end}]
    for i in range(6):
        for delta in [-segment_size, segment_size]:
            t0 = t_start + delta * (i+1)
            t1 = t_end + delta * (i+1)
    
            t0 = round(t0 * 3) / 3
            t1 = round(t1 * 3) / 3
    
            if t1 < 0:
                continue
            if t0 > max_time:
                continue
            if len(times_used0) < 7:
                times_used0.append({'start_time': t0, 'end_time': t1})
    times_used0 = sorted(times_used0, key=lambda x: x['start_time'])
    
    # Figure out the relative position of the annotation
    my_duration = times_used0[-1]['end_time'] - times_used[0]['start_time']
    rel_localized_tstart = (ts0 - times_used[0]['start_time']) / my_duration
    rel_localized_tend = (ts1 - times_used[0]['start_time']) / my_duration
    qa_item['rel_localization'] = (rel_localized_tstart, rel_localized_tend)
    

    For the above code, I suspect that rel_localized_tstart could be greater than rel_localized_tend, since "midpoint - segment_size * 0.5" could be less than zero?

    Also, can rel_localized_tstart or rel_localized_tend be a negative number?

    opened by vateye 1
Owner
Rowan Zellers