MERLOT: Multimodal Neural Script Knowledge Models

Overview

MERLOT is a model for learning what we are calling "neural script knowledge" -- representations about what is going on in videos, spanning multiple video frames with associated captions.

Visit our project page at rowanzellers.com/merlot, or read the full paper to learn more.

What's here

We are releasing the following:

  • Code for the MERLOT model (in model/, with data processing in data/)
  • Code for running MERLOT over visual story ordering.

We plan to release:

  • Information about the videos used in this work
  • Code for adapting the model to other tasks (not strictly needed, but just to make things easier)

This is ongoing work -- we hope to make it easier to adapt MERLOT to other tasks, so please follow the repo if interested!

Environment and setup

There are three different ways of running MERLOT right now:

  • Pretraining on videos. This requires a TPU pod.
  • Finetuning on downstream tasks. We did this on TPU v3-8 machines. You can in theory do this on GPUs, but that isn't tested or officially supported right now.
  • Zero-shot visual story ordering. I have code for this on a TPU, but you should be able to do this on a GPU too.

To set up the environment:
conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas

# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5

pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0
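
As a quick sanity check (this snippet is not part of the repo, just a way to confirm the install), you can verify the TensorFlow version and whether a GPU is visible:

# verify the environment; is_gpu_available() returning False is expected on TPU hosts
import tensorflow as tf
print(tf.__version__)              # should print 1.15.5
print(tf.test.is_gpu_available())  # True if the GPU build can see a device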

Pretraining from scratch

This requires a large TPU pod for data-parallelism.

  • First, you'll need to get a bunch of training data in "tfrecord" format -- see data processing in data/ for that (a quick way to sanity-check the resulting tfrecords is sketched below). You'll then need to adjust the configuration of model/configs/merlot.yaml accordingly. You'll also need to add in your output path (where you want your newly pretrained model to be saved).
  • Next, in the model directory, run python train.py configs/merlot.yaml
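
Before kicking off a long pretraining run, it can be worth confirming that the tfrecords written by data/ are readable. A minimal sketch, assuming the TF 1.15 environment above; the shard path is a placeholder, not a real file in the repo:

import tensorflow as tf
tf.enable_eager_execution()  # TF 1.15: lets us iterate the dataset directly in Python

# Placeholder path -- substitute one of the shards produced by the data/ pipeline.
shard = 'gs://YOUR_BUCKET/train-00000-of-01024.tfrecord'
num_records = sum(1 for _ in tf.data.TFRecordDataset(shard))
print('records in shard:', num_records)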

Finetuning on downstream tasks

  • We used the configuration model/merlot.yaml and the checkpoint at gs://merlot/checkpoint_4segments/ for downstream task finetuning. This is slightly different from the checkpoint we used for story unshuffling (which we adapted to account for the 5 frame-caption segments in that task), but both should work.
  • Actual finetuning code is TBD -- you just create a MerlotModel (see model/modeling.py), set up your finetuning task (usually involving an additional output layer), and finetune; a rough sketch is shown below.
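
As a rough illustration only -- the constructor arguments and the get_pooled_output() method below are assumptions made for this sketch, not the repo's actual API (check model/modeling.py for the real MerlotModel interface) -- a classification head for finetuning might look something like this:

import tensorflow as tf
from model.modeling import MerlotModel

def build_finetuning_loss(features, labels, config, num_classes):
    # NOTE: the constructor signature and get_pooled_output() are assumed for
    # illustration; consult model/modeling.py for the actual MerlotModel API.
    model = MerlotModel(config=config,
                        is_training=True,
                        image_input=features['images'],
                        text_input=features['input_ids'])
    pooled = model.get_pooled_output()
    # new task-specific output layer on top of the pretrained encoder
    logits = tf.layers.dense(pooled, num_classes,
                             kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return loss, logits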

Bibtex

@article{zellersluhessel2021merlot,
    title={MERLOT: Multimodal Neural Script Knowledge Models},
    author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
    journal={arXiv preprint arXiv:2106.02636},
    year={2021}
}
Comments
  • Question about merlot model

    Dear Rowan,

    Hi, I noticed this paper recently and I think it is of great value. I understand nearly all the details of your paper except the model. I know the details are in the code, but I am not familiar with TensorFlow; if you can explain these points, the code will be much easier for me to follow, so I wonder if you could answer my questions when you have time?

    1. What does "chunk" mean in the code? Does it represent the maximum number of segments a video has been split into?
    2. In Section 3.2, you say that MERLOT takes multiple unordered video frames as input, but in the Joint Vision-Language Encoder part you say that position embeddings are added to the vision components. Do you mean that, when fed into the model, an image and its corresponding sentence share the same position embedding?
    3. In Section 3.3, the Temporal Reordering part, I understand the core idea, but I am not sure about the method: is it correct that you randomly choose i frames and then replace the position embeddings of those frames with the same embedding [image_unk_0]?

    Best regards, Zihao

    opened by ZihaoZheng98 4
  • [Question] Est. disk space to hold the pretraining dataset

    Hi,

    Congrats on the impressive work. I was just wondering whether you have a rough estimate of the disk space required to host the YT-Temporal-180M dataset? Sorry if I missed this information in the manuscript.

    Thanks.

    opened by dxli94 3
  • Fine-tuning on VCR dataset

    Thanks for your great work. Are you planning to release the code to fine-tune on the VCR task? I would appreciate it if you could release the code for data processing and data loading.

    opened by yanan1989 3
  • How to access the video dataset

    Thanks for your work. I was also wondering how I can access the video dataset. Could you kindly let me know how to access it, please?

    opened by Minji-Seo 2
  • Access to video dataset?

    Hi Rowan,

    Congrats on your work -- indeed a very interesting contribution. I was wondering what would be a way to get access to the video dataset that you've used in your experiments?

    Thanks, Alessandro

    opened by aleSuglia 2
  • YT-Temporal-180M video dataset

    Hi Rowanz,

    Thanks for your great work and contribution on MERLOT and the YT-Temporal-180M dataset! Will you release the YT-Temporal-180M video dataset? If possible, can you also provide us with the text annotations?

    Thanks

    opened by MrZihan 1
  • How to download the pretrained model

    Thank you for your work. I have a question about how to download the linked model (gs://merlot/checkpoint_4segments/) -- this doesn't seem to open in a browser.

    opened by Curry-AI 1
  • Access to Video Dataset

    Hi Rowanz,

    Thanks for your work and contribution. Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.

    I already emailed you, so please check your email!

    Thanks, Shinyeong

    opened by dneirfi 0
  • Question on fair comparison with Conceptual ∪ COCO

    Thanks for the great work. I have a question on fair comparison with Conceptual ∪ COCO.

    In the experiments on dataset source, you compared against the model trained on the Conceptual ∪ COCO datasets. For a fair comparison, you mentioned:

    for a fair comparison, we train for the same number of steps as 5 epochs on our dataset.

    However, 5 epochs means the model has seen all 180M segment-transcript pairs. As you've mentioned in the paper, there will then be fewer overfitting issues.

    I think the proper way should be to train your model on 3M segment-transcript pairs / 3M videos.

    opened by SCZwangxiao 0
  • Issue on the model scalability due to segment-level positional embeddings

    I notice that MERLOT adopts segment-level positional embeddings. However, there are only 16 segments during pre-training. For longer videos, e.g., movies, 16 segments are not enough to encode their information. Specifically, I have two questions:

     1. How do you extract features for extremely long videos like movies?
    2. How about using fixed positional embeddings instead of learned ones?
    opened by SCZwangxiao 0
  • Question on the definition of visually "ungrounded" categories

    I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper, you mentioned "video game commentaries" as an example.

    I wonder why it is not visually grounded. People's comments are usually related to the game. In my opinion, we could filter this category only because it is not realistic, which means it may not benefit downstream tasks.

    opened by SCZwangxiao 0
  • Code for preprocessing raw video data

    Hi, I can't find the code for preprocessing raw videos or the metadata for the raw videos. Could you please help me find them? By the way, it would be really nice if you could provide the crawler code for videos and captions. Thanks!

    opened by TZWwww 0
  • Is finetuned checkpoint on VCR available?

    Hi, it seems that this repo released the pretrained checkpoints. Is the finetuned checkpoint on the VCR task also available? I also wonder approximately how many hours and how much it cost to finetune for VCR using the current TPU setup. Thank you!

    opened by yrf1 0
  • Running finetuning on GPU

    Thanks for releasing your great work. I was wondering if there is a way to run the finetuning and zero-shot inference code on a GPU rather than a TPU? What kind of adjustments would I need to make? Thanks!

    opened by insundaycathy 1