MERLOT: Multimodal Neural Script Knowledge Models

Overview

MERLOT is a model for learning what we are calling "neural script knowledge" -- representations about what is going on in videos, spanning multiple video frames with associated captions.

Visit our project page at rowanzellers.com/merlot, or read the full paper to learn more.

What's here

We are releasing the following:

  • Code for the MERLOT model (in model/, with data processing in data/)
  • Code for running MERLOT over visual story ordering.

We plan to release:

  • Information about the videos used in this work
  • Code for adapting the model to other tasks (not strictly needed, but just to make things easier)

This is ongoing work -- we hope to make it easier to adapt MERLOT to other tasks, so please follow the repo if interested!

Environment and setup

There are three different ways of running MERLOT right now:

  • Pretraining on videos. This requires a TPU pod.
  • Finetuning on downstream tasks. We did this on TPU v3-8 machines. You can in theory do this on GPUs, but that isn't tested or officially supported right now.
  • Zero-shot visual story ordering. I have code for this on a TPU, but you should be able to do this on a GPU too.

To set up the environment:
conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas

# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5

pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0
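
As a quick sanity check (this snippet is not part of the repo, just a way to confirm the install), you can verify the TensorFlow version and whether a GPU is visible:

# verify the environment; is_gpu_available() returning False is expected on TPU hosts
import tensorflow as tf
print(tf.__version__)              # should print 1.15.5
print(tf.test.is_gpu_available())  # True if the GPU build can see a device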

Pretraining from scratch

This requires a large TPU pod for data-parallelism.

  • First, you'll need to get a bunch of training data in "tfrecord" format -- see data processing in data/ for that (a quick way to sanity-check the resulting tfrecords is sketched below). You'll then need to adjust the configuration of model/configs/merlot.yaml accordingly. You'll also need to add in your output path (where you want your newly pretrained model to be saved).
  • Next, in the model directory, run python train.py configs/merlot.yaml
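
Before kicking off a long pretraining run, it can be worth confirming that the tfrecords written by data/ are readable. A minimal sketch, assuming the TF 1.15 environment above; the shard path is a placeholder, not a real file in the repo:

import tensorflow as tf
tf.enable_eager_execution()  # TF 1.15: lets us iterate the dataset directly in Python

# Placeholder path -- substitute one of the shards produced by the data/ pipeline.
shard = 'gs://YOUR_BUCKET/train-00000-of-01024.tfrecord'
num_records = sum(1 for _ in tf.data.TFRecordDataset(shard))
print('records in shard:', num_records)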

Finetuning on downstream tasks

  • We used the configuration model/merlot.yaml and the checkpoint at gs://merlot/checkpoint_4segments/ for downstream task finetuning. This is slightly different from the checkpoint we used for story unshuffling (which we adapted to account for the 5 frame-caption segments in that task), but both should work.
  • Actual finetuning code is TBD -- you just create a MerlotModel (see model/modeling.py), set up your finetuning task (usually involving an additional output layer), and finetune; a rough sketch is shown below.
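
As a rough illustration only -- the constructor arguments and the get_pooled_output() method below are assumptions made for this sketch, not the repo's actual API (check model/modeling.py for the real MerlotModel interface) -- a classification head for finetuning might look something like this:

import tensorflow as tf
from model.modeling import MerlotModel

def build_finetuning_loss(features, labels, config, num_classes):
    # NOTE: the constructor signature and get_pooled_output() are assumed for
    # illustration; consult model/modeling.py for the actual MerlotModel API.
    model = MerlotModel(config=config,
                        is_training=True,
                        image_input=features['images'],
                        text_input=features['input_ids'])
    pooled = model.get_pooled_output()
    # new task-specific output layer on top of the pretrained encoder
    logits = tf.layers.dense(pooled, num_classes,
                             kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return loss, logits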

Bibtex

@article{zellersluhessel2021merlot,
    title={MERLOT: Multimodal Neural Script Knowledge Models},
    author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
    journal={arXiv preprint arXiv:2106.02636},
    year={2021}
}
Comments
  • Question about merlot model

    Dear Rowan,

    Hi, I noticed this paper recently and I think it is of great value. I understand nearly all the details of your paper except the model. I know the details are in the code, but I am not familiar with TensorFlow; if you can explain these points, the code will be much easier for me to follow, so I wonder if you could answer my questions when you have time?

    1. What does "chunk" mean in the code? Does it represent the maximum number of segments a video has been split into?
    2. In Section 3.2, you say that MERLOT takes multiple unordered video frames as input, but in the Joint Vision-Language Encoder part you say that position embeddings are added to the vision components. Do you mean that, when fed into the model, an image and its corresponding sentence share the same position embedding?
    3. In Section 3.3, the Temporal Reordering part, I understand the core idea, but I am not sure about the method: is it correct that you randomly choose i frames and then replace the position embeddings of those frames with the same embedding [image_unk_0]?

    Best regards, Zihao

    opened by ZihaoZheng98 4
  • [Question] Est. disk space to hold the pretraining dataset

    Hi,

    Congrats on the impressive work. I was just wondering whether you have a rough estimate of the disk space required to host the YT-Temporal-180M dataset? Sorry if I missed this information in the manuscript.

    Thanks.

    opened by dxli94 3
  • Fine-tuning on VCR dataset

    Thanks for your great work. Are you planning to release the code to fine-tune on the VCR task? I would appreciate it if you could release the code for data processing and data loading.

    opened by yanan1989 3
  • How to access the video dataset

    Thanks for your work. I was also wondering how I can access the video dataset. Could you kindly let me know how to access it, please?

    opened by Minji-Seo 2
  • Access to video dataset?

    Hi Rowan,

    Congrats on your work -- indeed a very interesting contribution. I was wondering what would be a way to get access to the video dataset that you've used in your experiments?

    Thanks, Alessandro

    opened by aleSuglia 2
  • YT-Temporal-180M video dataset

    Hi Rowanz,

    Thanks for your great work and contribution on MERLOT and the YT-Temporal-180M dataset! Will you release the YT-Temporal-180M video dataset? If possible, can you also provide us with the text annotations?

    Thanks

    opened by MrZihan 1
  • How to download the pretrained model

    Thank you for your work. I have a question about how to download the linked model (gs://merlot/checkpoint_4segments/) -- this doesn't seem to open in a browser.

    opened by Curry-AI 1
  • Access to Video Dataset

    Hi Rowanz,

    Thanks for your work and contribution. Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.

    I already emailed you, so please check your email!

    Thanks, Shinyeong

    opened by dneirfi 0
  • Question on fair comparison with Conceptual ∪ COCO

    Thanks for the great work. I have a question on fair comparison with Conceptual ∪ COCO.

    In the experiments on dataset source, you compared against the model trained on the Conceptual ∪ COCO datasets. For a fair comparison, you mentioned:

    for a fair comparison, we train for the same number of steps as 5 epochs on our dataset.

    However, 5 epochs means the model has seen all 180M segment-transcript pairs. As you've mentioned in the paper, there will then be fewer overfitting issues.

    I think the proper way should be to train your model on 3M segment-transcript pairs / 3M videos.

    opened by SCZwangxiao 0
  • Issue on the model scalability due to segment-level positional embeddings

    I notice that MERLOT adopts segment-level positional embeddings. However, there are only 16 segments during pre-training. For longer videos, e.g., movies, 16 segments are not enough to encode their information. Specifically, I have two questions:

     1. How do you extract features for extremely long videos like movies?
    2. How about using fixed positional embeddings instead of learned ones?
    opened by SCZwangxiao 0
  • Question on the definition of visually "ungrounded" categories

    I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper, you mentioned "video game commentaries" as an example.

    I wonder why it is not visually grounded. People's comments are usually related to the game. In my opinion, we could filter this category only because it is not realistic, which means it may not benefit downstream tasks.

    opened by SCZwangxiao 0
  • Code for preprocessing raw video data

    Hi, I can't find the code for preprocessing raw videos or the metadata for the raw videos. Could you please help me find them? By the way, it would be really nice if you could provide the crawler code for videos and captions. Thanks!

    opened by TZWwww 0
  • Is finetuned checkpoint on VCR available?

    Hi, it seems that this repo released the pretrained checkpoints. Is the finetuned checkpoint on the VCR task also available? I also wonder approximately how many hours and how much it cost to finetune for VCR using the current TPU setup. Thank you!

    opened by yrf1 0
  • Running finetuning on GPU

    Thanks for releasing your great work. I was wondering if there is a way to run the finetuning and zero-shot inference code on a GPU rather than a TPU? What kind of adjustments would I need to make? Thanks!

    opened by insundaycathy 1