MAGMA - a GPT-style multimodal model that can understand any combination of images and language

Related tags

Deep Learning magma
Overview

MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Authors

repo (alphabetical)

Constantin (CoEich), Mayukh (Mayukhdeb), Sid (sdtblck)

paper

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Aleph Alpha

Letitia Parcalabescu, Anette Frank, Heidelberg University

Abstract

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state of the art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.

Paper on arXiv: https://arxiv.org/abs/2112.05253

Examples (via Aleph Alpha playground)

Photos
  • A man covering a woman's eyes to hide a present
  • A fallen tree is blocking a road

Text & Technical
  • A hand drawn treasure map
  • A software architecture

Model design

[Figure: MAGMA model design]

About the repository

In this repository we share the main parts of the codebase for training and running inference with our MAGMA VL model. The repo's main use is downloading our pretrained weights and interacting with the model. We also include a script for data-parallel training with DeepSpeed, which can be used to finetune our models or to train a MAGMA model from scratch.

Installation

Make sure PyTorch (version >= 1.9.0) and torchvision are installed; see https://pytorch.org/get-started/locally/.

You can pip install from the git repository with:

pip install git+https://github.com/Aleph-Alpha/magma.git

Make sure that you also download the config:

mkdir configs; wget -O configs/MAGMA_v1.yml https://raw.githubusercontent.com/Aleph-Alpha/magma/add-setup/configs/MAGMA_v1.yml

Or, if you've cloned the repo, you can install all further requirements with:

pip install -r requirements.txt

Checkpoint

We also publish the model checkpoint used for the publication. It is hosted on our infrastructure and downloads automatically; it can also be downloaded manually here: https://bit.ly/aleph_alpha_magma_download

This checkpoint can also be played around with on a space managed by Heath Mitchell, AK, and Stella Biderman. (This is a 3rd party space, not managed by Aleph Alpha.)

Loading a model for inference

The following downloads the checkpoint file into checkpoint_path if it is not already present:

from magma import Magma
from magma.image_input import ImageInput

model = Magma.from_checkpoint(
    config_path = "configs/MAGMA_v1.yml",
    checkpoint_path = "./mp_rank_00_model_states.pt",
    device = 'cuda:0'
)

inputs =[
    ## supports urls and path/to/image
    ImageInput('https://www.art-prints-on-demand.com/kunst/thomas_cole/woods_hi.jpg'),
    'Describe the painting:'
]

## returns a tensor of shape: (1, 149, 4096)
embeddings = model.preprocess_inputs(inputs)  

## returns a list of length embeddings.shape[0] (batch size)
output = model.generate(
    embeddings = embeddings,
    max_steps = 6,
    temperature = 0.7,
    top_k = 0,
)  

print(output[0]) ##  A cabin on a lake
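
Since MAGMA accepts arbitrary interleavings of images and text, the same interface can also be used for few-shot prompting. A sketch, assuming preprocess_inputs handles several ImageInput entries in one list (the image paths are placeholders):

few_shot_inputs = [
    ImageInput('path/to/first_example.jpg'),  ## hypothetical local image
    'Q: Where is this? A: Egypt',
    ImageInput('path/to/query_image.jpg'),    ## hypothetical local image
    'Q: Where is this? A:'
]

embeddings = model.preprocess_inputs(few_shot_inputs)
output = model.generate(embeddings = embeddings, max_steps = 6, temperature = 0.7, top_k = 0)
print(output[0])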

Converting datasets to our format

To convert an image-caption dataset to our dataset class magma.datasets.ImgCptDataset, we suggest:

from magma.datasets.convert_datasets import convert_dataset

def my_dataset_iterator():
    """
    Implement an iterator for your dataset that, for every datapoint, yields a tuple

        image_path, {"captions": [...], "metadata": {...}}

    where image_path is the path to the image as a Path object, captions is a list of
    caption strings and metadata is an optional field.
    """

if __name__ == "__main__":
    convert_dataset(data_dir="/target/directory", ds_iterator=my_dataset_iterator())
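
For illustration, here is a minimal iterator for a hypothetical layout in which an images/ folder holds the image files and a captions.json maps each file name to its list of captions (both names are assumptions, not part of the repo):

import json
from pathlib import Path

def my_dataset_iterator():
    data_root = Path("/path/to/raw_dataset")  ## hypothetical source directory
    with open(data_root / "captions.json") as f:
        captions = json.load(f)  ## {"0001.jpg": ["a caption", ...], ...}
    for file_name, caption_list in captions.items():
        image_path = data_root / "images" / file_name
        yield image_path, {"captions": caption_list, "metadata": {}}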

How to train MAGMA

Run the training with:

deepspeed train.py --config path_to_my_config

To continue training from a deepspeed checkpoint, provide the checkpoint directory in the "load" config parameter.

WARNING: By default, instantiating MAGMA via the init method instead of from_checkpoint loads the pretrained CLIP weights but not the pretrained GPT-J weights. For training MAGMA from scratch, download the GPT-J weights from this repo: https://github.com/finetuneanon/transformers and include them in the state dict after initializing the MAGMA model.
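
As a rough sketch of that last step (the Magma constructor call, the lm. key prefix, and the GPT-J checkpoint path are assumptions -- check magma/magma.py and the linked repo for the exact interfaces):

import torch
from magma import Magma

model = Magma("configs/MAGMA_v1.yml")  ## assumed constructor signature; loads CLIP but not GPT-J weights
gptj_sd = torch.load("path/to/gptj_weights.pt", map_location="cpu")  ## hypothetical local GPT-J state dict

merged = model.state_dict()
for name, param in gptj_sd.items():
    key = f"lm.{name}"  ## the language model appears to live under the lm. prefix (see the issues below)
    if key in merged and merged[key].shape == param.shape:
        merged[key] = param
model.load_state_dict(merged, strict=False)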

Comments
  • AssertionError: Parameter with name: lm.transformer.wte.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behaviour

    Hi,

    I'd like to rerun the code using gpt-neo125M or gpt2-med instead of gpt-neo2.7B, and I'm getting this error:

    AssertionError: Parameter with name: lm.transformer.wte.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behaviour.

    Any idea why this issue exists for other language models?

    opened by monajati 6
  • No module named 'magma.transformers'

    Hi,

    I just downloaded the "magma-master" archive and followed the instructions (I think), but when trying to run test.py I get errors. It seems some parts are missing?

    First I get:

    (magma) c:\Python\magma-master>python test.py
    Traceback (most recent call last):
      File "c:\Python\magma-master\test.py", line 4, in <module>
        from magma.language_model import get_language_model
    ImportError: cannot import name 'get_language_model' from 'magma.language_model' (c:\Python\magma-master\magma\language_model.py)

    Looking at the code, it seems like get_language_model is not used anywhere, so I commented line 4 out. But after that there is a similar miss:

    (magma) c:\Python\magma-master>python test.py
    Traceback (most recent call last):
      File "c:\Python\magma-master\test.py", line 25, in <module>
        from magma.transformers import GPTJForCausalLM
    ModuleNotFoundError: No module named 'magma.transformers'

    And here GPTJForCausalLM is used right in the next line. Looking at transformers.py, there is nothing like GPTJForCausalLM in there at all. It seems like something is missing here completely?

    Best Tuxius

    opened by Tuxius 6
  • Automatic model download doesn't work

    mp_rank_00_model_states.pt ends up containing a "Google Drive - Virus scan warning" HTML page instead of the checkpoint. Its visible text reads:

    Google Drive can't scan this file for viruses.
    mp_rank_00_model_states.pt (12G) is too large for Google to scan for viruses. Would you still like to download this file?
    [Download anyway]

    causing:

    Traceback (most recent call last):
      File "/home/ubuntu/magma/example_inference.py", line 4, in <module>
        model = Magma.from_checkpoint(
      File "/home/ubuntu/magma/magma/magma.py", line 292, in from_checkpoint
        sd = torch.load(checkpoint_path, map_location=torch.device("cpu"))
      File "/usr/local/share/miniconda/lib/python3.9/site-packages/torch/serialization.py", line 593, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/usr/local/share/miniconda/lib/python3.9/site-packages/torch/serialization.py", line 762, in _legacy_load
        magic_number = pickle_module.load(f, **pickle_load_args)
    _pickle.UnpicklingError: invalid load key, '<'.
    

    Possibly related to https://github.com/wkentaro/gdown/issues/26

    opened by Heath123 5
  • Mismatching LM shape between 50400 (pre-trained pt) and 50258 (gpt-2)

    Thanks for this wonderful work 😍
    I loaded mp_rank_00_model_states.pt, but it reports that the shape of the LM head is different:

    size mismatch for lm.lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50258, 4096])
    

    I guess it is because of the resize_token_embeddings here.
    I also tried truncating the extra vocabulary entries,

    sd["lm.lm_head.weight"] = sd["lm.lm_head.weight"][:50258, :]
    sd["lm.lm_head.bias"] = sd["lm.lm_head.bias"][:50258]
    

    but the result of example_inference.py seems weird 😂

    bondankeNM Drama fixtures Sergey
    Fantasticheddar AUTHOR hob sealedunction
    

    Super thanks for the help!
    opened by tsujuifu 3
  • Torch Size Mismatch

    Hey guys!

    I had a quick issue while loading Magma from the checkpoint, and I was wondering if anyone encountered or knows how to solve the problem.

    RuntimeError: Error(s) in loading state_dict for Magma: size mismatch for lm.lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50258, 4096]).

    It seems like the size of the checkpoint's LM head differs from the size the rest of the code expects.

    Thank you so much--this model looks super cool and I'm excited to use it!

    opened by harshagundala 1
  • Subsequent inference calls produce worse results

    Following the code in README.md or example_inference.py to perform inference by calling model.preprocess_inputs(…) followed by model.generate(…) produces good results the first time the pair is called, but poor results for subsequent pairs of calls.

    The reason is that model = Magma.from_checkpoint(…) loads the model with inconsistent training/eval settings. model.training is True but model.image_prefix.enc.training is False. The first call to model.preprocess_inputs(…) works correctly as the image encoder has training False and so its Batch Normalisation steps work correctly. The call to model.generate(…) records the training state on entry and restores it on exit, which because model.training is True puts the whole model into training state. Subsequent calls to model.preprocess_inputs(…) then don't perform Batch Normalisation steps correctly.

    The play space at https://huggingface.co/spaces/EleutherAI/magma has this problem too.

    The fix is to add model.eval() after model = Magma.from_checkpoint(…), setting the whole model to a consistent eval state.
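
    A minimal illustration of that fix, reusing the paths from the README example above:

    from magma import Magma

    model = Magma.from_checkpoint(
        config_path = "configs/MAGMA_v1.yml",
        checkpoint_path = "./mp_rank_00_model_states.pt",
        device = 'cuda:0'
    )
    model.eval()  ## put the image encoder and the LM into a consistent eval state before inference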

    opened by steve-barlow 1
  • size mismatch for lm.lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50258, 4096]).

    size mismatch for lm.lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50258, 4096]).

    This is very strange: I didn't change any code, yet the model and the config seem to mismatch. Has anyone run into the same problem?

    opened by UCCME 1
  • top_p argument is used like 1-top_p

    For example, top_p=0.999 gives you nearly deterministic sampling, not nearly on-distribution sampling.


    I was confused why I was getting much less diverse samples with top_p=0.95 than I got with top_p turned off.

    I found the cause in these lines:

    https://github.com/Aleph-Alpha/magma/blob/bfd5c8def6a290f98b7eae34da120756f708cd38/magma/sampling.py#L11-L14

    threshold is set to top_p here:

    https://github.com/Aleph-Alpha/magma/blob/bfd5c8def6a290f98b7eae34da120756f708cd38/magma/sampling.py#L101-L102

    Suppose, e.g., threshold is 0.95. Then 1 - threshold is 0.05.

    So we remove all tokens where the cumulative probs are > 0.05, which is most of the tokens -- we are really doing top-p sampling with top_p=0.05 (in the usual convention), not the intended top_p=0.95.
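
    For reference, here is a sketch of the conventional top-p (nucleus) filter, where top_p=0.95 keeps the smallest set of tokens whose cumulative probability exceeds 0.95 (illustrative only, not the repository's sampling code):

    import torch
    import torch.nn.functional as F

    def top_p_filter(logits: torch.Tensor, top_p: float = 0.95) -> torch.Tensor:
        ## sort logits descending and compute the cumulative probability mass
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        ## drop tokens once the cumulative mass exceeds top_p ...
        sorted_mask = cumulative_probs > top_p
        ## ... but shift the mask right so the first token crossing the threshold is kept
        sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
        sorted_mask[..., 0] = False
        mask = sorted_mask.scatter(-1, sorted_indices, sorted_mask)
        return logits.masked_fill(mask, float("-inf"))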

    opened by nostalgebraist 1
  • (#9) Improved inference interface

    Contains the following changes:

    1. The model, tokenizer, and transforms are now contained under a unified wrapper, Magma(), which can be used as shown below:
    from multimodal_fewshot import Magma
    
    magma = Magma(
        checkpoint_path = 'mp_rank_00_model_states.pt', ## downloads automatically if not present in this path
        config_path = 'configs/MAGMA_v1.yml',
    )
    magma.to('cuda:0')
    
    2. Image inputs are now handled by ImageInput(), which supports both urls and local image paths.
    inputs =[
    	ImageInput('https://www.art-prints-on-demand.com/kunst/thomas_cole/woods_hi.jpg'),
    	'Describe the painting:'
    ]
    
    3. Magma() supports both low-level and high-level inference
    ## forward pass
    embeddings = magma.preprocess_inputs(inputs = inputs) ## returns a torch tensor of shape (1, sequence_length, hidden_dim)
    outputs = magma(embeddings) ## output logits shape: torch.Size([1, 150, 50400])
    
    ## high level inference
    completion = magma.generate(inputs = inputs, num_tokens = 4, topk = 1) 
    ## completion: "A cabin on a lake"
    
    opened by Mayukhdeb 1
  • Improved inference interface

    Implement an interface like the one Mayukh suggested:

    from magma import Magma 
    from magma.image import Image, ImageFromURL  ## to easily load/use images
    
    model, tokenizer = Magma(checkpoint = 'model.pt', config = 'config.yml', device = 'cuda:0')
    
    inputs = [
        Image('path/to/image.jpg'),
        'Where is this ? A: Egypt',
        ImageFromURL('url/to/image.jpg'),
        'Where is this ? A:'
    ]
    
    embeddings = tokenizer.tokenize(inputs).to(model.device)
    
    output = model.forward(embeddings, output_attentions = True)
    
    logits = output.logits ## tensor of shape [1, len_seq, len_vocab]
    attentions = output.attentions ## list of tensors
    
    ## this already exists https://gitlab.aleph-alpha.de/research/multimodal_fewshot/-/blob/master/multimodal_fewshot/model.py#L442
    generated_text = model.generate(embeddings, n_steps = 10, *args)
    
    opened by CoEich 1
  • Remove dataset builders and old classes in multimodal_fewshot.datasets

    • Remove scripts that download the various datasets
    • Keep the ImgCptDataset base class and classification wrappers in dataset.py, remove "old" classes
    • Keep the convert_dataset function in convert_dataset.py (maybe slightly refactor)
    opened by CoEich 1
  • how did you calculate the bleu score

    Hi, thanks for the awesome project. I noticed that the reported BLEU@4 and CIDEr scores in Table 1 are ~10 and ~50 on the MS COCO dataset (zero-shot; after fine-tuning the scores increase to 31 and 90+), respectively, which fall far behind traditional baselines like AoA and CLIP-ViL (they usually achieve ~40 BLEU-4 and 120+ CIDEr). I am wondering whether the difference is due to the evaluation setup: did you use the evaluation in coco-caption or calculate the scores yourself?

    opened by TobiasLee 0
  • fix inference_step

    inference_step passes inference=True to model_engine. However, the forward method of the Magma model does not accept this parameter, which causes an error during training. I fixed it by simply copying the inference code from example_inference.py.

    opened by Fireblossom 3
Owner
Aleph Alpha GmbH