CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Andreas Fürst*¹, Elisabeth Rumetshofer*¹, Viet Tran¹, Hubert Ramsauer¹, Fei Tang³, Johannes Lehner¹, David Kreil², Michael Kopp², Günter Klambauer¹, Angela Bitto-Nemling¹, Sepp Hochreiter¹,²

¹ ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
² Institute of Advanced Research in Artificial Intelligence (IARAI)
³ HERE Technologies
* Equal contribution

A detailed blog post on this paper is available at this link.

The full paper is available here.

Implementation of CLOOB

This repository contains the implementation of CLOOB used to obtain the results reported in the paper. The implementation is based on OpenCLIP, an open source implementation of OpenAI's CLIP.


We provide an 'environment.yml' file to set up a conda environment with all required packages. Run the following commands to clone the repository and create the environment.

# Clone repository and switch into the directory
git clone
cd cloob

# Create the environment and activate it
conda env create --file environment.yml
conda activate cloob

# Additionally, webdataset needs to be installed from its Git repository for pre-training on YFCC
pip install git+

# Add the directory to the PYTHONPATH environment variable


For pre-training we use the two datasets supported by OpenCLIP, namely Conceptual Captions and YFCC.

Conceptual Captions

OpenCLIP already provides a script to download and prepare the Conceptual Captions dataset, which contains 2.89M training images and 13k validation images. First, download the Conceptual Captions URLs and then run the script

python3 src/data/ path/to/Train_GCC-training.tsv path/to/Validation_GCC-1.1.0-Validation.tsv
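
To make the input of this step more concrete, the sketch below parses a few rows in the layout we assume for the Conceptual Captions TSV files: one caption and one image URL per line, separated by a tab, with no header row. The function name `read_cc_tsv` and the sample rows are made up for illustration and are not part of the repository.

```python
import csv
import io


def read_cc_tsv(handle):
    """Yield (caption, url) pairs from a tab-separated file-like object.

    Assumes two columns per line (caption, image URL) and no header.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip lines that do not have exactly two columns
        caption, url = row
        yield caption.strip(), url.strip()


# Made-up sample rows standing in for the real TSV file.
sample = io.StringIO(
    "a dog running on the beach\thttp://example.com/dog.jpg\n"
    "broken line without url\n"
    "sunset over the mountains\thttp://example.com/sunset.jpg\n"
)
pairs = list(read_cc_tsv(sample))
print(pairs)
```

For a real file, `open(path, newline="", encoding="utf-8")` can be passed in place of the `io.StringIO` object.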


We use the same subset of ~15M images from the YFCC100M dataset as CLIP. OpenAI provides a list of (line number, photo identifier, photo hash) entries, one for each image contained in this subset, here.

For more information see YFCC100m Subset on OpenAI's GitHub.
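
As a rough illustration of how such an index could be used, the sketch below keeps only those metadata rows whose photo identifier appears in the subset list. The tab separator, the position of the identifier column in the metadata, the function names, and the sample rows are all assumptions made for this example.

```python
def load_subset_ids(index_lines):
    """Collect photo identifiers from 'line number<TAB>photo id<TAB>hash' rows."""
    ids = set()
    for line in index_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            ids.add(parts[1])
    return ids


def filter_metadata(metadata_lines, subset_ids, id_column=0):
    """Yield the metadata rows whose identifier is part of the subset."""
    for line in metadata_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[id_column] in subset_ids:
            yield fields


# Made-up index and metadata rows; the real files are much larger.
index = ["0\t111\thashA\n", "7\t333\thashC\n"]
metadata = [
    "111\thttp://example.com/a.jpg\n",
    "222\thttp://example.com/b.jpg\n",
    "333\thttp://example.com/c.jpg\n",
]
kept = list(filter_metadata(metadata, load_subset_ids(index)))
print(kept)
```

Both functions accept any iterable of lines, so open file handles can be streamed without loading the full metadata into memory.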

Downstream Tasks

In the paper we report results on several downstream tasks. Except for ImageNet we provide links to already pre-processed versions (where necessary) of the respective test set.

| Dataset | Description | Official | Processed |
|---|---|---|---|
| Birdsnap | This dataset contains images of North American bird species. However, our dataset is smaller than reported in CLIP as some samples are no longer available. | Link | Link |
| Country211 | This dataset was published in CLIP and is a small subset of the YFCC100m dataset. It consists of photos that can be assigned to 211 countries via GPS coordinates. For each country 200 photos are sampled for the training set and 100 for testing. | Link | Link |
| Flowers102 | Images of 102 flower categories commonly occurring in the United Kingdom were collected. Several classes are very similar and there is a large variation in scale, pose and lighting. | Link | Link |
| GTSRB | This dataset was released for a challenge held at the IJCNN 2011. The dataset contains images of German traffic signs from more than 40 classes. | Link | Link |
| Stanford Cars | This dataset contains images of 196 car models at the level of make, model and year (e.g. Tesla Model S Sedan 2012). | Link | Link |
| UCF101 | The dataset has been created by extracting the middle frame from each video. | Link | Link |
| ImageNet | This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images. | Link | - |
| ImageNet v2 | The ImageNetV2 dataset contains new test data for the ImageNet benchmark. | Link | - |


The following example command pre-trains on CC with an effective batch size of 512 when run on 4 GPUs (a per-GPU batch size of 128).

python -u src/training/ \
--train-data=" /conceptual_captions/Train-GCC-training_output.csv" \
--val-data=" /conceptual_captions/Validation_GCC-1.1.0-Validation_output.csv" \
--path-data=" /conceptual_captions" \
--imagenet-val=" /imagenet/val" \
--warmup 20000 \
--batch-size=128 \
--lr=1e-3 \
--wd=0.1 \
--lr-scheduler="cosine-restarts" \
--restart-cycles=10 \
--epochs=70 \
--method="cloob" \
--init-inv-tau=30 \
--init-scale-hopfield=8 \
--workers=8 \
--model="RN50" \
--dist-url="tcp://" \
--batch-size-eval=512

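For readers unfamiliar with the InfoLOOB objective named in the title, the following sketch contrasts it with InfoNCE on a tiny similarity matrix. It is an illustration under simplifying assumptions, not the repository's implementation: the Hopfield retrieval step is omitted, the function names are hypothetical, and the similarity values are made up. In this sketch the only difference between the two losses is that InfoLOOB leaves the positive pair out of the denominator. The `inv_tau` argument plays the role we assume for the inverse temperature initialized by `--init-inv-tau`.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce(sim, inv_tau):
    """InfoNCE over rows: the positive (diagonal) stays in the denominator."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [inv_tau * s for s in row]
        total += _log_sum_exp(logits) - logits[i]
    return total / len(sim)


def info_loob(sim, inv_tau):
    """InfoLOOB over rows: the positive is left out of the denominator."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [inv_tau * s for s in row]
        negatives = [l for j, l in enumerate(logits) if j != i]
        total += _log_sum_exp(negatives) - logits[i]
    return total / len(sim)


def symmetric(loss_fn, sim, inv_tau):
    """Sum of the image-to-text and text-to-image directions."""
    transposed = [list(col) for col in zip(*sim)]
    return loss_fn(sim, inv_tau) + loss_fn(transposed, inv_tau)


# Made-up cosine similarities: rows are images, columns are texts.
sim = [[1.0, 0.2],
       [0.1, 0.9]]
nce = symmetric(info_nce, sim, inv_tau=30.0)
loob = symmetric(info_loob, sim, inv_tau=30.0)
print(nce, loob)
```

Because the positive term is removed from the denominator, the InfoLOOB value is not bounded below by zero and keeps decreasing as matched pairs become more similar than mismatched ones, whereas InfoNCE saturates near zero.
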
Zeroshot evaluation of downstream tasks

We provide a Jupyter notebook to perform zeroshot evaluation with a trained model.
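
The notebook itself is not reproduced here. As a rough sketch of what zero-shot classification with a CLIP-style model generally involves, the example below builds one weight vector per class from prompt embeddings and assigns each image to the class with the highest cosine similarity. The toy vectors stand in for the outputs of the text and image encoders, and all function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_classifier(class_prompt_embeddings):
    """Average the normalized prompt embeddings of each class, then renormalize."""
    weights = []
    for prompt_embeddings in class_prompt_embeddings:
        normed = [normalize(e) for e in prompt_embeddings]
        dim = len(normed[0])
        mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
        weights.append(normalize(mean))
    return weights


def predict(image_embedding, classifier):
    """Return the index of the class with the highest cosine similarity."""
    image_embedding = normalize(image_embedding)
    logits = [sum(a * b for a, b in zip(image_embedding, w)) for w in classifier]
    return max(range(len(logits)), key=logits.__getitem__)


# Two classes with two prompt embeddings each (made-up values).
classifier = build_classifier([
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
])
predictions = [predict(img, classifier) for img in ([0.8, 0.2], [0.1, 0.7])]
print(predictions)
```

Accuracy is then the fraction of images whose predicted index matches the ground-truth label.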



  • CLOOB for text-to-image search?

    I implemented a text-to-image search: the query text goes through the text encoder, the images go through the image encoder, and I retrieve the top images for the query. However, it doesn't work well with CLOOB.

    What is the main reason for this?

    opened by animemes-bot 2
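
For reference, the retrieval step described in this issue can be sketched as a cosine-similarity ranking. The toy vectors below stand in for encoder outputs and the function names are hypothetical; the sketch only shows the ranking logic and makes no claim about why retrieval quality differs between models.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k_images(text_embedding, image_embeddings, k=2):
    """Return the indices of the k images most similar to the text query."""
    query = normalize(text_embedding)
    scored = []
    for idx, emb in enumerate(image_embeddings):
        emb = normalize(emb)
        scored.append((sum(a * b for a, b in zip(query, emb)), idx))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [idx for _, idx in scored[:k]]


# Made-up embeddings for three images and one text query.
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ranking = top_k_images([0.9, 0.1], images, k=2)
print(ranking)
```

Normalizing both sides before taking the dot product matters here, since unnormalized embeddings would rank images by vector length as well as by direction.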
  • Error in zeroshot notebook with different checkpoint

    I changed the checkpoint to RN50x4 and I got an error in run(model, classifier, ...) function about mismatch sizes (below). Any idea what could be the issue?

    Calculating the text embeddings for all classes of the dataset
    100%|██████████| 500/500 [00:02<00:00, 171.59it/s]Calculating the image embeddings for all images of the dataset
      0%|          | 0/8 [00:50<?, ?it/s]
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-41-9989560228e7> in <module>()
          4 classifier = zero_shot_classifier(model, classnames, prompt_templates, device)
          5 print("Calculating the image embeddings for all images of the dataset", flush=True)
    ----> 6 accuracy = run(model, classifier, dataloader, device, accuracy_score)
          7 print('Zeroshot accuracy: ', accuracy.round(2))
    5 frames
    <ipython-input-32-7f9a8b257980> in run(model, classifier, dataloader, device, accuracy_metric)
         23             # predict
    ---> 24             image_features = model.encode_image(images)
         25             image_features /= image_features.norm(dim=-1, keepdim=True)
         26             logits = image_features @ classifier
    /content/drive/My Drive/MLExperiments/cloob/src/clip/ in encode_image(self, image)
        389     def encode_image(self, image):
    --> 390         return self.visual(image.type(self.dtype))
        392     def encode_text(self, text):
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    /content/drive/My Drive/MLExperiments/cloob/src/clip/ in forward(self, x)
        162         x = self.layer3(x)
        163         x = self.layer4(x)
    --> 164         x = self.attnpool(x)
        166         return x
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    /content/drive/My Drive/MLExperiments/cloob/src/clip/ in forward(self, x)
         69         x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1)  # NCHW -> (HW)NC
         70         x =[x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
    ---> 71         x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
         72         x, _ = F.multi_head_attention_forward(
         73             query=x, key=x, value=x,
    RuntimeError: The size of tensor a (50) must match the size of tensor b (82) at non-singleton dimension 0
    opened by nikky4D 2
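
The two sizes in the error message above can be reproduced with a short calculation, assuming the attention-pooling layer follows the ResNet design of OpenAI's CLIP: the backbone downsamples the input by a factor of 32, and the pooling layer holds one positional embedding per spatial location plus one for the mean token. Under that assumption, 50 tokens correspond to 224×224 inputs and 82 tokens to 288×288 inputs, which would suggest that the images were resized for RN50 while the RN50x4 checkpoint expects a larger resolution. The helper name is hypothetical.

```python
def attnpool_tokens(image_resolution, total_stride=32):
    """Number of tokens seen by the attention pool: H*W locations plus a mean token.

    Assumes the backbone reduces the spatial resolution by `total_stride`.
    """
    side = image_resolution // total_stride
    return side * side + 1


tokens_224 = attnpool_tokens(224)  # input size commonly used with RN50
tokens_288 = attnpool_tokens(288)  # input size commonly used with RN50x4
print(tokens_224, tokens_288)
```

If this is the cause, making the preprocessing resolution match the value in the checkpoint's model configuration should remove the mismatch.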
  • More checkpoints and zero shot comparison with clip

    Are there any more checkpoints? In the model_configs folder, I see several JSON files, but we can only download the RN50 and RN50x4 checkpoints.

    Also, is it possible to add a zero-shot comparison with the original CLIP models?

    opened by nikky4D 2
  • What do the training acc/loss graphs look like?

    I am trying to train a similar architecture in a self-supervised way. However, for the pretext task the loss plateaus relatively quickly. What did the accuracy and loss curves look like when this architecture was trained? Thank you.

    opened by rohan-mehta-1024 1
  • Availability of pretrained models

    Great work! Is it possible to have both pretrained model and configuration files to test the notebook? I found models and datasets in: but when I run the notebook I need the config file RN50.json

    Thank you very much

    opened by EnricoBeltramo 1
  • Existing Modern Hopfield Repository Not Used?

    Why is the existing code for the Modern Hopfield Net (which has its own GitHub repository) not used here? And if I wanted to use it instead, which arguments would I have to call it with to get the same result as here?

    opened by rohan-mehta-1024 1
  • Error in setup conda env: "Solving environment: failed"

    Hi, after cloning the GitHub repository I ran the command "conda env create --file environment.yml" in my conda base environment. The environment is not created and the following error appears:

    Collecting package metadata (repodata.json): done
    Solving environment: failed
      - lz4-c==1.9.3=h9c3ff4c_1
      - _openmp_mutex==4.5=1_llvm

    Running conda --version reports version 4.11.0. How can I install the environment correctly? Thanks in advance.

    opened by jek28 0
  • Implementing CUML-based linear probing

    The CLOOB paper mentions that it uses cuML-based logistic regression with the L-BFGS algorithm to utilize GPUs for efficiency. My implementation works fine on small datasets (e.g., CIFAR), but a CUDA out-of-memory error occurs on large-scale ImageNet.

    I have been stuck on this for quite a long time and cannot find useful help in the documentation or online. Would it be possible to provide a few code examples showing how to fix this problem?

    opened by ChenDelong1999 0
Institute for Machine Learning, Johannes Kepler University Linz
Software of the Institute for Machine Learning, JKU Linz
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

CLIP-GEN [Simplified Chinese][English] This project implements the paper "CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP" in PyTorch on the Fire-Flyer II (萤火二号) cluster. CLIP-GEN is a Language-F

null 75 Dec 29, 2022
Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network)

Deep Daze mist over green hills shattered plates on the grass cosmic love and attention a time traveler in the crowd life during the plague meditative

Phil Wang 4.4k Jan 3, 2023
Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

CLIP-GLaSS Repository for the paper Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search An in-browser demo is

Federico Galatolo 172 Dec 22, 2022
CLIP+FFT text-to-image

Aphantasia This is a text-to-image tool, part of the artwork of the same name. Based on CLIP model, with FFT parameterizer from Lucent library as a ge

vadim epstein 690 Jan 2, 2023
Navigating StyleGAN2 w latent space using CLIP

Navigating StyleGAN2 w latent space using CLIP an attempt to build sth with the official SG2-ADA Pytorch impl kinda inspired by Generating Images from

Mike K. 55 Dec 6, 2022
RANZCR-CLiP 7th Place Solution

RANZCR-CLiP 7th Place Solution This repository is WIP. (18 Mar 2021) Installation git clone

Hiroshechka Y 21 Oct 22, 2022
A containerized REST API around OpenAI's CLIP model.

OpenAI's CLIP — REST API This is a container wrapping OpenAI's CLIP model in a RESTful interface. Running the container locally First, build the conta

Santiago Valdarrama 48 Nov 6, 2022
A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.

Ryan Murdock has done it again, combining OpenAI's CLIP and the generator from a BigGAN! This repository wraps up his work so it is easily accessible to anyone who owns a GPU.

Phil Wang 2.3k Jan 9, 2023
Simple implementation of OpenAI CLIP model in PyTorch.

It was in January of 2021 that OpenAI announced two new models: DALL-E and CLIP, both multi-modality models connecting texts and images in some way. In this article we are going to implement CLIP model from scratch in PyTorch. OpenAI has open-sourced some of the code relating to CLIP model but I found it intimidating and it was far from something short and simple. I also came across a good tutorial inspired by CLIP model on Keras code examples and I translated some parts of it into PyTorch to build this tutorial totally with our beloved PyTorch!

Moein Shariatnia 226 Jan 5, 2023
A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

train-CLIP A PyTorch Lightning solution to training CLIP from scratch. Goal ⚽ Our aim is to create an easy to use Lightning implementation of OpenA

Cade Gordon 396 Dec 30, 2022
CLIP: Connecting Text and Image (Learning Transferable Visual Models From Natural Language Supervision)

CLIP (Contrastive Language–Image Pre-training) Experiments (Evaluation) Model Dataset Acc (%) ViT-B/32 (Paper) CIFAR100 65.1 ViT-B/32 (Our) CIFAR100 6

Myeongjun Kim 52 Jan 7, 2023
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

null 458 Jan 2, 2023
Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.

Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.

Nerdy Rodent 2.3k Jan 4, 2023
improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

null 310 Dec 28, 2022
An open source implementation of CLIP.

OpenCLIP Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training). The goal of this repository is to enable

null 2.7k Dec 31, 2022
Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP

Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP Abstract: We introduce a method that allows to automatically se

Daniil Pakhomov 134 Dec 19, 2022
Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized

VQGAN-CLIP-Docker About Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized This is a stripped and minimal dependency repository for running loca

Kevin Costa 73 Sep 11, 2022
CLIP + VQGAN / PixelDraw

clipit Yet Another VQGAN-CLIP Codebase This started as a fork of @nerdyrodent's VQGAN-CLIP code which was based on the notebooks of @RiversWithWings a

dribnet 276 Dec 12, 2022
Streamlit Tutorial (ex: stock price dashboard, cartoon-stylegan, vqgan-clip, stylemixing, styleclip, sefa)

Streamlit Tutorials Install pip install streamlit Run cd [directory] streamlit run --server.address --server.port [your port] # http:/

Jihye Back 30 Jan 6, 2023