CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP


Andreas Fürst* 1, Elisabeth Rumetshofer* 1, Viet Tran1, Hubert Ramsauer1, Fei Tang3, Johannes Lehner1, David Kreil2, Michael Kopp2, Günter Klambauer1, Angela Bitto-Nemling1, Sepp Hochreiter1 2

1 ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
2 Institute of Advanced Research in Artificial Intelligence (IARAI)
3 HERE Technologies
* Equal contribution


A detailed blog post on this paper is available at this link.

The full paper is available here.


Implementation of CLOOB

This repository contains the implementation of CLOOB used to obtain the results reported in the paper. The implementation is based on OpenCLIP, an open-source implementation of OpenAI's CLIP.

Setup

We provide an 'environment.yml' file to set up a conda environment with all required packages. Run the following commands to clone the repository and create the environment.

# Clone the repository and switch into the directory
git clone https://github.com/ml-jku/cloob
cd cloob

# Create the environment and activate it
conda env create --file environment.yml
conda activate cloob

# Additionally, webdataset needs to be installed from its git repository for pre-training on YFCC
pip install git+https://github.com/tmbdev/webdataset.git

# Add the directory to the PYTHONPATH environment variable
export PYTHONPATH="$PYTHONPATH:$PWD/src"
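
To quickly verify the setup, a minimal Python sanity check can be run (a sketch, not part of the repository; it only assumes the conda environment above is active):

# Check that the core dependencies import correctly and that a GPU is visible.
import torch          # main deep learning dependency
import webdataset     # required for YFCC pre-training

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())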

Data

For pre-training we use the two datasets supported by OpenCLIP, namely Conceptual Captions and YFCC.

Conceptual Captions

OpenCLIP already provides a script to download and prepare the Conceptual Captions dataset, which contains 2.89M training images and 13k validation images. First, download the Conceptual Captions URLs and then run the script gather_cc.py.

python3 src/data/gather_cc.py path/to/Train_GCC-training.tsv path/to/Validation_GCC-1.1.0-Validation.tsv

YFCC

We use the same subset of ~15M images from the YFCC100M dataset as CLIP. OpenAI provides a list of (line number, photo identifier, photo hash) entries for each image contained in this subset here.

For more information see YFCC100m Subset on OpenAI's github.
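
For illustration, the subset list can be used to filter the full YFCC100M metadata before preparing the data for training. The following is a minimal sketch (not part of this repository); the file names and the assumption that the photo identifier is the first metadata column are hypothetical:

# Minimal sketch: select the CLIP YFCC subset from the YFCC100M metadata dump.
import csv

def load_subset_ids(subset_path):
    """Collect the photo identifiers listed in the subset file (line number, photo id, photo hash)."""
    ids = set()
    with open(subset_path, newline="") as f:
        for line_number, photo_id, photo_hash in csv.reader(f, delimiter="\t"):
            ids.add(photo_id)
    return ids

def filter_metadata(metadata_path, subset_ids, output_path):
    """Keep only the metadata rows whose photo identifier belongs to the subset."""
    with open(metadata_path, newline="") as src, open(output_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for row in csv.reader(src, delimiter="\t"):
            if row[0] in subset_ids:  # assumes the photo identifier is the first column
                writer.writerow(row)

subset_ids = load_subset_ids("yfcc100m_subset_data.tsv")                    # hypothetical file name
filter_metadata("yfcc100m_dataset.tsv", subset_ids, "yfcc15m_metadata.tsv") # hypothetical file names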

Downstream Tasks

In the paper we report results on several downstream tasks. Except for ImageNet, we provide links to pre-processed versions (where necessary) of the respective test sets.

| Dataset | Description | Official | Processed |
|---|---|---|---|
| Birdsnap | Images of North American bird species; our copy is smaller than reported in CLIP because some samples are no longer available. | Link | Link |
| Country211 | Published in CLIP as a small subset of the YFCC100m dataset. It consists of photos that can be assigned to 211 countries via GPS coordinates. For each country, 200 photos are sampled for the training set and 100 for the test set. | Link | Link |
| Flowers102 | Images of 102 flower categories commonly occurring in the United Kingdom. Several classes are very similar, and there is large variation in scale, pose and lighting. | Link | Link |
| GTSRB | Released for a challenge held at IJCNN 2011. The dataset contains images of German traffic signs from more than 40 classes. | Link | Link |
| Stanford Cars | Images of 196 car models at the level of make, model and year (e.g. Tesla Model S Sedan 2012). | Link | Link |
| UCF101 | Created by extracting the middle frame from each video. | Link | Link |
| ImageNet | Spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images. | Link | - |
| ImageNet v2 | New test data for the ImageNet benchmark. | Link | - |

Usage

The following is an example command for pre-training on Conceptual Captions with an effective batch size of 512 when run on 4 GPUs (4 x 128 per-GPU batch size).

python -u src/training/main.py \
--train-data="/conceptual_captions/Train-GCC-training_output.csv" \
--val-data="/conceptual_captions/Validation_GCC-1.1.0-Validation_output.csv" \
--path-data="/conceptual_captions" \
--imagenet-val="/imagenet/val" \
--warmup 20000 \
--batch-size=128 \
--lr=1e-3 \
--wd=0.1 \
--lr-scheduler="cosine-restarts" \
--restart-cycles=10 \
--epochs=70 \
--method="cloob" \
--init-inv-tau=30 \
--init-scale-hopfield=8 \
--workers=8 \
--model="RN50" \
--dist-url="tcp://127.0.0.1:6100" \
--batch-size-eval=512

Zeroshot evaluation of downstream tasks

We provide a Jupyter notebook to perform zeroshot evaluation with a trained model.
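
Conceptually, the zeroshot evaluation in the notebook boils down to building one text prototype per class from prompt templates and classifying each image by cosine similarity. The sketch below illustrates this idea; model, tokenize, classnames, templates and dataloader are assumed to be provided (e.g. by the notebook), and the function names are hypothetical rather than the notebook's exact API.

# Rough sketch of zero-shot classification with a trained image/text encoder pair.
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_zero_shot_classifier(model, tokenize, classnames, templates, device):
    """Average the text embeddings of all prompt templates to get one prototype per class."""
    weights = []
    for name in classnames:
        texts = tokenize([t.format(name) for t in templates]).to(device)
        emb = F.normalize(model.encode_text(texts), dim=-1)    # (num_templates, dim)
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))   # class prototype
    return torch.stack(weights, dim=1)                         # (dim, num_classes)

@torch.no_grad()
def zero_shot_accuracy(model, classifier, dataloader, device):
    """Classify each image by cosine similarity to the class prototypes and report accuracy."""
    correct, total = 0, 0
    for images, labels in dataloader:
        feats = F.normalize(model.encode_image(images.to(device)), dim=-1)
        preds = (feats @ classifier).argmax(dim=-1)
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total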

LICENSE

MIT LICENSE

Comments
  • CLOOB for text-to-image search?

    I implemented a text-to-image search: I run the query through the text encoder and the images through the image encoder, then retrieve the top images for each query. However, it doesn't work well with CLOOB.

    What could be the main reason for this?

    opened by animemes-bot 2
  • Error in zeroshot notebook with different checkpoint

    I changed the checkpoint to RN50x4 and got an error about mismatched sizes in the run(model, classifier, ...) function (see below). Any idea what the issue could be?

    Calculating the text embeddings for all classes of the dataset
    100%|██████████| 500/500 [00:02<00:00, 171.59it/s]
    Calculating the image embeddings for all images of the dataset
    
      0%|          | 0/8 [00:50<?, ?it/s]
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-41-9989560228e7> in <module>()
          4 classifier = zero_shot_classifier(model, classnames, prompt_templates, device)
          5 print("Calculating the image embeddings for all images of the dataset", flush=True)
    ----> 6 accuracy = run(model, classifier, dataloader, device, accuracy_score)
          7 print('Zeroshot accuracy: ', accuracy.round(2))
    
    5 frames
    <ipython-input-32-7f9a8b257980> in run(model, classifier, dataloader, device, accuracy_metric)
         22 
         23             # predict
    ---> 24             image_features = model.encode_image(images)
         25             image_features /= image_features.norm(dim=-1, keepdim=True)
         26             logits = image_features @ classifier
    
    /content/drive/My Drive/MLExperiments/cloob/src/clip/model.py in encode_image(self, image)
        388 
        389     def encode_image(self, image):
    --> 390         return self.visual(image.type(self.dtype))
        391 
        392     def encode_text(self, text):
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    /content/drive/My Drive/MLExperiments/cloob/src/clip/model.py in forward(self, x)
        162         x = self.layer3(x)
        163         x = self.layer4(x)
    --> 164         x = self.attnpool(x)
        165 
        166         return x
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    /content/drive/My Drive/MLExperiments/cloob/src/clip/model.py in forward(self, x)
         69         x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1)  # NCHW -> (HW)NC
         70         x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
    ---> 71         x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
         72         x, _ = F.multi_head_attention_forward(
         73             query=x, key=x, value=x,
    
    RuntimeError: The size of tensor a (50) must match the size of tensor b (82) at non-singleton dimension 0
    
    
    opened by nikky4D 2
  • More checkpoints and zero shot comparison with clip

    Hi,

    Are there any more checkpoints? In the model_configs folder I see several JSON files, but only the RN50 and RN50x4 checkpoints can be downloaded.

    Also, is it possible to add zero shot comparison with original clip models?

    opened by nikky4D 2
  • What do the training acc/loss graphs look like?

    Hello,

    I am trying to train a similar architecture in a self-supervised way. However, the loss for the pretext task plateaus relatively quickly. What did the accuracy and loss curves look like when this architecture was trained? Thank you.

    opened by rohan-mehta-1024 1
  • Availability of pretrained models

    Great work! Is it possible to get both the pretrained models and the configuration files needed to test the notebook? I found models and datasets at https://ml.jku.at/research/CLOOB/downloads, but when I run the notebook I need the config file RN50.json.

    Thank you very much

    opened by EnricoBeltramo 1
  • Existing Modern Hopfield Repository Not Used?

    Why is the existing code for the Modern Hopfield Network (which has its own GitHub repository) not used here? And if I wanted to use it instead, what arguments would I have to call it with to get the same result as here?

    opened by rohan-mehta-1024 1
  • Error in setup conda env: "Solving environment: failed"

    Hi, after cloning the GitHub repository I ran the command "conda env create --file environment.yml" in my conda base environment. The environment is not created and the following error appears:

    Collecting package metadata (repodata.json): done
    Solving environment: failed
    
    ResolvePackageNotFound:
      - lz4-c==1.9.3=h9c3ff4c_1
      - _openmp_mutex==4.5=1_llvm
      ...
    

    Checking the conda version (conda --version) gives 4.11.0. What can I do to correctly install the environment? Thanks in advance.

    opened by jek28 0
  • Implementing CUML-based linear probing

    The CLOOB paper mentions that it used cuML-based logistic regression with the L-BFGS algorithm to utilize GPUs for efficiency. My implementation works fine on small datasets (e.g., CIFAR), but I run out of CUDA memory when dealing with large-scale ImageNet.

    I have been stuck here for quite a long time, and I cannot find useful support in the documentation or on the Internet. Would it be possible to provide a few code examples showing how to fix this problem?

    opened by ChenDelong1999 0
Owner
Institute for Machine Learning, Johannes Kepler University Linz
Software of the Institute for Machine Learning, JKU Linz