Contrastive Language-Image Pretraining

Related tags

Deep Learning CLIP
Overview

CLIP

[Blog] [Paper] [Model Card] [Colab]

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.

Approach

[CLIP approach diagram]

Usage

First, install PyTorch 1.7.1 and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine or cpuonly when installing on a machine without a GPU.
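
Once installed, a quick sanity check (a minimal sketch; it only uses torch and the clip package installed above) is to confirm that the package imports and whether a CUDA device is visible:

import torch
import clip

# True means the CUDA build of PyTorch sees a GPU; with cpuonly this prints False
# and CLIP will run on the CPU instead.
print("CUDA available:", torch.cuda.is_available())
print("clip imported from:", clip.__file__)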

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

API

The CLIP module clip provides the following methods:

clip.available_models()

Returns the names of the available CLIP models.
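
For example (a minimal sketch; the exact list depends on the installed release, but it includes names such as "RN50" and "ViT-B/32"):

import clip

# Each returned name can be passed directly to clip.load().
for name in clip.available_models():
    print(name)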

clip.load(name, device=..., jit=False)

Returns the model and the TorchVision transform needed by the model, specified by the model name returned by clip.available_models(). It will download the model as necessary. The name argument can also be a path to a local checkpoint.

The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When jit is False, a non-JIT version of the model will be loaded.
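
For example (a sketch; the local checkpoint path in the second call is hypothetical and only illustrates the name-as-path variant described above):

import clip

# Load by name; the weights are downloaded and cached automatically.
model, preprocess = clip.load("ViT-B/32", device="cpu", jit=False)

# Alternatively, point name at a checkpoint file that is already on disk
# (hypothetical path, shown for illustration only):
# model, preprocess = clip.load("/path/to/ViT-B-32.pt", device="cpu")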

clip.tokenize(text: Union[str, List[str]], context_length=77)

Returns a LongTensor containing tokenized sequences of the given text input(s). This can be used as the input to the model.
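
For example (a minimal sketch based on the signature above):

import clip

# Each input string becomes one row, padded to the 77-token context length.
tokens = clip.tokenize(["a diagram", "a dog", "a cat"])
print(tokens.shape)  # torch.Size([3, 77])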


The model returned by clip.load() supports the following methods:

model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

model.encode_text(text: Tensor)

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

model(image: Tensor, text: Tensor)

Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
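
The relationship above can be checked by hand (a sketch; it reuses the model, image, and text tensors from the usage example earlier and reproduces the scaling described in this section, rather than calling any additional API):

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)

    # Recompute the same scores from the raw features: cosine similarity times 100.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    manual_logits = 100.0 * image_features @ text_features.T

print(logits_per_image)
print(manual_logits)  # should closely match logits_per_image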

More Examples

Zero-Shot Prediction

The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the CIFAR-100 dataset, and predicts the most likely labels among the 100 textual labels from the dataset.

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output will look like the following (the exact numbers may be slightly different depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.

Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
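
One way to run that sweep (a sketch, not the exact procedure from the paper: it holds out part of the CIFAR-100 training features computed above as a validation split and searches C on a log scale; the split size and grid are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hold out the last 5,000 training examples as a validation split (illustrative choice).
sub_features, val_features = train_features[:-5000], train_features[-5000:]
sub_labels, val_labels = train_labels[:-5000], train_labels[-5000:]

best_c, best_acc = None, 0.0
for c in np.logspace(-3, 3, 7):  # coarse log-scale grid; refine around the best value afterwards
    clf = LogisticRegression(random_state=0, C=c, max_iter=1000)
    clf.fit(sub_features, sub_labels)
    acc = (clf.predict(val_features) == val_labels).mean()
    if acc > best_acc:
        best_c, best_acc = c, acc

print(f"Best C = {best_c:g} (validation accuracy = {100 * best_acc:.2f}%)")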

Comments
  • Image similarity?

    Image similarity?

    Incredible work as always, you guys! Looking at the Colab, it seems it's possible to do image-to-text similarity, but I'm curious whether it's possible to compare image-to-image similarity as well.

    For instance, if I just replace 'text_features' with 'image_features', would that work / be the best way to do this?

    image_features /= image_features.norm(dim=-1, keepdim=True)
    image_2_features /= image_2_features.norm(dim=-1, keepdim=True)
    similarity = image_2_features.cpu().numpy() @ image_features.cpu().numpy().T
    
    opened by youssefavx 22
  • Bigger models release?

    Bigger models release?

    Hi, thanks for these amazing results and for releasing the code and the ViT-B/32 weights! Do you plan to also release the 3 bigger models you mention in the paper?

    opened by rom1504 19
  • Best practice for mixed-precision training

    Best practice for mixed-precision training

    I am trying to fine-tune CLIP models on new datasets. What's the best practice for mixed-precision training?

    Using Adam, I get either nan or inf errors, since the eps attribute is hard to specify for both Half and float32 parameters. My workaround is to divide the parameters into two groups and specify a different eps for each. Are there better solutions?

    opened by qingerVT 17
  • KITTI dataset

    KITTI dataset

    The KITTI dataset has a lot of annotated information, as I found in the TensorFlow KITTI documentation. It's mentioned that the task is to recognize the distance to the nearest car. However, I'm unable to locate anything with the same number of classes (4) in Table 9. Could you please share more details about which ground truth you use in the linear probe?

    opened by meigaoms 12
  • NaN in downstream detection fine-tuning task

    NaN in downstream detection fine-tuning task

    I am trying to use your RN50.pt as a backbone for a Faster R-CNN COCO detection task in the mmdetection architecture. First, I modified the layer names in RN50.pt and re-saved the weights so that the parameters load correctly. Then, I modified ResNetV1d to match the ModifiedResNet in the CLIP model. During fine-tuning, I do not use your attention layer and just feed the outputs of reslayer4, reslayer3, reslayer2, and reslayer1 to the FPN.

    opened by launchauto 11
  • Training CLIP-ViT

    Training CLIP-ViT

    @jongwook Thanks for this great work!

    I am trying to train CLIP ViT-B/32 from scratch, but cannot get a higher ImageNet score than CLIP ResNet-50. May I ask what initialization you use when training the ViT?

    In the paper: We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.

    opened by Meteorix 9
  • About the ImageNet zero-shot performance with the released models

    About the ImageNet zero-shot performance with the released models

    Hi, CLIP authors,

    Really great work! Appreciate much for releasing the code!

    Recently, I have been trying to evaluate the two released models (RN50 and ViT-B/32) on the ImageNet validation set. The numbers I get with prompt engineering but without ensembling are shown below:

    ResNet-50: top-1 55.09, top-5 83.59
    ViT-B/32: top-1 59.06, top-5 85.59

    Not sure whether these numbers match those on your side. As a reference for our trial and error, could you report the validation accuracies for these two models?

    thanks, Jianwei

    opened by jwyang 9
  • NotImplementedError when clip.load

    NotImplementedError when clip.load

    When running model, preprocess = clip.load("ViT-B/32", device=device), I get the following error:

    D:\Anaconda\anaconda3\envs\A\lib\site-packages\torch\serialization.py:591: UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)
      " silence this warning)", UserWarning)
    Traceback (most recent call last):
      File "D:\python\helpers\pydev\pydevd.py", line 1483, in _exec
        pydev_imports.execfile(file, globals, locals)  # execute the script
      File "D:\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
        exec(compile(contents + "\n", file, 'exec'), glob, loc)
      File "D:\models.py", line 191, in <module>
        model = CLIP(cfg)
      File "D:\models.py", line 173, in __init__
        self.model, self.preprocess = clip.load("ViT-B/32", device=device)
      File "D:\Anaconda\anaconda3\envs\A\lib\site-packages\clip\clip.py", line 135, in load
        model = build_model(state_dict or model.state_dict()).to(device)
      File "D:\Anaconda\anaconda3\envs\A\lib\site-packages\clip\model.py", line 396, in build_model
        vit = "visual.proj" in state_dict
      File "D:\Anaconda\anaconda3\envs\A\lib\site-packages\torch\jit\_script.py", line 624, in __contains__
        return self.forward_magic_method("__contains__", key)
      File "D:\Anaconda\anaconda3\envs\A\lib\site-packages\torch\jit\_script.py", line 611, in forward_magic_method
        raise NotImplementedError()
    NotImplementedError

    How do I solve this problem?

    opened by DtYXs 8
  • Is there a way to only use the text encoder?

    Is there a way to only use the text encoder?

    Hey! I'd like to use only one part of the model in my work, specifically the text encoder. I don't want to keep the whole model in GPU memory just to use the text-encoding part. Is there a simple way to do that, or will I have to dive into the code myself?

    Thanks for the help! :)

    opened by ranran9991 8
  • NaN values after a single gradient step

    NaN values after a single gradient step

    Hi!

    Using PyTorch 1.7.1, I get NaN values after a single parameter update:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import clip
    
    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.model, _ = clip.load('RN50')
    
        def forward(self, imgs, tokens):
            image_features = self.model.encode_image(imgs)
            match_text_features = self.model.encode_text(tokens)
    
            image_features = image_features /  image_features.norm(dim=-1, keepdim=True)
            match_text_features = match_text_features / match_text_features.norm(dim=-1, keepdim=True)
    
            similarity_match = image_features @ match_text_features.T
            return similarity_match
    
    def compute_loss(similarity_match, labels):
        loss1 = F.cross_entropy(similarity_match, labels)
        loss2 = F.cross_entropy(similarity_match.T, labels)
        loss = (loss1 + loss2) / 2
        return loss
    
    model = Model().cuda()
    optimizer = torch.optim.Adam(model.parameters())
    
    imgs = torch.randn(8, 3, 224, 224).cuda()
    tokens = torch.randint(high=1000, size=(8, 77)).cuda()
    labels = torch.arange(8).cuda()
    
    similarity_match = model(imgs, tokens)
    loss = compute_loss(similarity_match, labels)
    loss.backward()
    optimizer.step()
    
    print(model(imgs, tokens))
    

    Output:

    tensor([[nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan]], device='cuda:0',
          dtype=torch.float16, grad_fn=<MmBackward>)
    
    opened by MartinPernus 8
  • Hyperparameter sweep in Evaluation (linear probe)

    Hyperparameter sweep in Evaluation (linear probe)

    Hi there, I'm trying to reproduce the evaluation scores in this paper, particularly Table 10. Section A.3 (Evaluation) on page 38 mentions that the L2 regularization strength lambda is determined with a hyperparameter sweep. (1) Only a maximum of 1,000 iterations is mentioned for L-BFGS. Do other parameters matter, such as the learning rate? (2) For the parametric binary search, is the cost function monotonic in lambda?

    Thank you!

    opened by meigaoms 8
  • Why does CLIP always need softmax and not simple cosine similarity?

    Why does CLIP always need softmax and not simple cosine similarity?

    I would like to use CLIP embeddings for text and images in Elasticsearch. It seems that CLIP always needs at least two text inputs for every image and applies a softmax over them. Is there a way to generate embeddings so that I can directly use simple cosine similarity between one text input and one image input? Doing the softmax in Elasticsearch for two text inputs and one image embedding at runtime is complicated and expensive. For other embeddings, like BERT, we can use cosine similarity directly.

    Here is sample code. How can I avoid softmax at runtime and just use one text input per image?

    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_inputs)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        print(image_features.shape)
        print(text_features.shape)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
        print(similarity)
        similarity = np.exp(similarity) / np.sum(np.exp(similarity), axis=0)

    opened by evergreenllc2020 0
  • imagenet-21k class name

    imagenet-21k class name

    Thanks for your inspiring and effective work. As the paper mentions, you treat classification on ImageNet-1k as retrieval, and you did very meticulous work on the class names for prompt engineering (polysemy). We now hope to do retrieval on ImageNet-21k; could you please share your class-name mapping txt for that dataset if possible? Or do you have any advice for this work: is manual inspection the only way to convert the dataset?

    Looking forward to your reply.

    opened by amandaluof 0
  • UCF-101 train-test split

    UCF-101 train-test split

    In the published CLIP paper, the reported train-test split sizes are: train = 9,537, test = 1,794.

    In the original UCF-101 repo (https://www.crcv.ucf.edu/data/UCF101.php#:~:text=UCF101%20is%20an%20action%20recognition,which%20has%2050%20action%20categories.) the sizes given are: train = 9,537, test = 3,783.

    Could you please point me to the train and test split you have used to report results in the paper?

    opened by owaisCS 0
  • Embedding sequence of images into CLIP

    Embedding sequence of images into CLIP

    Hi,

    Is there a way to use CLIP to embed whole albums of photos and then check their similarity with phrases describing what the album is about? I have a hard time seeing how I would encode a sequence of images with CLIP. Maybe there are some papers about this out there?

    Thanks for any suggestion!

    opened by justlike-prog 0
  • Why does tokenized_prompts.argmax point to 49407, '<|endoftext|>'?

    Why does tokenized_prompts.argmax point to 49407, '<|endoftext|>'?

    Can '<|endoftext|>' represent the global information of tokenized_prompts? Why is tokenized_prompts.argmax(dim=-1) the '<|endoftext|>' token (49407), like the cls_token of a transformer? Thanks

    opened by Harzva 1
Owner
OpenAI
[NeurIPS 2021 Spotlight] Aligning Pretraining for Detection via Object-Level Contrastive Learning

SoCo [NeurIPS 2021 Spotlight] Aligning Pretraining for Detection via Object-Level Contrastive Learning By Fangyun Wei*, Yue Gao*, Zhirong Wu, Han Hu,

Yue Gao 139 Dec 14, 2022
magiCARP: Contrastive Authoring+Reviewing Pretraining

magiCARP: Contrastive Authoring+Reviewing Pretraining Welcome to the magiCARP API, the test bed used by EleutherAI for performing text/text bi-encoder

EleutherAI 43 Dec 29, 2022
EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Frustratingly Simple Pretraining Alternatives to Masked Language Modeling This is the official implementation for "Frustratingly Simple Pretraining Al

Atsuki Yamaguchi 31 Nov 18, 2022
PyTorch original implementation of Cross-lingual Language Model Pretraining.

XLM NEW: Added XLM-R model. PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes: Monolingual language model pretrain

Facebook Research 2.7k Dec 27, 2022
Code for generating a single image pretraining dataset

Single Image Pretraining of Visual Representations As shown in the paper A critical analysis of self-supervision, or what we can learn from a single i

Yuki M. Asano 12 Dec 19, 2022
Re-implementation of the Noise Contrastive Estimation algorithm for pyTorch, following "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." (Gutmann and Hyvarinen, AISTATS 2010)

Noise Contrastive Estimation for pyTorch Overview This repository contains a re-implementation of the Noise Contrastive Estimation algorithm, implemen

Denis Emelin 42 Nov 24, 2022
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

DeCLIP Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. Our paper is available in arxiv Updates ** Ou

Sense-GVT 470 Dec 30, 2022
CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

CLIP-Indonesian CLIP (Radford et al., 2021) is a multimodal model that can connect images and text by training a vision encoder and a text encoder joi

Galuh 17 Mar 10, 2022
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [Paper] [Project Website] This repository holds the source code, pretra

Humam Alwassel 83 Dec 21, 2022
Official Pytorch Implementation of: "ImageNet-21K Pretraining for the Masses"(2021) paper

ImageNet-21K Pretraining for the Masses Paper | Pretrained models Official PyTorch Implementation Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, Lihi Zelni

null 574 Jan 2, 2023
[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

SapBERT: Self-alignment pretraining for BERT This repo holds code for the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining

Cambridge Language Technology Lab 104 Dec 7, 2022
When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings

When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings This is the repository for t

RegLab 39 Jan 7, 2023
Pretraining Representations For Data-Efficient Reinforcement Learning

Pretraining Representations For Data-Efficient Reinforcement Learning Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Ch

Mila 40 Dec 11, 2022
ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information This repository contains code, model, dataset for ChineseBERT at ACL2021. Ch

null 413 Dec 1, 2022
DETReg: Unsupervised Pretraining with Region Priors for Object Detection

DETReg: Unsupervised Pretraining with Region Priors for Object Detection Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik

Amir Bar 283 Dec 27, 2022
Does Pretraining for Summarization Reuqire Knowledge Transfer?

Pretraining summarization models using a corpus of nonsense

Approximately Correct Machine Intelligence (ACMI) Lab 12 Dec 19, 2022
The PASS dataset: pretrained models and how to get the data - PASS: Pictures without humAns for Self-Supervised Pretraining

The PASS dataset: pretrained models and how to get the data - PASS: Pictures without humAns for Self-Supervised Pretraining

Yuki M. Asano 249 Dec 22, 2022
Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation Introduction Getting Started FSD50K Recipe AudioSet Recipe Label E

Yuan Gong 84 Dec 27, 2022