A concise but complete implementation of CLIP with various experimental improvements from recent papers

Overview

x-clip (wip)

A concise but complete implementation of CLIP with various experimental improvements from recent papers

Install

$ pip install x-clip

Usage

import torch
from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True   # whether to use fine-grained contrastive learning (FILIP)
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()

Citations

@misc{radford2021learning,
    title   = {Learning Transferable Visual Models From Natural Language Supervision}, 
    author  = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
    year    = {2021},
    eprint  = {2103.00020},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{yao2021filip,
    title   = {FILIP: Fine-grained Interactive Language-Image Pre-Training}, 
    author  = {Lewei Yao and Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},
    year    = {2021},
    eprint  = {2111.07783},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
Comments
  • Model forward outputs to text/image similarity score

    Model forward outputs to text/image similarity score

    Any insight on how to take the image/text embeddings (or nominal model forward output) to achieve a simple similarity score as done in the huggingface implementation? HF example here

    In the original paper I see the dot products of the image/text encoder outputs were used, but here I was having troubles with the dimensions on the outputs.

    opened by paulcjh 12
  • Using different encoders in CLIP

    Using different encoders in CLIP

    Hi, I am wondering if it was possible to use different encoders in CLIP ? For images not using vit but resnet for example. And is it possible to replace the text encoder by a features encoder for example ? If I have a vector of features for a given image and I want to use x-clip how should I do that ? I have made a code example that doesnt seems to work, here is what I did:

    import torch
    from x_clip import CLIP
    import torch.nn as nn
    from torchvision import models
    
    class Image_Encoder(torch.nn.Module):
        #output size is (bs,512)
        def __init__(self):
            super(Image_Encoder, self).__init__()
            self.model_pre = models.resnet18(pretrained=False)
            self.base=nn.Sequential(*list(self.model_pre.children()))
            self.base[0]=nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
            self.resnet=self.base[:-1]
    
        def forward(self, x):
            out=self.resnet(x).squeeze()
            return out
    
    
    class features_encoder(torch.nn.Module):
        #output size is (bs,512)
        def __init__(self):
            super(features_encoder, self).__init__()
            self.model =nn.Linear(2048,512)
    
        def forward(self, x):
            out=self.model(x)
            return out
    
    images_encoder=Image_Encoder()
    features_encoder=features_encoder()
    
    clip = CLIP(
        image_encoder = images_encoder,
        text_encoder = features_encoder,
        dim_image = 512,
        dim_text = 512,
        dim_latent = 512
    )
    
    features= torch.randn(4,2048)
    images = torch.randn(4, 3, 256, 256)
    
    loss = clip(features, images, return_loss = True)
    loss.backward()
    

    but I got the following error : forward() takes 2 positional arguments but 3 were given

    Thanks

    opened by ethancohen123 8
  • Visual ssl with channels different than 3

    Visual ssl with channels different than 3

    Hi, seems to be a bug when trying to use visual ssl with a different number of channel than 3 . I think the error came from the visual ssl type ~row 280 here:

    #send a mock image tensor to instantiate parameters self.forward(torch.randn(1, 3, image_size, image_size))

    opened by ethancohen123 4
  • Allow other types of visual  SSL when initiating CLIP

    Allow other types of visual SSL when initiating CLIP

    In the following code as part of CLIP.__init__

            if use_visual_ssl:
                if visual_ssl_type == 'simsiam':
                    ssl_type = SimSiam
                elif visual_ssl_type == 'simclr':
                    ssl_type = partial(SimCLR, temperature = simclr_temperature)
                else:
                    raise ValueError(f'unknown visual_ssl_type')
    
                self.visual_ssl = ssl_type(
                    self.visual_transformer,
                    image_size = visual_image_size,
                    hidden_layer = visual_ssl_hidden_layer
                )
    

    the visual self-supervised learning is hardcoded. I would suggest changing this to accept the visual SSL module as an argument when instantiating CLIP to allow flexibility in the same manner as it does for the image encoder and text encoder.

    Example:

    barlow = BarlowTwins(augmentatation_fns)
    clip = CLIP(..., visual_ssl=barlow)
    
    opened by Froskekongen 4
  • Extract Text and Image Latents

    Extract Text and Image Latents

    Hi, in the current implementation we can only extract text and image embedding (by set return_encodings=True) which are obtained before applying latent linear layers. Isn't it better to add an option to extract latent embeddings? Another importance of this is that with the current code, it is impossible to extract the similarity matrix between a batch of images and a batch of text.

    opened by mmsamiei 2
  • NaN with mock data

    NaN with mock data

    Hi lucidrains,

    Try this and it will NaN within 100 steps (latest Github code). The loss looks fine before NaN.

    import torch
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True    
    torch.backends.cudnn.benchmark = True
    
    import random
    import numpy as np
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    num_text_tokens = 10000
    batch_sz = 12
    text_seq_len = 256
    visual_image_size = 256
    
    # mock data
    
    data_sz = 1000
    all_text = torch.randint(0, num_text_tokens, (data_sz, text_seq_len)).cuda()
    all_images = torch.randn(data_sz, 3, visual_image_size, visual_image_size).cuda()
    
    text = torch.zeros((batch_sz, text_seq_len), dtype=torch.long).cuda()
    images = torch.zeros((batch_sz, 3, visual_image_size, visual_image_size)).cuda()
    
    ##########################################################################################
    
    import wandb
    import datetime
    wandb.init(project="Test", name=datetime.datetime.today().strftime('%Y-%m-%d-%H-%M-%S'), save_code=False)
    
    from x_clip import CLIP
    
    clip = CLIP(
        dim_text = 512,
        dim_image = 512,
        dim_latent = 512,
        num_text_tokens = num_text_tokens,
        text_enc_depth = 6,
        text_seq_len = text_seq_len,
        text_heads = 8,
        visual_enc_depth = 6,
        visual_image_size = visual_image_size,
        visual_patch_size = 32,
        visual_heads = 8,
        use_all_token_embeds = False,           # whether to use fine-grained contrastive learning (FILIP)
        decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
        extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
        use_visual_ssl = True,                  # whether to do self supervised learning on iages
        visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
        use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
        text_ssl_loss_weight = 0.05,            # weight for text MLM loss
        image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
    ).cuda()
    
    optimizer = torch.optim.Adam(clip.parameters(), lr=1e-4, betas=(0.9, 0.99))
    
    for step in range(999999):
        for i in range(batch_sz):
            data_id = random.randrange(0, data_sz - 1)
            text[i] = all_text[data_id]
            images[i] = all_images[data_id]
    
        loss = clip(
            text,
            images,
            freeze_image_encoder = False,   # whether to freeze image encoder if using a pretrained image net, proposed by LiT paper
            return_loss = True              # needs to be set to True to return contrastive loss
        )
        clip.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(clip.parameters(), 1.0)
        optimizer.step()
    
        now_loss = loss.item()
        wandb.log({"loss": now_loss}, step = step)
        print(step, now_loss)
    
        if 'nan' in str(now_loss):
            break
    
    opened by BlinkDL 1
  • Unable to train to convergence (small dataset)

    Unable to train to convergence (small dataset)

    Hi nice work with x-clip. Hoping to play around with it and eventually combine it into your DALLE2 work.

    Currently having some trouble training on roughly 30k image-text pairs. Loss eventually goes negative and starts producing Nan's. I've dropped learning rate down (1e-4) and I'm clipping gradients (max_norm=0.5).

    Any thoughts on what are sane training params/configs on such a small dataset using x-clip?

    opened by jacobwjs 9
Releases(0.12.0)
Owner
Phil Wang
Working with Attention. It's all we need
Phil Wang
A simple but complete full-attention transformer with a set of promising experimental features from various papers

x-transformers A concise but fully-featured transformer, complete with a set of promising experimental features from various papers. Install $ pip ins

Phil Wang 2.3k Jan 3, 2023
arxiv-sanity, but very lite, simply providing the core value proposition of the ability to tag arxiv papers of interest and have the program recommend similar papers.

arxiv-sanity, but very lite, simply providing the core value proposition of the ability to tag arxiv papers of interest and have the program recommend similar papers.

Andrej 671 Dec 31, 2022
FuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space OptimizationFuseDream: Training-Free Text-to-Image Generationwith Improved CLIP+GAN Space Optimization

FuseDream This repo contains code for our paper (paper link): FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimizat

XCL 191 Dec 31, 2022
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

CLIP-GEN [简体中文][English] 本项目在萤火二号集群上用 PyTorch 实现了论文 《CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP》。 CLIP-GEN 是一个 Language-F

null 75 Dec 29, 2022
Rainbow: Combining Improvements in Deep Reinforcement Learning

Rainbow Rainbow: Combining Improvements in Deep Reinforcement Learning [1]. Results and pretrained models can be found in the releases. DQN [2] Double

Kai Arulkumaran 1.4k Dec 29, 2022
Neural Scene Flow Fields using pytorch-lightning, with potential improvements

nsff_pl Neural Scene Flow Fields using pytorch-lightning. This repo reimplements the NSFF idea, but modifies several operations based on observation o

AI葵 178 Dec 21, 2022
The tl;dr on a few notable transformer/language model papers + other papers (alignment, memorization, etc).

The tl;dr on a few notable transformer/language model papers + other papers (alignment, memorization, etc).

Will Thompson 166 Jan 4, 2023
🍀 Pytorch implementation of various Attention Mechanisms, MLP, Re-parameter, Convolution, which is helpful to further understand papers.⭐⭐⭐

?? Pytorch implementation of various Attention Mechanisms, MLP, Re-parameter, Convolution, which is helpful to further understand papers.⭐⭐⭐

xmu-xiaoma66 7.7k Jan 5, 2023
DRLib:A concise deep reinforcement learning library, integrating HER and PER for almost off policy RL algos.

DRLib:A concise deep reinforcement learning library, integrating HER and PER for almost off policy RL algos A concise deep reinforcement learning libr

null 329 Jan 3, 2023
A clear, concise, simple yet powerful and efficient API for deep learning.

The Gluon API Specification The Gluon API specification is an effort to improve speed, flexibility, and accessibility of deep learning technology for

Gluon API 2.3k Dec 17, 2022
Implementation of TransGanFormer, an all-attention GAN that combines the finding from the recent GanFormer and TransGan paper

TransGanFormer (wip) Implementation of TransGanFormer, an all-attention GAN that combines the finding from the recent GansFormer and TransGan paper. I

Phil Wang 146 Dec 6, 2022
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

Gowthami Somepalli 284 Dec 21, 2022
PyTorch implementation of CloudWalk's recent work DenseBody

densebody_pytorch PyTorch implementation of CloudWalk's recent paper DenseBody. Note: For most recent updates, please check out the dev branch. Update

Lingbo Yang 401 Nov 19, 2022
A curated (most recent) list of resources for Learning with Noisy Labels

A curated (most recent) list of resources for Learning with Noisy Labels

Jiaheng Wei 321 Jan 9, 2023
Experimental Python implementation of OpenVINO Inference Engine (very slow, limited functionality). All codes are written in Python. Easy to read and modify.

PyOpenVINO - An Experimental Python Implementation of OpenVINO Inference Engine (minimum-set) Description The PyOpenVINO is a spin-off product from my

Yasunori Shimura 7 Oct 31, 2022
Experimental solutions to selected exercises from the book [Advances in Financial Machine Learning by Marcos Lopez De Prado]

Advances in Financial Machine Learning Exercises Experimental solutions to selected exercises from the book Advances in Financial Machine Learning by

Brian 1.4k Jan 4, 2023
A scientific and useful toolbox, which contains practical and effective long-tail related tricks with extensive experimental results

Bag of tricks for long-tailed visual recognition with deep convolutional neural networks This repository is the official PyTorch implementation of AAA

Yong-Shun Zhang 181 Dec 28, 2022
An experimental technique for efficiently exploring neural architectures.

SMASH: One-Shot Model Architecture Search through HyperNetworks An experimental technique for efficiently exploring neural architectures. This reposit

Andy Brock 478 Aug 4, 2022
Facilitating Database Tuning with Hyper-ParameterOptimization: A Comprehensive Experimental Evaluation

A Comprehensive Experimental Evaluation for Database Configuration Tuning This is the source code to the paper "Facilitating Database Tuning with Hype

DAIR Lab 9 Oct 29, 2022