train-CLIP πŸ“Ž

A PyTorch Lightning solution to training CLIP from scratch.

Goal ⚽

Our aim is to create an easy-to-use Lightning implementation of OpenAI's CLIP training procedure. We want the end product to be as in line with the original paper as possible. We will live by:

[Figure: excerpt from the CLIP paper]

TODO βœ…

  • Get OpenAI's model creation script
  • Create model inits
    • ResNet50
    • ResNet50x4
    • ResNet101
    • ViT-B/32
    • all models
  • Create model wrapper
  • Create lightning trainer
  • Create dataset files
  • Performance boosts (see the mixed-precision sketch after this list)
    • Mixed-precision
    • Gradient checkpointing
    • Half-precision Adam statistics
    • Half-precision stochastically rounded text encoder weights
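
Of the performance boosts above, mixed precision is the one Lightning exposes most directly. A minimal sketch, assuming a single CUDA GPU and the 1.x-era Trainer flags (not necessarily this repo's exact configuration):

    import pytorch_lightning as pl

    # precision=16 makes Lightning wrap training in torch.cuda.amp:
    # autocast for the forward pass, GradScaler for the backward pass.
    trainer = pl.Trainer(gpus=1, precision=16)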
Issues
  • COCO-style DataLoader

    I would love to start training with this! I helped write a DataLoader for the "COCO" format, i.e. images and text files containing line-separated captions. They are matched in the data loader via the unique basename of each file.

    https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/loader.py

    Would it be possible to port that data loader to this project? It is perhaps of interest to some folks I know with spare compute, and it would also be personally useful to me, because I have already converted a good deal of my collected datasets to this format.

    Thanks!
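
    For concreteness, a minimal sketch of a Dataset for that format; the paths and the caption-sampling choice are illustrative, not the linked loader's exact behavior:

    import random
    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset

    class BasenamePairedDataset(Dataset):
        """Pairs image files with same-basename .txt caption files."""

        def __init__(self, image_dir, text_dir, transform=None):
            self.texts = {p.stem: p for p in Path(text_dir).glob("*.txt")}
            # Keep only images that have a matching caption file.
            self.images = sorted(p for p in Path(image_dir).iterdir()
                                 if p.stem in self.texts)
            self.transform = transform

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            image = Image.open(self.images[idx]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            # Each .txt file holds line-separated captions; sample one.
            captions = self.texts[self.images[idx].stem].read_text().splitlines()
            return image, random.choice(captions)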

    opened by afiaka87 9
  • ModuleNotFoundError: No module named 'clip'

    Thanks for your great work! When I call train.py with my dataset, I get ModuleNotFoundError: No module named 'clip'. Am I missing something?
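
    If the missing module is OpenAI's clip package rather than a module shipped with this repo (an assumption worth checking against the repo's requirements), it can be installed straight from GitHub:

    pip install git+https://github.com/openai/CLIP.git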

    opened by Jiang15 8
  • Problem related to encoding text

    I am trying to use a resnet50 model that I created with this repo, but I can't encode text.

    with torch.no_grad():
        tmp = clip.tokenize("test")
        tmp = tmp.to(device)
        print(tmp)
        print(tmp.shape)
        text_encoded = model.model.encode_text(tmp)
    
    tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
    torch.Size([1, 77])
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-18-68003eb3bebb> in <module>()
          9     print(tmp)
         10     print(tmp.shape)
    ---> 11     text_encoded = model.model.encode_text(tmp)
         12 
    
    2 frames
    /content/train-CLIP/models/model.py in encode_text(self, text)
        343         x = x + self.positional_embedding.type(self.dtype)
        344         x = x.permute(1, 0, 2)  # NLD -> LND
    --> 345         x = self.transformer(x)
        346         x = x.permute(1, 0, 2)  # LND -> NLD
        347         x = self.ln_final(x).type(self.dtype)
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    /usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
        937         elif input_ids is not None:
        938             input_shape = input_ids.size()
    --> 939             batch_size, seq_length = input_shape
        940         elif inputs_embeds is not None:
        941             input_shape = inputs_embeds.size()[:-1]
    
    ValueError: too many values to unpack (expected 2)
    

    Printing x before self.transformer(x) gives torch.Size([77, 1, 512]).

    The input shape torch.Size([1, 77]) matches the original CLIP code, and a model loaded with clip itself seems to work without major problems. (Judging from the traceback above, self.transformer here is a Hugging Face BERT, which expects (batch_size, seq_length) token IDs rather than the permuted (seq_len, batch, dim) tensor.)

    import torch
    import clip
    from PIL import Image
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
    
    image = preprocess(Image.open("/test.png")).unsqueeze(0).to(device)
    text = clip.tokenize(["test"]).to(device)
    print(text)
    print(text.shape)
    
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    
    tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
    torch.Size([1, 77])
    

    Not sure what I am doing wrong, since encoding images does seem to work fine with this repo.

    with torch.no_grad():
        photos_features = model.model.encode_image(image)
        photos_features /= photos_features.norm(dim=-1, keepdim=True)
    
    print(photos_features.shape)
    
    torch.Size([1, 768])
    
    opened by styler00dollar 6
  • Multi GPU training

    Thanks for sharing the code.

    I am not familiar with Lightning. It seems that the Code supports multiGPU (https://github.com/Zasder3/train-CLIP/blob/8d454de1999af4be93b6d99b956e83f005ff70dd/models/wrapper.py#L43), but I am not sure how to initiate the multi-GPU training.

    Besides, just to confirm: the code does not initialize the weights from a pretrained model, correct?
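
    For reference, multi-GPU DDP training in Lightning of this era is driven by Trainer flags. A minimal sketch, using the PyTorch Lightning 1.3-era argument names (model and dm stand in for this repo's wrapper and datamodule; check train.py's argparse for the exact flags):

    import pytorch_lightning as pl

    # One process per GPU with DistributedDataParallel; Lightning handles
    # process launching, gradient all-reduce, and sampler sharding.
    trainer = pl.Trainer(gpus=4, accelerator="ddp")
    # trainer.fit(model, dm)  # model/dm: the repo's wrapper and datamodule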

    opened by niatzt 5
  • model checkpointing

    Hey, thank you for the Lightning implementation, just what I needed at the moment! However, I'm a little confused about model checkpointing. I would assume it automatically saves checkpoints to lightning_logs/checkpoints/, but after a full training run I didn't find anything saved in the checkpoints folder. Taking a deeper look into the repo, I can see at first glance that you didn't override that hook. I'm guessing the default checkpointing hook would not work since this is self-distillation (I'm using train_finetune.py, by the way). Let me know in case this is not expected behaviour.
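
    For anyone hitting the same thing, a minimal sketch of wiring up an explicit checkpoint callback; the directory and arguments are illustrative, not this repo's configuration:

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    # Save checkpoints to an explicit directory; save_top_k=-1 keeps them all.
    checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/", save_top_k=-1)
    trainer = Trainer(callbacks=[checkpoint_cb])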

    opened by sour4bh 3
  • About the accuracy computation.

    Thanks for sharing your code. I am a little bit confused about the accuracy computation at L69-70 in wrapper.py:

    acc_i = (torch.argmax(image_logits) == ground_truth).sum()
    acc_t = (torch.argmax(image_logits.t()) == ground_truth).sum()

    It seems that torch.argmax without a dim argument returns the index of the max value across the flattened tensor, while ground_truth indexes within each row or column. Should we change it to:

    acc_i = (torch.argmax(image_logits, 0) == ground_truth).sum()
    acc_t = (torch.argmax(image_logits.t(), 0) == ground_truth).sum()
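
    A tiny check confirms the concern: without dim, torch.argmax indexes into the flattened tensor and returns a single scalar rather than one prediction per row or column.

    import torch

    logits = torch.tensor([[0.9, 0.1],
                           [0.2, 0.8]])
    print(torch.argmax(logits))         # tensor(0): flattened-index argmax
    print(torch.argmax(logits, dim=0))  # tensor([0, 1]): per-column argmax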

    Thanks.

    opened by rookiecm 2
  • what's the meaning of minibatch_size?

    Thank you for your CLIP training code! That's great!

    Training with your new commit 8d454de, I get the following error:

    RuntimeError: The expanded size of the tensor (0) must match the existing size (8) at non-singleton dimension 0. Target sizes: [0, 1024]. Tensor sizes: [8, 1024]

    images_tmp[self.global_rank][j*self.minibatch_size:(j+1)*self.minibatch_size] = F.normalize(self.model.encode_image(mb), dim=1)

    Here minibatch_size = 0. Would you please explain the meaning of minibatch_size? How should it be set?
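
    For what it's worth, the slice semantics alone reproduce the shape in the error; this is a guess at the failure mode, not a verified reading of wrapper.py:

    import torch

    batch = torch.randn(8, 1024)  # a per-GPU batch of 8 embeddings
    minibatch_size = 0            # reproducing the suspected failure mode
    j = 0
    mb = batch[j * minibatch_size:(j + 1) * minibatch_size]
    print(mb.shape)  # torch.Size([0, 1024]): an empty slice, matching
                     # the [0, 1024] target size in the error above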

    opened by firestonelib 2
  • Error occurs when using DeepSpeed

    Hi @Zasder3, thank you for the great work!

    I was wondering whether you have tried DeepSpeed, because I saw this commit log (DeepSpeed Optimizer indexing). When I tried it by adding --plugins deepspeed_stage_2, I got the errors below.

    Traceback (most recent call last):
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
        self.train_loop.run_training_epoch()
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
        batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 743, in run_training_batch
        self._curr_step_result = self.training_step(
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
        training_step_output = self.trainer.accelerator.training_step(args)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
        return self.training_type_plugin.training_step(*args)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
        return self.model(*args, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in forward
        loss = self.module(*inputs, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
        return super().forward(*inputs, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
        output = self.module.training_step(*inputs, **kwargs)
      File "/home/shared/workspace/multimodal-matching/multimodal-matching/train-CLIP/models/wrapper.py", line 106, in training_step
        self.manual_backward(loss)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1252, in manual_backward
        self.trainer.train_loop.backward(loss, optimizer=None, opt_idx=None, *args, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 867, in backward
        self.trainer.accelerator.backward(result, optimizer, opt_idx, should_accumulate, *args, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 306, in backward
        self.training_type_plugin.pre_backward(closure_loss, should_accumulate, optimizer, optimizer_idx)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 311, in pre_backward
        if not self.lightning_module.automatic_optimization and self.model.require_backward_grad_sync:
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'DeepSpeedEngine' object has no attribute 'require_backward_grad_sync'
    

    The error occurs at the line below, where self.automatic_optimization = False is set. https://github.com/Zasder3/train-CLIP/blob/ab1c59359a8e729fe05fd99aecdddf1eb9f43843/models/wrapper.py#L81

    I could get DeepSpeed running by setting self.automatic_optimization = True and removing self.manual_backward(loss). (But it still needs some debugging, because the training pattern changes.)
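
    Roughly, that change looks like this (a toy sketch: the repo's training_step computes the contrastive loss rather than this placeholder):

    import pytorch_lightning as pl
    import torch

    class Wrapper(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # Let Lightning (and thus the DeepSpeed engine) own the backward pass.
            self.automatic_optimization = True
            self.layer = torch.nn.Linear(4, 1)

        def training_step(self, batch, batch_idx):
            loss = self.layer(batch).mean()
            # With automatic optimization, just return the loss;
            # no self.manual_backward(loss) or manual optimizer.step() here.
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())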

    My working environment is pytorch=1.9, cuda=11.1, pytorch-lightning=1.3.8. Thanks in advance!

    opened by kobiso 2
  • Fix calculation of logits for training loss

    The embeddings must be L2-normalized before calculating the logits. This fix corrects both the loss that is logged and the training loss. The validation loss was already correct, as the model's forward function normalizes the embeddings before computing the logits.
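
    The pattern the fix points at looks roughly like this (a sketch with placeholder tensors, not the actual diff):

    import torch
    import torch.nn.functional as F

    image_embeddings = torch.randn(8, 512)  # stand-ins for encoder outputs
    text_embeddings = torch.randn(8, 512)
    logit_scale = torch.tensor(100.0)       # exp of the learned temperature

    ims = F.normalize(image_embeddings, dim=-1)  # L2-normalize each row
    txt = F.normalize(text_embeddings, dim=-1)
    logits = logit_scale * ims @ txt.t()         # scaled cosine similarities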

    opened by bob80333 1
  • Looks like loss is wrong

    Isn't the similarity matrix sharded? If so, you would gather after image_logits = torch.cat(ims) @ torch.cat(txt).t() (line 66 of wrapper.py), not before.

    opened by a1526772 1
  • Assertion error

    Hi, can somebody please help me figure out why this error occurs?

    Using native 16bit precision.
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    /home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/configuration_validator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
      rank_zero_warn(f"you defined a {step_name} but have no {loader_name}. Skipping {stage} loop")
    Path FeatureStore
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    Traceback (most recent call last):
      File "train_finetune.py", line 33, in <module>
        main(args)
      File "train_finetune.py", line 23, in main
        trainer.fit(model, dm)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
        self._run(model)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 873, in _run
        self.accelerator.setup(self, model)  # note: this sets up self.lightning_module
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu.py", line 42, in setup
        return super().setup(trainer, model)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 88, in setup
        self.setup_optimizers(trainer)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 331, in setup_optimizers
        trainer=trainer, model=self.lightning_module
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 223, in init_optimizers
        return trainer.init_optimizers(model)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/optimizers.py", line 34, in init_optimizers
        optim_conf = model.configure_optimizers()
      File "/home/ubuntu/clip/train-CLIP/models/wrapper.py", line 343, in configure_optimizers
        warmup_steps=2000
      File "/home/ubuntu/.local/lib/python3.6/site-packages/cosine_annealing_warmup/scheduler.py", line 27, in __init__
        assert warmup_steps < first_cycle_steps
    AssertionError
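
    A likely cause (my reading of the trace, not confirmed against the repo): configure_optimizers passes warmup_steps=2000 to the cosine_annealing_warmup scheduler, which asserts warmup_steps < first_cycle_steps, and first_cycle_steps appears to be derived from the length of the training run. A short run reproduces the check in isolation:

    # Values are illustrative; only the inequality matters.
    first_cycle_steps = 1500  # e.g. steps_per_epoch * max_epochs for a short run
    warmup_steps = 2000       # hard-coded in configure_optimizers
    assert warmup_steps < first_cycle_steps  # AssertionError, as in the trace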

    opened by anirudha16101 1
  • Dataset structure

    Hi, I'm having a little trouble understanding the dataset structure I should follow in order to train with this package. Is it one parent folder containing one folder of images and one folder of their text files? If so, what should these subfolders be named?

    opened by tarunn2799 7
  • WebDataset support

    I think it could be pretty useful to add a webdataset loader to this, so that webdataset-format datasets can be used here. This is relevant as large webdatasets are starting to become available (one of 400M pairs is being built by the crawling@home effort).

    I think https://github.com/lucidrains/DALLE-pytorch/pull/280/files may be a good example of how to do it
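
    For reference, the basic reading pattern with the webdataset library looks roughly like this; the shard URL pattern and field extensions are assumptions about how the tars were written:

    import webdataset as wds

    # Stream (image, caption) pairs straight out of sharded .tar files,
    # local or over HTTP, without unpacking to individual files.
    urls = "data/shard-{000000..000099}.tar"  # hypothetical shard pattern
    dataset = (
        wds.WebDataset(urls)
        .decode("pil")           # decode image bytes to PIL images
        .to_tuple("jpg", "txt")  # yield (image, caption) per sample
    )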

    opened by rom1504 2
  • Explanation

    Could you please add a somewhat more detailed explanation of the code and of the self-distillation technique used to make it more efficient? Various parts of the code are hard to understand, so I would appreciate some explanation alongside the code as well.

    opened by Akashcodes732 0
  • Loading checkpoint

    Hi there. Can you give some advice on how to load a checkpoint from a trained model with your PyTorch Lightning wrapper for inference? I used the common PyTorch Lightning method load_from_checkpoint but have not had any luck so far. Thanks.
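
    Not the author, but the usual Lightning pattern is sketched below. The class name and checkpoint path are assumptions (check models/wrapper.py and your lightning_logs directory); if the wrapper's __init__ takes arguments that were not saved as hyperparameters, they must be passed to load_from_checkpoint as well:

    from models.wrapper import CLIPWrapper  # class name assumed

    model = CLIPWrapper.load_from_checkpoint(
        "lightning_logs/version_0/checkpoints/last.ckpt"  # path assumed
    )
    model.eval()  # inference mode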

    opened by turicumoperarius 2
  • "open_clip"

    https://github.com/mlfoundations/open_clip

    opened by afiaka87 1
  • About Gradient Checkpointing

    I wrote some code for gradient checkpointing. Can I contribute it to this project?
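
    For context, the core PyTorch primitive involved is torch.utils.checkpoint; a generic sketch, not the proposed contribution:

    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
    x = torch.randn(4, 512, requires_grad=True)

    # Activations inside `block` are recomputed during backward instead of
    # being stored, trading extra compute for lower peak memory.
    y = checkpoint(block, x)
    y.sum().backward()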

    opened by ChawDoe 2
Owner

Cade Gordon