A PyTorch Lightning solution to training OpenAI's CLIP from scratch.




Our aim is to create an easy-to-use Lightning implementation of OpenAI's CLIP training script. We want our end product to be as in line with the original paper as possible. We will live by:

CLIP Section Image


  • Get OpenAI's model creation script
  • Create model inits
    • ResNet50
    • ResNet50x4
    • ResNet101
    • ViT-B/32
    • all models
  • Create model wrapper
  • Create lightning trainer
  • Create dataset files
  • Performance boosts
    • Mixed-precision
    • Gradient checkpointing
    • Half-precision Adam statistics
    • Half-precision stochastically rounded text encoder weights
  • COCO-style DataLoader

    I would love to start training with this! I helped to write a Dataloader for the "COCO" format, i.e. images and text files containing line-separated captions. They are matched in the data loader via the unique basename of each file.


    Would it be possible to port that data loader to this project? It is perhaps of interest to some folks I know with some spare compute. Also personally useful to me, because I have converted a good deal of my collected datasets to this format already.


    opened by afiaka87 9
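The basename matching described in this issue can be sketched with the standard library alone. This is only an illustration, not the data loader from the issue author or from this repository: the function name `pair_by_basename`, the extension sets, and the choice to skip files without a partner are all assumptions.

```python
from pathlib import Path

# Assumed extensions; adjust to whatever the dataset actually contains.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
TEXT_EXTS = {".txt"}


def pair_by_basename(folder):
    """Return (image_path, captions) pairs matched on file stem, sorted by stem.

    Each text file holds line-separated captions; blank lines are dropped.
    Files without a partner of the other kind are skipped (an assumption).
    """
    images, texts = {}, {}
    for path in Path(folder).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in IMAGE_EXTS:
            images[path.stem] = path
        elif ext in TEXT_EXTS:
            texts[path.stem] = path

    pairs = []
    # Only stems present on both sides form a usable training pair.
    for stem in sorted(images.keys() & texts.keys()):
        lines = texts[stem].read_text(encoding="utf-8").splitlines()
        captions = [line.strip() for line in lines if line.strip()]
        if captions:
            pairs.append((images[stem], captions))
    return pairs
```

A dataset class built on top of this could load the image in `__getitem__` and pick one of the captions at random; that part is left out because it depends on the image and tokenizer libraries in use.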
  • Problem related to encoding text

    I am trying to use a resnet50 model that I created with this repo, but I can't encode text.

    with torch.no_grad():
        tmp = clip.tokenize("test")
        tmp = tmp.to(device)
        print(tmp)
        print(tmp.shape)
        text_encoded = model.model.encode_text(tmp)
    tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
    torch.Size([1, 77])
    ValueError                                Traceback (most recent call last)
    <ipython-input-18-68003eb3bebb> in <module>()
          9     print(tmp)
         10     print(tmp.shape)
    ---> 11     text_encoded = model.model.encode_text(tmp)
    2 frames
    /content/train-CLIP/models/model.py in encode_text(self, text)
        343         x = x + self.positional_embedding.type(self.dtype)
        344         x = x.permute(1, 0, 2)  # NLD -> LND
    --> 345         x = self.transformer(x)
        346         x = x.permute(1, 0, 2)  # LND -> NLD
        347         x = self.ln_final(x).type(self.dtype)
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    /usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
        937         elif input_ids is not None:
        938             input_shape = input_ids.size()
    --> 939             batch_size, seq_length = input_shape
        940         elif inputs_embeds is not None:
        941             input_shape = inputs_embeds.size()[:-1]
    ValueError: too many values to unpack (expected 2)

    Printing x before self.transformer(x) results in torch.Size([77, 1, 512]).

    The input shape torch.Size([1, 77]) does match the original clip code and the model loaded with clip seems to work without major problems.

    import torch
    import clip
    from PIL import Image
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
    image = preprocess(Image.open("/test.png")).unsqueeze(0).to(device)
    text = clip.tokenize(["test"]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    tensor([[49406,  1628, 49407,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
                 0,     0,     0,     0,     0,     0,     0]], device='cuda:0')
    torch.Size([1, 77])

    Not sure what I am doing wrong, since encoding images does seem to work fine with this repo.

    with torch.no_grad():
        photos_features = model.model.encode_image(image)
        photos_features /= photos_features.norm(dim=-1, keepdim=True)
    torch.Size([1, 768])
    opened by styler00dollar 6
  • Multi GPU training

    Thanks for sharing the code.

    I am not familiar with Lightning. It seems that the code supports multi-GPU training (https://github.com/Zasder3/train-CLIP/blob/8d454de1999af4be93b6d99b956e83f005ff70dd/models/wrapper.py#L43), but I am not sure how to start a multi-GPU run.

    Besides, just to confirm, the code does not initialize the weights using the pretrained model?

    opened by niatzt 5
  • NotImplementedError: `train_dataloader` must be implemented to be used with the Lightning Trainer?

    Traceback (most recent call last):
      File "train_finetune.py", line 39, in <module>
        main(args)
      File "train_finetune.py", line 29, in main
        trainer.fit(model, dm)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
        self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
        self._run(model, ckpt_path=ckpt_path)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1145, in _run
        self.accelerator.setup(self)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu.py", line 46, in setup
        return super().setup(trainer)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 93, in setup
        self.setup_optimizers(trainer)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 355, in setup_optimizers
        trainer=trainer, model=self.lightning_module
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 245, in init_optimizers
        return trainer.init_optimizers(model)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 35, in init_optimizers
        optim_conf = self.call_hook("configure_optimizers", pl_module=pl_module)
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
        output = model_fx(*args, **kwargs)
      File "/workspace/cpfs-data/workspace_pytorch/hclip/train-CLIP/models/wrapper.py", line 337, in configure_optimizers
        first_cycle_steps=self.num_training_steps,
      File "/workspace/cpfs-data/workspace_pytorch/hclip/train-CLIP/models/wrapper.py", line 38, in num_training_steps
        dataset = self.train_dataloader()
      File "/workspace/cpfs-data/miniconda3/envs/tensorflow/lib/python3.7/site-packages/pytorch_lightning/core/hooks.py", line 477, in train_dataloader
        raise NotImplementedError("train_dataloader must be implemented to be used with the Lightning Trainer")
    NotImplementedError: train_dataloader must be implemented to be used with the Lightning Trainer

    opened by zhouwei5113 3
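The traceback ends in `num_training_steps`, which calls `self.train_dataloader()` on the model even though the data was passed to `trainer.fit(model, dm)` as a separate datamodule. The step count that the scheduler needs is plain arithmetic and can be sketched without Lightning. The function name `estimate_training_steps`, its parameters, and the round-up behaviour are assumptions for illustration; this is not the code in `models/wrapper.py`.

```python
import math


def estimate_training_steps(num_samples, batch_size, num_devices=1,
                            accumulate_grad_batches=1, max_epochs=1,
                            max_steps=None):
    """Estimate the number of optimizer steps for an LR schedule.

    Hypothetical helper: an explicit positive max_steps wins; otherwise the
    count is derived from the dataset size and the effective batch size.
    """
    if max_steps is not None and max_steps > 0:
        return max_steps
    if min(batch_size, num_devices, accumulate_grad_batches) < 1:
        raise ValueError("batch_size, num_devices and "
                         "accumulate_grad_batches must all be >= 1")

    # Samples consumed per optimizer step across all devices.
    effective_batch_size = batch_size * num_devices * accumulate_grad_batches
    # Assumption: a partial final batch still counts as a step (no drop_last).
    steps_per_epoch = math.ceil(num_samples / effective_batch_size)
    return steps_per_epoch * max_epochs


steps = estimate_training_steps(10_000, 512, num_devices=2, max_epochs=3)
```

Inside a LightningModule the inputs would have to be read from wherever the loader actually lives, for example the datamodule handed to `trainer.fit`, rather than from a `train_dataloader` hook the model does not define.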
  • How to load a provided CLIP pre-trained model in your code?

    Hi, thanks for sharing. The code is neat and easy to follow. I have one question about fine-tuning a pre-trained CLIP. I notice that train_finetune.py does not load a pre-trained CLIP model directly; instead it uses a separately defined image encoder and text encoder. If I want to fine-tune a specific pre-trained CLIP model such as "ViT-B/32", how can I properly load its image encoder and text encoder? Thank you for your answer.

    opened by zmykevin 3
  • Sinkhorn motivation

    Hi, thanks for the awesome implementation!

    I have a question about the Sinkhorn objective used in CustomWrapper. Is it motivated from here? It would be great if you could say a bit more about it.


    opened by vasudev13 3
  • model checkpointing

    Hey, thank you for the Lightning implementation, just what I needed at the moment! However, I'm a little confused about model checkpointing. I assumed it would automatically save checkpoints to lightning_logs/checkpoints/, but after a full training run I didn't find anything saved in the checkpoints folder. I'm taking a deeper look into the repo, and at first glance I can see you didn't override that hook. I'm guessing the default checkpointing hook would not work since this is self-distillation (I'm using train_finetune.py, by the way). Let me know if this is not the expected behaviour.

    opened by sour4bh 3
  • what's the meaning of minibatch_size?

    Thank you for your CLIP training code! That's great!

    Training with your new commit 8d454de, I get the following error: RuntimeError: The expanded size of the tensor (0) must match the existing size (8) at non-singleton dimension 0. Target sizes: [0, 1024]. Tensor sizes: [8, 1024]

    The failing line is:

    images_tmp[self.global_rank][j*self.minibatch_size:(j+1)*self.minibatch_size] = F.normalize(self.model.encode_image(mb), dim=1)

    Here minibatch_size = 0. Would you please explain the meaning of minibatch_size? How should it be used?

    opened by firestonelib 3
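The error in this issue is consistent with an empty slice: with `minibatch_size = 0`, the range `j*0:(j+1)*0` selects zero rows, so an `[8, 1024]` result cannot be written into a `[0, 1024]` target. The plain-Python sketch below reproduces the slicing pattern on a list and rejects a non-positive size up front. The helper name `split_into_minibatches` is hypothetical, and reading `minibatch_size` as "how many samples of the batch are encoded at once" is an assumption drawn from the quoted line, not a statement about the repository's intent.

```python
def split_into_minibatches(batch, minibatch_size):
    """Split a batch into consecutive chunks of at most minibatch_size items."""
    if minibatch_size < 1:
        # A size of 0 would make every slice empty, as in the reported error.
        raise ValueError("minibatch_size must be >= 1")
    return [batch[i:i + minibatch_size]
            for i in range(0, len(batch), minibatch_size)]


batch = list(range(8))

# Same index arithmetic as the quoted line, with j = 0 and minibatch_size = 0.
j, bad_size = 0, 0
empty = batch[j * bad_size:(j + 1) * bad_size]

# A positive size gives non-empty chunks that cover the whole batch.
chunks = split_into_minibatches(batch, 3)
```

Under this reading, a sensible value is any positive number no larger than the per-device batch size.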
  • Error occurs when using DeepSpeed

    Hi @Zasder3, thank you for the great work!

    I was wondering whether you have tried DeepSpeed, because I saw this commit log (DeepSpeed Optimizer indexing). When I tried DeepSpeed by adding --plugins deepspeed_stage_2, I got the errors below.

    Traceback (most recent call last):
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
        batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 743, in run_training_batch
        self._curr_step_result = self.training_step(
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
        training_step_output = self.trainer.accelerator.training_step(args)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
        return self.training_type_plugin.training_step(*args)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
        return self.model(*args, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _c
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in f
        loss = self.module(*inputs, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _c
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
        return super().forward(*inputs, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
        output = self.module.training_step(*inputs, **kwargs)
      File "/home/shared/workspace/multimodal-matching/multimodal-matching/train-CLIP/models/wrapper.py", line 106, in training_step
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1252, in manual_backward
        self.trainer.train_loop.backward(loss, optimizer=None, opt_idx=None, *args, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 867, in backward
        self.trainer.accelerator.backward(result, optimizer, opt_idx, should_accumulate, *args, **kwargs)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 306, in backward
        self.training_type_plugin.pre_backward(closure_loss, should_accumulate, optimizer, optimizer_idx)
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 311, in pre_backward
        if not self.lightning_module.automatic_optimization and self.model.require_backward_grad_sync:
      File "/opt/conda/envs/clip_cuda11.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __
        raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'DeepSpeedEngine' object has no attribute 'require_backward_grad_sync'

    The error occurs in the line below, where self.automatic_optimization = False is set. https://github.com/Zasder3/train-CLIP/blob/ab1c59359a8e729fe05fd99aecdddf1eb9f43843/models/wrapper.py#L81

    I could get DeepSpeed running by setting self.automatic_optimization = True and removing self.manual_backward(loss). (It still needs some debugging, because the training pattern changes.)

    My working environment is pytorch=1.9, cuda=11.1, pytorch-lightning=1.3.8. Thanks in advance!

    opened by kobiso 2
  • About the accuracy computation.

    Thanks for sharing your code. I am a little confused about the accuracy computation at L69-70 of wrapper.py:

    acc_i = (torch.argmax(image_logits) == ground_truth).sum()
    acc_t = (torch.argmax(image_logits.t()) == ground_truth).sum()

    It seems that torch.argmax without a dim argument returns the index of the maximum over the flattened tensor, while ground_truth holds one target per row or column. Should we change this to the following?

    acc_i = (torch.argmax(image_logits, 0) == ground_truth).sum()
    acc_t = (torch.argmax(image_logits.t(), 0) == ground_truth).sum()


    opened by rookiecm 2
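The difference raised in the accuracy issue can be checked on a small matrix. The sketch below uses nested lists instead of tensors so that it runs without PyTorch; `argmax_flat` and `argmax_along` are hypothetical helpers that mimic `torch.argmax` without and with a `dim` argument, and the 3x3 logits are made-up values. Which axis corresponds to the image accuracy and which to the text accuracy depends on how `image_logits` is laid out, so that choice is left open here.

```python
def argmax_flat(matrix):
    """Index of the maximum over the flattened matrix (like argmax with no dim)."""
    flat = [value for row in matrix for value in row]
    return max(range(len(flat)), key=flat.__getitem__)


def argmax_along(matrix, dim):
    """Per-column (dim=0) or per-row (dim=1) argmax of a 2D nested list."""
    lines = matrix if dim == 1 else list(zip(*matrix))
    return [max(range(len(line)), key=line.__getitem__) for line in lines]


# Made-up similarity logits; the matching pairs sit on the diagonal.
logits = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.8, 0.3, 0.5],
]
ground_truth = [0, 1, 2]

flat_index = argmax_flat(logits)   # a single index into the flattened matrix
per_row = argmax_along(logits, 1)  # one prediction per row
per_col = argmax_along(logits, 0)  # one prediction per column

# Comparing one flat index against every target counts at most one match.
flat_correct = sum(flat_index == target for target in ground_truth)
row_correct = sum(p == t for p, t in zip(per_row, ground_truth))
col_correct = sum(p == t for p, t in zip(per_col, ground_truth))
```

With these values the flat comparison counts a single match no matter how many rows are right, while the per-row and per-column versions count each example separately.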
  • Image encoder

    Is it possible to use a pre-trained image model from Hugging Face when trying to fine-tune? The latest models are usually there, so it would be pretty cool if it was compatible.

    opened by typercast 1
  • MisconfigurationException: `train_dataloader` must be implemented to be used with the Lightning Trainer

    I am trying to train a model using the following command:

    python train.py --model_name RN50 --folder ArchDaily --batch_size 512 --accelerator cuda

    I get the above error:

    File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/hooks.py", line 485, in train_dataloader
      raise MisconfigurationException("train_dataloader must be implemented to be used with the Lightning Trainer")
    pytorch_lightning.utilities.exceptions.MisconfigurationException: train_dataloader must be implemented to be used with the Lightning Trainer

    I would be grateful for any assistance.

    Here is the full message log:

    Using 16bit native Automatic Mixed Precision (AMP)
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    /usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/configuration_validator.py:119: PossibleUserWarning: You defined a validation_step but have no val_dataloader. Skipping val loop.
      category=PossibleUserWarning,
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    Traceback (most recent call last):
      File "train.py", line 31, in <module>
        main(args)
      File "train.py", line 20, in main
        trainer.fit(model, dm)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 701, in fit
        self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 654, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 741, in _fit_impl
        results = self._run(model, ckpt_path=self.ckpt_path)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
        self.strategy.setup(self)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/single_device.py", line 74, in setup
        super().setup(trainer)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/strategy.py", line 153, in setup
        self.setup_optimizers(trainer)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/strategy.py", line 142, in setup_optimizers
        self.lightning_module
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/optimizer.py", line 179, in _init_optimizers_and_lr_schedulers
        optim_conf = model.trainer._call_lightning_module_hook("configure_optimizers", pl_module=model)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1549, in _call_lightning_module_hook
        output = fn(*args, **kwargs)
      File "/content/train-CLIP/models/wrapper.py", line 146, in configure_optimizers
        first_cycle_steps=self.num_training_steps,
      File "/content/train-CLIP/models/wrapper.py", line 38, in num_training_steps
        dataset = self.train_dataloader()
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/hooks.py", line 485, in train_dataloader
        raise MisconfigurationException("train_dataloader must be implemented to be used with the Lightning Trainer")
    pytorch_lightning.utilities.exceptions.MisconfigurationException: train_dataloader must be implemented to be used with the Lightning Trainer

    opened by antitheos 2
  • Type Error while training

    File "/home/rishabh/Rishabclip/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2430, in __call__
      "text input must of type str (single example), List[str] (batch or single pretokenized example) "
    ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

    Could you tell me which versions of the tokenizer and torch your code uses?

    opened by rishav1122 0
  • manual_backward + fp16 training doesn't converge

    Hi, I borrowed some snippets from your codebase for the distributed GPU and minibatch-within-batch training in my own project. However, I found that training using manual_backward() + FP16 does not converge at all. If I switch to FP32, training works without any other code modifications. I'm using the latest pytorch-lightning v1.6.3. I wonder if you have observed similar issues?

    opened by LinxiFan 1
  • custom tokenizer and text encoder

    I want to use a custom tokenizer and text encoder, with the tokenizer trained using Hugging Face tokenizers.

    After training the Hugging Face tokenizer, I got a JSON file containing the vocabulary.

    However, I don't know how to feed this custom tokenizer into train_finetune.py.

    Could you give some guidance on how to set up and use a custom tokenizer?

    opened by sinjohr 1
Cade Gordon
