Simple image captioning model - CLIP prefix captioning.


CLIP prefix captioning.


🥳 New: 🥳 Integrated into Hugging Face Spaces with Gradio. See the demo on Hugging Face Spaces.

🥳 New: 🥳 Run it in the browser using UI


Image captioning is a complicated task in which a pretrained detection network is usually used, requiring additional supervision in the form of object annotations. The features of the detected objects are then fed to an additional network that is trained to output the correct caption. We present a new approach that does not require additional information (i.e., it requires only images and captions) and can therefore be applied to any data. In addition, our model trains much faster than similar methods while achieving close to state-of-the-art results, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the CLIP model, which was already trained on an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences, we fine-tune a pretrained language model, an approach that has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by applying a simple Multi-Layer Perceptron (MLP) to the raw encoding, and then to fine-tune our language model to generate a valid caption.
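As a rough illustration of the prefix idea, the sketch below maps a single image embedding to a short sequence of prefix vectors with a two-layer MLP and prepends them to the caption token embeddings. It uses plain Python lists and toy sizes so that it runs without any dependencies. The dimensions, the `tanh` activation, the random weights, and every function name here are assumptions made for illustration, not the repository's actual code.

```python
import math
import random

def linear(x, weight, bias):
    """Compute y = W x + b for one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Create small random weights; a trained model would learn these."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def map_to_prefix(clip_embedding, layers, prefix_length, embed_dim):
    """Map one image embedding to prefix_length vectors of size embed_dim."""
    (w1, b1), (w2, b2) = layers
    hidden = [math.tanh(h) for h in linear(clip_embedding, w1, b1)]
    flat = linear(hidden, w2, b2)
    # Reshape the flat output into (prefix_length, embed_dim).
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_length)]

# Toy sizes (assumed); real embeddings are far wider.
CLIP_DIM, PREFIX_LENGTH, EMBED_DIM = 8, 3, 4
rng = random.Random(0)
hidden_dim = (PREFIX_LENGTH * EMBED_DIM) // 2
layers = [
    init_linear(CLIP_DIM, hidden_dim, rng),
    init_linear(hidden_dim, PREFIX_LENGTH * EMBED_DIM, rng),
]

clip_embedding = [rng.gauss(0, 1) for _ in range(CLIP_DIM)]
prefix = map_to_prefix(clip_embedding, layers, PREFIX_LENGTH, EMBED_DIM)

# Stand-in for the embeddings of five caption tokens.
caption_embeddings = [[0.0] * EMBED_DIM for _ in range(5)]

# The language model sees the prefix first, then the caption.
language_model_input = prefix + caption_embeddings
print(len(language_model_input), len(language_model_input[0]))  # prints: 8 4
```

The point of the sketch is only the shape bookkeeping: one image vector becomes several vectors in the language model's embedding space, and the caption is generated conditioned on them.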

COCO Examples

A couple of people standing next to an elephant.
A wooden table sitting in front of a window.
A bunch of bananas sitting on top of a table.
A woman holding a plate with a piece of cake in front of her face.
A wooden table topped with lots of wooden utensils.
A red motorcycle parked on top of a dirt field.

Conceptual Captions Examples

3D render of a man holding a globe.
Students enjoing the cherry blossoms
Green leaf of lettuce on a white plate.
The hotel and casino on the waterfront.
The triangle is a symbol of the soul.
Cartoon boy in the bath.

Inference Notebooks

To help visualize the results we provide a Colab notebook found in notebooks/clip_prefix_captioning_inference.ipynb.
The notebook will download the pretrained models and run inference on sample images or on images of your choosing. It is recommended to run this in Google Colab. Both COCO and Conceptual Captions pretrained models are available.

Inference GUI

Run it in the browser using UI.

COCO training

Clone, create environment and install dependencies:

git clone && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption

Download train_captions to data/coco/annotations.

Download training images and validation images and unzip (we use the Karpathy et al. split).

Extract CLIP features using (output is data/coco/oscar_split_train.pkl):



python --data ./data/coco/oscar_split_train.pkl --out_dir ./coco_train/

Quantitative results

COCO dataset

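Method | B@1 | B@2 | B@3 | B@4 | METEOR | ROUGE-L | CIDEr | SPICE
-- | -- | -- | -- | -- | -- | -- | -- | --
Oscar* | 75.59 | 60.09 | 46.89 | 36.58 | 30.40 | 58.56 | 124.12 | 23.17
Ours | 74.12 | 57.40 | 43.11 | 32.15 | 27.10 | 55.02 | 108.35 | 20.12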

* uses additional object annotations for training.

Conceptual Captions dataset

Method | ROUGE-L | CIDEr | SPICE
-- | -- | -- | --
VLP | 24.35 | 77.57 | 16.59
Ours | 26.71 | 87.26 | 18.5


This project was created by Ron Mokady and Amir Hertz for the Advanced-NLP course by Omer Levy @ TAU. This repository is heavily based on the CLIP and Hugging Face repositories. For training we used data from the COCO dataset and Conceptual Captions. The project was also inspired by this paper.


For any inquiry, please contact us by email.

  • Reproducing validation results



    Thanks for the great work! I was interested in reproducing the transformer network with frozen GPT-2, and achieved slightly lower performance on COCO so far:

    Metric | reported | reproduced
    -- | -- | --
    Bleu@4 | 33.53 | 31.0
    METEOR | 27.45 | 27.1
    CIDEr | 113.08 | 105.7
    SPICE | 21.05 | 20.4

    I was wondering if the provided code should be able to reproduce the validation scores or if I am missing something?

    opened by JHLee0513 7
  • A question about the x dimension.

    When I train only the transformer mapping network, I find that the dimension of x is (40, 512), but prefix_dim = 640. I don't know why this is happening. Is it caused by the extraction of CLIP features? Hope to get your help, thank you.

    opened by tianjunyu0871 7
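
One possible cause of the mismatch described above, offered as an assumption rather than a confirmed diagnosis: the feature width depends on which CLIP backbone extracted the features. OpenAI's ViT-B/32 checkpoint produces 512-dimensional embeddings and RN50x4 produces 640-dimensional ones, so features of shape (40, 512) would not fit a mapping network built with prefix_dim = 640. The helper below is hypothetical and not part of the repository; it only compares the last feature dimension against the configured width.

```python
# Embedding widths of two OpenAI CLIP checkpoints.
CLIP_EMBED_DIM = {"ViT-B/32": 512, "RN50x4": 640}

def diagnose_prefix_dim(feature_shape, prefix_dim):
    """Describe whether extracted features match the mapping network's input width."""
    feature_dim = feature_shape[-1]
    if feature_dim == prefix_dim:
        return "ok: features match prefix_dim"
    # List the backbones consistent with each side of the mismatch.
    from_features = [name for name, dim in CLIP_EMBED_DIM.items() if dim == feature_dim]
    from_config = [name for name, dim in CLIP_EMBED_DIM.items() if dim == prefix_dim]
    return (
        f"mismatch: features have width {feature_dim} (consistent with {from_features}), "
        f"but prefix_dim={prefix_dim} (consistent with {from_config})"
    )

print(diagnose_prefix_dim((40, 512), 640))
```

If the check reports a mismatch, the features and the training configuration were most likely produced for different backbones, and one of the two needs to change.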
  • Training on Conceptual Captions

    Are there any important changes that need to be made to the file to train a model with the Conceptual Captions 3M dataset?

    I've been attempting to train a model myself using the author's recommended settings for training an MLP with GPT2 fine-tuning. Hyperparameters such as learning rate, batch size, and number of layers are all the same as the author's. My only changes have been upgrading CLIP to ViT_B/16 and GPT2 to gpt2-medium.

    Using the script I am able to get my training to run and the reported loss is decreasing through the epochs. The problem comes when I run inference. It always outputs "a model walks the runway at the fashion show during event." no matter what image I give it.

    Any guidance would be appreciated.

    opened by ryntwn88 6
  • Which CLIP model?

    Hi, can you tell me whether it is ViT-B/32 that you used to obtain the results reported in the paper? Did you also run experiments with a ResNet-based CLIP model, and do you have any observations on which works better? Thanks!

    opened by YovaKem 5
  • Final token index when training a model

    Hi, thanks a lot for making the code available, it's a great resource to use!

    I was wondering why the index when computing the loss from the output of gpt is shifted by 1 on the left:

    shouldn't it be logits = outputs.logits[:, dataset.prefix_length :]


    opened by robertodessi 4
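
A general note on the indexing question above, stated as an assumption about causal language models rather than as the authors' answer: the output at position i scores the token at position i + 1. With a prefix of length P followed by T caption tokens, the logits that predict the caption therefore sit at positions P - 1 through P + T - 2, which corresponds to a slice that starts one step before the caption and drops the final position. The toy below uses hypothetical names and plain lists to show the alignment.

```python
def caption_logit_positions(prefix_length, num_caption_tokens):
    """Output positions whose logits predict each caption token."""
    total = prefix_length + num_caption_tokens
    # The output at position i predicts the input at position i + 1, so the
    # first caption token (input position prefix_length) is predicted from
    # position prefix_length - 1, and the last output predicts nothing we have.
    return list(range(total))[prefix_length - 1:-1]

prefix_length = 3
caption = ["a", "red", "bus"]
inputs = ["<p0>", "<p1>", "<p2>"] + caption

positions = caption_logit_positions(prefix_length, len(caption))
for position, target in zip(positions, caption):
    print(f"output at position {position} (input {inputs[position]!r}) predicts {target!r}")
```

Starting the slice at `prefix_length` instead would pair each caption token with logits meant for the token after it.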
  • The downloaded pretrained weights cannot be imported correctly using the jupyter notebook.

    I have downloaded the pretrained weights from Google Drive, but something seems to be wrong with them. They fail to load because of a mismatch!

    How can I fix this?


    opened by cenjinglun 4
  • Evaluation Code

    Hello, great project :)

    I wanted to try some stuff with your code, and wanted to evaluate the results with the same metrics you used on COCO or Conceptual Captions. Is it possible for you to share the evaluation code for these metrics? And what split of the dataset did you check those against?


    opened by AskingStuff 4
  • Issues When Training with WebDatasets/Larger GPT2 Model

    I have attempted to train with the gpt2-xl model from Hugging Face as well as a custom dataloader that can preprocess webdataset archives, but am having issues when it comes to inference.

    I have created a small script to test inference on my fork, but it seems to repeatedly generate the same unrelated caption for any input (below are some samples from the RedCaps dataset):


    I'm not too sure what has gone wrong, but I believe it's either down to my dataloader not preprocessing the dataset in the correct way, or it's down to the model not supporting GPT2-xl.

    • The preprocessing script is at and is modified from rom1504/clip-retrieval's inference script. It does more of the transformations usually seen in beforehand to optimise the training speed, and generates many .npy files as a result.
    • These files are loaded with the custom dataloader on

    Do you know what might be causing this?

    Thanks in advance

    opened by TheoCoombes 3
  • Incorrect use of labels for GPT2?



    As far as I understand the model and the usage of GPT2, shouldn't the get_dummy_token function return torch.ones() * -100 instead of torch.zeros()? This is because we should be ignoring the outputs of GPT2 for these prefix inputs. Currently, it's forcing the model to predict token 0 which is the exclamation mark ("!").

    Reference lines:


    opened by soumyasanyal 2
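
For context on the -100 convention mentioned above: PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to -100, and Hugging Face language-model heads follow the same convention for `labels`. The pure-Python sketch below imitates that behaviour on toy numbers to show how dummy labels of 0 change the loss compared with masked labels. The function and the data are illustrative assumptions. Whether this affects training in practice depends on whether labels are actually passed to the model, which the thread does not settle.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_normalizer = math.log(sum(math.exp(value) for value in row))
        losses.append(log_normalizer - row[label])
    return sum(losses) / len(losses)

# Two prefix positions followed by two caption tokens, vocabulary of size 3.
logits = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 2.0, 0.0],
]
zero_labels = [0, 0, 2, 1]          # dummy prefix labels of 0 count as real targets
masked_labels = [-100, -100, 2, 1]  # prefix positions are skipped

loss_zero = cross_entropy(logits, zero_labels)
loss_masked = cross_entropy(logits, masked_labels)
print(f"zeros as labels: {loss_zero:.4f}, masked labels: {loss_masked:.4f}")
```

With zeros as labels, the second prefix position is penalised for not predicting token 0, so the averaged loss is higher than when the prefix is masked.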
  • LABEL LEAKAGE During Training

    In the code, you refer to the validation set to get the training samples. Also, in your file, there are validation captions (labels) in the training dataset, which results in obvious label leakage since the model will be trained on the validation samples. In fact, if we train the model for more than 10 epochs (e.g. 100 epochs), the results on the validation set can reach unreasonably high scores. Can you explain this?

    opened by ghost 2
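
A quick way to test for the kind of overlap described above is to compare image identifiers across the two splits. The sketch below assumes each sample is a dict with an `image_id` key; that record layout, the function names, and the sample data are hypothetical.

```python
def find_leaked_ids(train_samples, val_samples, key="image_id"):
    """Return the sorted identifiers that appear in both splits."""
    train_ids = {sample[key] for sample in train_samples}
    val_ids = {sample[key] for sample in val_samples}
    return sorted(train_ids & val_ids)

def drop_leaked(train_samples, val_samples, key="image_id"):
    """Return the training samples whose identifier is absent from validation."""
    val_ids = {sample[key] for sample in val_samples}
    return [sample for sample in train_samples if sample[key] not in val_ids]

# Hypothetical records for illustration only.
train = [
    {"image_id": 1, "caption": "a dog on a beach"},
    {"image_id": 2, "caption": "a city at night"},
    {"image_id": 3, "caption": "a bowl of fruit"},
]
val = [
    {"image_id": 3, "caption": "fruit in a bowl"},
    {"image_id": 4, "caption": "a train at a station"},
]

leaked = find_leaked_ids(train, val)
clean_train = drop_leaked(train, val)
print("leaked ids:", leaked, "| training samples kept:", len(clean_train))
```

An empty result from `find_leaked_ids` is a necessary condition for a clean split, though it does not rule out near-duplicate images stored under different identifiers.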
  • Reproducing loss

    Hi @rmokady, what a clever approach!

    I'm trying this approach on my custom dataset and managed to get it to start training. I'm figuring out a way to add evaluation code to better manage the training, but in the meantime, I wonder what your loss value is when you stop training the model in each mode: train only the prefix, and train both the prefix and GPT?

    opened by Luvata 2
  • Parsing conceptual caption does not function properly as it removes some images and replaces them with zero tensor.

    Hi, I tried to re-train your model on the Conceptual Captions dataset following this part in your readme.

    I found a series of problems when parsing the Conceptual Captions dataset.

    1. In this line, it is believed that if an exception occurred while loading an image, the image variable will be initialized to 0; this occurs when an image file is malformed.

    2. I also found that if I use the provided script to download images, most of the downloaded images become corrupted.

    It may result in a model with undesirable behavior if it receives many zero images and maps it to a caption. I would greatly appreciate it if you could inform me if you are experiencing the same issue or if it is just mine.


    opened by mmderakhshani 0
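
One way to avoid training on placeholder images, sketched under assumptions: skip any record whose image fails to load and keep a log of what was skipped, instead of substituting a zero tensor. The record layout, `load_valid_samples`, and the stand-in loader below are all hypothetical; a real pipeline would pass its own image-loading function.

```python
def load_valid_samples(records, load_image):
    """Load images, keeping only records that load; report the rest."""
    samples, skipped = [], []
    for record in records:
        try:
            image = load_image(record["path"])
        except Exception as error:  # e.g. a truncated or malformed file
            skipped.append((record["path"], str(error)))
            continue
        samples.append((image, record["caption"]))
    return samples, skipped

def fake_loader(path):
    """Stand-in for a real image loader; fails on paths marked as corrupt."""
    if "corrupt" in path:
        raise OSError(f"cannot identify image file {path!r}")
    return f"<pixels of {path}>"

# Hypothetical records for illustration only.
records = [
    {"path": "a.jpg", "caption": "a dog on a beach"},
    {"path": "corrupt_b.jpg", "caption": "a city at night"},
    {"path": "c.jpg", "caption": "a bowl of fruit"},
]

samples, skipped = load_valid_samples(records, fake_loader)
print(len(samples), "kept;", len(skipped), "skipped")  # prints: 2 kept; 1 skipped
```

Logging the skipped paths also makes it easy to measure how much of a download is corrupted before deciding whether to re-fetch it.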
  • model overfitting issue

    Dear authors, we have tested the COCO model with different frames, but it outputs the same caption for all of them. It seems to be an overfitting issue. Could you suggest whether any parameters need to be specified at inference time, or could it be a data issue in that the input frames are similar to some extent?

    opened by alchemz 0
  • AttributeError: module 'cog' has no attribute 'Predictor'


    Hello, I have a problem. When I run the file, I get this error: AttributeError: module 'cog' has no attribute 'Predictor'. I've tried several things and it doesn't work. How can I solve this problem? Thank you.

    opened by tianmianjiang 2
  • Upgrade to Cog version 0.1


    The new version of Cog improves the Python API, along with several other changes. Particularly pydantic is now used for Predictor and the previous version will be deprecated.

    This PR upgrades the Replicate demo and API to Cog version >= 0.1. I have already pushed this to Replicate, so you don't need to do anything for the demo to keep working :)

    opened by chenxwh 0
  • Question about "clip_length" and "prefix_length" difference

    Hello. I've been trying to distinguish between "prefix_length" and "clip_length". I understand that prefix_length is the learnable part that you attach to the GPT2 input, but not what clip_length is. In all your code I see that they are set to the same value. So, what is clip_length? Thank you!

    opened by YueyangLiulyy 0
  • it would be None

    I think if your labels are None you can create labels data. So it would be if labels is None

    opened by enes3774 0