An open source implementation of CLIP.

Last update: Dec 31, 2022

Overview

OpenCLIP

Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. Specifically, a ResNet-50 model trained with our codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet. OpenAI's CLIP model reaches 31.3% when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the Conceptual Captions dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy.

As we describe in more detail below, CLIP models in a medium accuracy regime already allow us to draw conclusions about the robustness of larger CLIP models since the models follow reliable scaling laws.

This codebase is a work in progress, and we invite all to contribute to making it more accessible and useful. In the future, we plan to add support for TPU training and release larger models. We hope this codebase facilitates and promotes further research in contrastive image-text learning.

Note that src/clip is a copy of OpenAI's official repository with minimal changes.

Data

Conceptual Captions

OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as an argument to main.py.
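
As a rough illustration of the two-column layout described above, the sketch below reads image-path and caption pairs with Python's standard csv module. The column names filepath and title match the sample training command in this README; the comma delimiter, the sample rows, and the read_pairs helper are assumptions made for illustration and are not part of the codebase, so check which separator the data loader expects before relying on it.

```python
import csv
import io


def read_pairs(csv_file, img_key, caption_key, delimiter=","):
    """Return (image_path, caption) tuples from a CSV with a header row.

    The delimiter is an assumption; the real loader may expect another one.
    """
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    return [(row[img_key], row[caption_key]) for row in reader]


# Hypothetical file contents: one header row, then one row per example.
sample = io.StringIO(
    "filepath,title\n"
    "images/0001.jpg,a dog on a beach\n"
    "images/0002.jpg,two cats\n"
)
pairs = read_pairs(sample, img_key="filepath", caption_key="title")
```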

The script src/data/gather_cc.py will collect the Conceptual Captions images. First, download the Conceptual Captions URLs and then run the script from our repository:

python3 src/data/gather_cc.py path/to/Train_GCC-training.tsv path/to/Validation_GCC-1.1.0-Validation.tsv

Our training set contains 2.89M images, and our validation set contains 13K images.

YFCC and other datasets

In addition to specifying the training data via CSV files as mentioned above, our codebase also supports webdataset, which is recommended for larger scale datasets. The expected format is a series of .tar files. Each of these .tar files should contain two files for each training example, one for the image and one for the corresponding text. Both files should have the same name but different extensions. For instance, shard_001.tar could contain files such as abc.jpg and abc.txt. You can learn more about webdataset at https://github.com/webdataset/webdataset. We use .tar files with 1,000 data points each, which we create using tarp.
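
The shard layout described above can be sketched with Python's standard tarfile module. This is a minimal illustration under stated assumptions, not the tarp-based pipeline mentioned in the text: write_shard and list_shard are hypothetical helpers, and the image bytes are a placeholder rather than a real JPEG.

```python
import io
import tarfile


def write_shard(shard_path, examples):
    """Write (key, image_bytes, caption) triples into one .tar shard.

    Each example becomes two members that share a name and differ only
    in extension, e.g. abc.jpg and abc.txt.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, caption in examples:
            members = ((".jpg", image_bytes), (".txt", caption.encode("utf-8")))
            for suffix, payload in members:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def list_shard(shard_path):
    """Return the member names stored in a shard, in insertion order."""
    with tarfile.open(shard_path, "r") as tar:
        return tar.getnames()
```

Writing the image and its caption back to back keeps each pair adjacent inside the archive, which is what lets a streaming reader group them by shared name.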

You can download the YFCC dataset from Multimedia Commons. Similar to OpenAI, we used a subset of YFCC to reach the aforementioned accuracy numbers. The indices of images in this subset are in OpenAI's CLIP repository.

Training CLIP

Install dependencies

conda env create -f environment.yml
source activate open_clip

Add directory to pythonpath:

cd open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"

Sample running code:

nohup python -u src/training/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv"  \
    --val-data="/path/to/validation_data.csv"  \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50

Note: imagenet-val is the path to the validation set of ImageNet for zero-shot evaluation, not the training set! You can remove this argument if you do not want to perform zero-shot evaluation on ImageNet throughout training. Note that the val folder should contain subfolders. If it does not, please use this script.

When run on a machine with 8 GPUs the command should produce the following training curve for Conceptual Captions:

More detailed curves for Conceptual Captions are given at /docs/clip_conceptual_captions.md.

When training a RN50 on YFCC the same hyperparameters as above are used, with the exception of lr=5e-4 and epochs=32.

To use another model, such as ViT-B/32, RN50x4, RN50x16, or ViT-B/16, pass its name with --model, for example --model RN50x4.

Launch tensorboard:

tensorboard --logdir=logs/tensorboard/ --port=7777

Sample resuming from a checkpoint:

python src/training/main.py \
    --train-data="/path/to/train_data.csv" \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

Sample evaluation only:

python src/training/main.py \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

Trained models

You can find our ResNet-50 trained on YFCC-15M here.

Scaling trends

The plot below shows how zero-shot performance of CLIP models varies as we scale the number of samples used for training. Zero-shot performance increases steadily for both ImageNet and ImageNetV2, and is far from saturated at ~15M samples.

Why are low-accuracy CLIP models interesting?

TL;DR: CLIP models have high effective robustness, even at small scales.

CLIP models are particularly intriguing because they are more robust to natural distribution shifts (see Section 3.3 in the CLIP paper). This phenomenon is illustrated by the figure below, with ImageNet accuracy on the x-axis and ImageNetV2 (a reproduction of the ImageNet validation set with distribution shift) accuracy on the y-axis. Standard training denotes training on the ImageNet train set and the CLIP zero-shot models are shown as stars.

As observed by Taori et al., 2020 and Miller et al., 2021, the in-distribution and out-of-distribution accuracies of models trained on ImageNet follow a predictable linear trend (the red line in the above plot). Effective robustness quantifies robustness as accuracy beyond this baseline, i.e., how far a model lies above the red line. Ideally a model would not suffer from distribution shift and fall on the y = x line (trained human labelers are within a percentage point of the y = x line).
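
To make the definition concrete, here is a minimal sketch that fits a baseline line to standard models and measures how far another model lies above it. All accuracy values are hypothetical, fit_line and effective_robustness are illustrative helpers, and the fit is done on raw percentages for simplicity; the cited papers may fit the trend on rescaled axes.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def effective_robustness(slope, intercept, in_dist_acc, out_dist_acc):
    """Accuracy under shift minus the accuracy the baseline trend predicts."""
    return out_dist_acc - (slope * in_dist_acc + intercept)


# Hypothetical (ImageNet, ImageNetV2) accuracies of standard-trained models.
baseline_imagenet = [50.0, 60.0, 70.0, 80.0]
baseline_imagenet_v2 = [35.0, 43.0, 51.0, 59.0]
slope, intercept = fit_line(baseline_imagenet, baseline_imagenet_v2)

# Hypothetical zero-shot model: 30% on ImageNet, 25% on ImageNetV2.
gain = effective_robustness(slope, intercept, 30.0, 25.0)
```

A positive value means the model sits above the baseline trend; a model exactly on the line has zero effective robustness regardless of its absolute accuracy.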

Even though the CLIP models trained with this codebase achieve much lower accuracy than those trained by OpenAI, our models still lie on the same trend of improved effective robustness (the purple line). Therefore, we can study what makes CLIP robust without requiring industrial-scale compute.

For more information on effective robustness, please see the papers by Taori et al., 2020 and Miller et al., 2021 cited above.

The Team

We are a group of researchers at UW, Google, Stanford, Amazon, Columbia, and Berkeley.

Gabriel Ilharco*, Mitchell Wortsman*, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi, Ludwig Schmidt

Special thanks to Jong Wook Kim and Alec Radford for help with reproducing CLIP!

Citing

If you found this repository useful, please consider citing:

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

Comments

adding CoCa

The PR idea is to add the CoCa model as implemented in https://github.com/lucidrains/CoCa-pytorch, using existing parts as much as possible.

Ideally, it would also add the possibility of choosing between custom and non-custom attention implementations, as is done for CLIP.

opened by gpucce 76

Inference testing using random data

I have started to work on integration tests with random data. This test runs on all pretrained models at fp32 with JIT True/False where applicable.

Related issue: #198

[x] Inference testing on pre-made input and gt

[x] all models as listed by list_models()

[x] Image

[x] Text

[x] Random test data generator

[x] Random image data in PIL format

[x] Random text data

[x] Determine best way to store and recall test data: ~4.6MB for all models with 1 sample per config

[ ] Parallelize tests, unlikely due to RAM constraints

To generate test data:

python tests/util_test.py --all

populates the tests/data folder with one torch pt per model config, to be used by the test.
opened by lopho 39
Text Tower Refactor

A refactor of #178 that will keep backwards compat for existing weights/models but provide a different base model for new models using custom text towers...

opened by rwightman 22
How to obtain logits (and probabilities) for 0-shot classification of single classes

First of all, thanks for the amazing work going into this repo! In the case where we want to return the probability of the presence of 1 class (e.g. "dog") in a set of images, how would we go about it? While (100.0 * image_features @ text_features.T).softmax(dim=-1) provides well-calibrated probabilities in the multi-class setting, (100.0 * image_features @ text_features.T).sigmoid() does not when we return the logits of only 1 class and have no other classes to compute the softmax against. From logits = np.dot(I_e, T_e.T) * np.exp(t) in Figure 3 of the CLIP paper, it would have to follow that t=4.6... given np.exp(t)=100 from the usage snippet in the README, is this correct (Edit: indeed model.logit_scale confirms this)? And wouldn't this be surprisingly consistent across architectures/training runs? I believe the OpenAI implementation initialises t=1/.07 leading to an initial scaling factor of approximately 14.29. This is then trained of course (link to code). Alternatively, could I try sampling a random, normalised vector as text_feature for a non-existing "non-dog" class and apply (100.0 * image_features @ text_features.T).softmax(dim=-1) as in the multi-class setting? Thanks
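
The arithmetic in this question can be checked with a short sketch. It only uses the numbers quoted above (a scale of 100 and 1/.07); the cosine similarities are hypothetical, and softmax is a plain-Python stand-in for the tensor operation in the usage snippet.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scale = 100.0
t = math.log(scale)        # the t for which exp(t) equals 100
initial_scale = 1 / 0.07   # the initial scaling factor quoted in the question

# Hypothetical cosine similarities between one image and three prompts.
cosine_sims = [0.30, 0.22, 0.18]
probs = softmax([scale * s for s in cosine_sims])
```

With a scale of 100, a gap of 0.08 in cosine similarity becomes a gap of 8 in logit space, which is why the multi-class softmax comes out so peaked.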

opened by arnaudvl 21
add `generate` to coca model

This PR should add the generate method to the CoCa model to add support for generation

based on https://github.com/lucidrains/x-transformers/blob/main/x_transformers/autoregressive_wrapper.py

opened by gpucce 16
Get well adjusted confidence scores from similarity of CLIP encodings

I am using CLIP to check similarity between text and an image. Now for example I have list of words (objects) I want to check against. For example (“elephant”, “tiger”, “giraffe”).

By taking the dot product of the encodings I get the similarity value. To evaluate the “confidence” I take the softmax over the outputs, and it works very well for predicting which class is in the image. But it could happen that the classes are not mutually exclusive. In that case softmax doesn’t make sense. I tried to use sigmoid, as is done in multi-label classification, but it seems to give me values all around 0.55 (so correct classes are around 0.56 and wrong classes around 0.54), for example (0.565, 0.55, 0.62) if elephant and giraffe are in the picture. Thus it is hard to set a threshold there.

I would like to have something like (0.95, 0.05, 0.98) if elephant and giraffe are in the picture, since the similarity is high for both words.

Am I overcomplicating this, or is there a standard way to do this? Is it even possible to get such a well-adjusted confidence score?
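
A small sketch shows why a sigmoid applied directly to cosine similarities clusters near 0.55: the input is confined to [-1, 1], so the output cannot leave roughly [0.27, 0.73]. The similarity values below are hypothetical, chosen to roughly reproduce the numbers quoted above. The shifted_sigmoid helper, with its scale and offset, is an assumption for illustration rather than a standard recipe, and both parameters would need to be tuned on labeled data.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shifted_sigmoid(similarity, scale, offset):
    """Sigmoid of an affinely rescaled similarity (hypothetical calibration)."""
    return sigmoid(scale * (similarity - offset))


# Hypothetical cosine similarities for ("elephant", "tiger", "giraffe").
raw_sims = [0.26, 0.20, 0.49]
plain = [sigmoid(s) for s in raw_sims]

# Upper and lower bounds of sigmoid over the cosine range [-1, 1].
upper_bound = sigmoid(1.0)
lower_bound = sigmoid(-1.0)

# Hypothetical scale and offset; real values would come from validation data.
calibrated = [shifted_sigmoid(s, scale=50.0, offset=0.23) for s in raw_sims]
```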

opened by justlike-prog 15
Loss is constant

I'm using CLIP to train on my custom dataset with the following params:

Dataset size: 50k image-text pairs
Batch size: 128
Image size: 224
GPUs: 1
Epochs: 500

It's been running for a while now, I'm on my 15th epoch, and the loss hasn't changed at all. It isn't a constant number, but it's constantly at 4.8xxx. Should I be concerned? I'm not sure why this is happening.
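
One reference point worth comparing against: for a cross-entropy contrastive loss over a batch of N pairs, a model that scores every pair equally has a loss of ln(N). The sketch below computes that value from uniform logits. Whether this explains the plateau in this particular run is an assumption, not a diagnosis.

```python
import math


def cross_entropy_uniform_logits(batch_size):
    """Cross-entropy when all batch_size candidates receive the same logit."""
    logits = [0.0] * batch_size
    log_normalizer = math.log(sum(math.exp(v) for v in logits))
    # The correct candidate has the same logit as every other candidate.
    return log_normalizer - logits[0]


chance_loss = cross_entropy_uniform_logits(128)  # equals ln(128), about 4.85
```

Averaging the image-to-text and text-to-image terms leaves this value unchanged, so a loss hovering around 4.85 at batch size 128 is consistent with predictions that are no better than chance.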

opened by tarunn2799 14

AttributeError Open CLIP has no attribute "create_model"

I am suddenly having issues loading CLIP models, without any change to our system for almost 2 months. I am getting the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-4-145ac54c6c43> in <module>
    509 if RN101: print("Downloading CLIP Model: RN101");clip_models.append(clip.load('RN101', jit=False)[0].eval().requires_grad_(False).to(device))
    510 
--> 511 if ViTB32_laion2b_e16: print("Downloading CLIP Model: ViT-B/32 laion2b_e16"); clip_models.append(open_clip.create_model('ViT-B-32', pretrained='laion2b_e16').eval().requires_grad_(False).to(device))
    512 if ViTB32_laion400m_e31: print("Downloading CLIP Model: ViT-B-32 laion400m_e31"); clip_models.append(open_clip.create_model('ViT-B-32', pretrained='laion400m_e31').eval().requires_grad_(False).to(device))
    513 if ViTB32_laion400m_32: print("Downloading CLIP Model: ViT-B/32 laion400m_e32"); clip_models.append(open_clip.create_model('ViT-B-32', pretrained='laion400m_e32').eval().requires_grad_(False).to(device))

AttributeError: module 'open_clip' has no attribute 'create_model'

Has this been documented or seen before?

To note, this seems to be happening randomly, as if the whole module is crashing. One load of a CLIP model is fine, and on the next it says there is no create_model method, even though it previously used it just fine.

opened by WASasquatch 12

Add support for gradient accumulation.

Added a new flag --accum-freq (accumulation frequency) which defaults to 1.

If this is greater than 1, then the optimizer is only stepped every --accum-freq batches.

Can be combined with gradient checkpointing.

The feature was requested for cases where people only have a few GPUs but want to train with a large batch.

We don't have to merge if people think it makes things too complicated, and can instead close the PR and point to it upon request, but I'm at least curious to hear thoughts.

For a per-GPU batch size of m and --accum-freq k, the effective per-GPU batch size is mk.

The basic pseudocode, when --accum-freq > 1, is:

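import torch


def train_with_accumulation(model, dataloader, optimizer, get_loss, accum_freq):
    # Sketch only: model, get_loss and the batch format are placeholders.
    accum_data, accum_features = [], []
    for i, data in enumerate(dataloader):
        # First, get the features for a bunch of batches without gradient tracking.
        with torch.no_grad():
            features = model(data)
        accum_data.append(data)
        accum_features.append(features)

        # Keep caching until accum_freq batches have been collected.
        if (i + 1) % accum_freq > 0:
            continue

        # Now re-compute the forward pass for the cached batches, with gradient
        # tracking. Only batch j carries gradients; the rest are cached features.
        optimizer.zero_grad()
        for j, cached in enumerate(accum_data):
            features = model(cached)
            all_features = torch.cat(
                accum_features[:j] + [features] + accum_features[j + 1:]
            )
            loss = get_loss(all_features)
            loss.backward()

        # Step once per accum_freq batches, then reset the caches.
        optimizer.step()
        accum_data, accum_features = [], []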

opened by mitchellnw 11

Naming clash in new CLIP models

I just cloned this repository on a Windows computer and saw the following:

PS C:\Users\585491\documents\research> git clone https://github.com/mlfoundations/open_clip.git
Cloning into 'open_clip'...
remote: Enumerating objects: 1637, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 1637 (delta 25), reused 49 (delta 17), pack-reused 1563
Receiving objects: 100% (1637/1637), 8.06 MiB | 10.91 MiB/s, done.
Resolving deltas: 100% (934/934), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'src/open_clip/model_configs/ViT-G-14.json'
  'src/open_clip/model_configs/ViT-g-14.json'
  'tests/data/output/ViT-G-14_None_fp32_random_image.pt'
  'tests/data/output/ViT-g-14_None_fp32_random_image.pt'
  'tests/data/output/ViT-G-14_None_fp32_random_text.pt'
  'tests/data/output/ViT-g-14_None_fp32_random_text.pt'

It would be nice if the names could be adjusted to be compliant with case-insensitive file systems.
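
A check along these lines could catch such clashes before they reach a case-insensitive file system. This is a minimal sketch; find_case_collisions is a hypothetical helper and the ViT-B-32 path is added only as a non-colliding example.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


paths = [
    "src/open_clip/model_configs/ViT-G-14.json",
    "src/open_clip/model_configs/ViT-g-14.json",
    "src/open_clip/model_configs/ViT-B-32.json",  # hypothetical non-colliding entry
]
collisions = find_case_collisions(paths)
```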

opened by StellaAthena 11

Cannot reproduce your work

Hi team, I am trying to reproduce the numbers of your model on the Conceptual Captions dataset and am not able to reproduce the same numbers.

For example, I obtain an ImageNet zero-shot top-1 validation accuracy of 0.0532 with the RN50 config, compared to the ~0.2 you report in the Loss Curves section of the README.

I used the cc12m dataset, but only obtained about 3M pairs due to limited network access. I used one node with 8 V100 GPUs and a batch size of 150*8=1200, with default parameters (32 epochs).

My command is:

torchrun --nproc_per_node 8 --nnodes=1 --node_rank=0 -m training.main \
    --train-data '/dataset/cc12m/{00000..00640}.tar' \
    --dataset-type webdataset \
    --batch-size 150 \
    --precision amp \
    --workers 4 \
    --imagenet-val=/dataset/imagenet/val \
    --model RN50 \
    --train-num-samples 3056141 \
    --report-to tensorboard

Could you please help me find what's wrong? Maybe it is the small dataset (3M pairs, 25% of cc12m), or other hyperparameters (like lr)?

Could you share the other hyperparameters you used to obtain those results so that I can run the training with the exact same setup, or better yet, share the exact command you used for the training runs behind the reported results? I would really appreciate that.

Thanks a lot.

opened by CloudRR 9
Fix braceexpand memory explosion for complex urls

Currently, handling complex webdataset url patterns in args.train_data can lead to an unnecessary memory explosion when using :: to concatenate multiple data sources. These patterns can be correctly parsed by webdataset, using wds.shardlists.expand_urls(urls).
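
The effect can be illustrated without either library. The sketch below uses a small hand-written expander for numeric {start..end} ranges as a stand-in; it is not the braceexpand or webdataset implementation, and the patterns are hypothetical. Expanding the concatenated string as a whole yields the product of the range sizes, while splitting on :: first yields their sum.

```python
import itertools
import re

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_ranges(pattern):
    """Expand every numeric {start..end} range, zero-padded like start."""
    parts = _RANGE.split(pattern)  # literal, start, end, literal, start, end, ...
    literals = parts[0::3]
    ranges = []
    for start, end in zip(parts[1::3], parts[2::3]):
        width = len(start)
        ranges.append([str(n).zfill(width) for n in range(int(start), int(end) + 1)])
    results = []
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.extend((value, literal))
        results.append("".join(pieces))
    return results


spec = "a/{00..02}.tar::b/{0..1}.tar"  # hypothetical two-source pattern

# Expanding the whole string multiplies the ranges: 3 * 2 entries.
naive = expand_ranges(spec)

# Splitting on "::" first keeps the sources separate: 3 + 2 entries.
per_source = [url for source in spec.split("::") for url in expand_ranges(source)]
```

With two sources of 1,000 shards each, the same difference is 1,000,000 intermediate strings versus 2,000 urls.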

See Issue https://github.com/mlfoundations/open_clip/issues/278.

opened by gabrielilharco 1
Resize embeddings and vocab

I fine-tuned the text encoder of CLIP and added some additional tokens to it. Is there a way to load the checkpoint with a larger embedding size?

opened by ambiSk 0
Is there a way to do multi-label classification with CLIP?

The concrete use case is as follows. I have the classes baby, child, teen, adult. My idea was to use the similarity between text and image features (for text features I used the prompt 'there is at least one (c) in the photo', c being one of the 4 classes).

I went through quite a lot of examples, but I am running into the issue that the similarity scores are often very different for a fixed class, and/or that some classes (like baby and child) might have very similar thresholds. For similarity scores I use the cosine similarity multiplied by 2.5 to stretch the score into the interval [0, 1], as is done in the CLIP Score paper.
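
The rescaling described above can be sketched as follows, assuming the form weight * max(cos, 0) with weight 2.5. The vectors are hypothetical stand-ins for image and text embeddings, and rescaled_score is an illustrative helper, not a function from this codebase.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rescaled_score(image_features, text_features, weight=2.5):
    """Cosine similarity clipped at zero and multiplied by weight."""
    similarity = cosine_similarity(image_features, text_features)
    return weight * max(similarity, 0.0)
```

Note that the result only stays within [0, 1] while the cosine similarity does not exceed 0.4, so the rescaling stretches the typical range of scores rather than guaranteeing a probability.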

Setting a threshold in that sense doesn't seem possible.

Does anyone have an idea for that? I feel quite stuck on how to proceed.

opened by justlike-prog 1
Support for initializing image tower with pretrained weights

Related to #332.

I tried to keep the modifications constrained to factory.py and the configuration that is passed. I tested full initialization with pretrained weights from openai and laion, and also initializing only the image tower. Review is definitely needed and appreciated.

opened by Ja1Zhou 0
Best practice for supporting initialization from pretrained image tower with custom text tower?

An example would be the case described in the Chinese-clip paper. If my understanding is correct, this is currently hard to achieve without downloading and merging separate copies of both towers with custom code. I would like to add this feature, and I wonder if I could get some advice on merging it into the main branch.

opened by Ja1Zhou 1
Add TextTextCLIP

This pull request adds TextTextCLIP (CLIP-like text-to-text contrastive retrieval model) to the main branch. It is still a work in progress.

Tasks

[X] Add a config file for TextTextCLIP

[X] Add TextTextCLIP in model.py

[X] Modify factory.py to load model

[X] Modify data.py to load text data

[X] Modify main.py to train TextTextCLIP

[X] Test loading TextTextCLIP

[X] Test loading text-pair data.

[X] Test dummy training

[X] Rename variables
opened by lingjzhu 0

Releases(v2.9.1)

v2.9.1(Dec 29, 2022)
v2.8.2(Dec 17, 2022)
v2.8.1(Dec 15, 2022)
v2.8.0(Dec 14, 2022)
v2.7.0(Nov 18, 2022)
v2.6.1(Nov 17, 2022)
v2.6.0(Nov 17, 2022)
v2.5.0(Nov 14, 2022)
v2.4.1(Nov 10, 2022)
v2.4.0(Nov 10, 2022)
v2.3.1(Nov 7, 2022)
v2.3.0(Nov 7, 2022)
v2.2.0(Nov 7, 2022)
v2.0.2(Sep 16, 2022)
v2.0.1(Sep 15, 2022)
v2.0.0(Sep 15, 2022)
v1.3.0(Jun 3, 2022)
v1.2.1(May 21, 2022)
v1.2.0(May 20, 2022)
v1.1.1(May 15, 2022)
v1.0.1(Apr 26, 2022)
v0.2.1(Apr 8, 2022)

v0.2-weights(Apr 1, 2022)

This release tag is being used to host weights for various models trained with this codebase.

NOTE: The one included metric, zero-shot top-1 on ImageNet-1k, does not capture the full characteristics of the given pretrained weights. Evaluation on a broader set of zero-shot and validation tasks is required for a full comparison.

| model | dataset | weights | In1k zero-shot top-1 |
| --- | --- | --- | --- |
| RN50 | CC12M | rn50-quickgelu-cc12m | 36.45 |
| RN50 | YFCC15M | rn50-quickgelu-yfcc15m | 32.73 |
| RN101 | YFCC15M | rn101-quickgelu-yfcc15m | 34.86 |
| ViT-B-32 | LAION-400M | vit_b_32-quickgelu-laion400m_e31 | 62.96 |
| ViT-B-32 | LAION-400M | vit_b_32-quickgelu-laion400m_e32 | 62.94 |
| ViT-B-32 | LAION-2B | vit_b_32-laion2b_e16 | 65.62 |
| ViT-B-16 | LAION-400M | vit_b_16-laion400m_e31 | 66.98 |
| ViT-B-16 | LAION-400M | vit_b_16-laion400m_e32 | 67.07 |
| ViT-B-16-plus-240 | LAION-400M | vit_b_16_plus_240-laion400m_e31 | 69.06 |
| ViT-B-16-plus-240 | LAION-400M | vit_b_16_plus_240-laion400m_e32 | 69.21 |
| ViT-L-14 | LAION-400M | vit_l_14-laion400m_e31 | 72.70 |
| ViT-L-14 | LAION-400M | vit_l_14-laion400m_e32 | 72.77 |

rn101-quickgelu-yfcc15m-3e04b30e.pt(457.09 MB)
rn50-quickgelu-cc12m-f000538c.pt(389.40 MB)
rn50-quickgelu-yfcc15m-455df137.pt(389.40 MB)
vit_b_16-laion400m_e31-00efa78f.pt(570.80 MB)
vit_b_16-laion400m_e32-55e67d44.pt(570.80 MB)
vit_b_16_plus_240-laion400m_e31-8fb26589.pt(794.94 MB)
vit_b_16_plus_240-laion400m_e32-699c4b84.pt(794.94 MB)
vit_b_32-laion2b_e16-af8dbd0c.pth(577.12 MB)
vit_b_32-quickgelu-laion400m_avg-8a00ab3c.pt(577.12 MB)
vit_b_32-quickgelu-laion400m_e31-d867053b.pt(577.12 MB)
vit_b_32-quickgelu-laion400m_e32-46683a32.pt(577.12 MB)
vit_l_14-laion400m_e31-69988bb6.pt(1631.30 MB)
vit_l_14-laion400m_e32-3d133497.pt(1631.30 MB)
v0.1(Jul 28, 2021)

Welcome to the initial release of open_clip, an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset.

