An open source implementation of CLIP.


OpenCLIP

Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. Specifically, a ResNet-50 model trained with our codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet. OpenAI's CLIP model reaches 31.3% when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the Conceptual Captions dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy.

As we describe in more detail below, CLIP models in a medium accuracy regime already allow us to draw conclusions about the robustness of larger CLIP models since the models follow reliable scaling laws.

This codebase is a work in progress, and we invite everyone to contribute to making it more accessible and useful. In the future, we plan to add support for TPU training and release larger models. We hope this codebase facilitates and promotes further research in contrastive image-text learning.

Note that src/clip is a copy of OpenAI's official repository with minimal changes.

Data

Conceptual Captions

OpenCLIP reads a CSV file with two columns: a path to an image, and a text caption. The names of the columns are passed as arguments to main.py (--csv-img-key and --csv-caption-key).
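
As a rough illustration (not part of the repository; the column names filepath and title simply match the --csv-img-key and --csv-caption-key values used in the sample training command below), such a CSV could be produced with pandas:

import pandas as pd

# Hypothetical two-column CSV that main.py can read.
# The column names are arbitrary; they just have to match --csv-img-key / --csv-caption-key.
samples = [
    {"filepath": "cc_data/train/00/0000.jpg", "title": "a dog playing in the snow"},
    {"filepath": "cc_data/train/00/0001.jpg", "title": "a red bicycle leaning against a wall"},
]
pd.DataFrame(samples).to_csv("train_data.csv", index=False)
# If you use a comma separator like this, pass --csv-separator "," to main.py.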

The script src/data/gather_cc.py will collect the Conceptual Captions images. First, download the Conceptual Captions URLs and then run the script from our repository:

python3 src/data/gather_cc.py path/to/Train_GCC-training.tsv path/to/Validation_GCC-1.1.0-Validation.tsv

Our training set contains 2.89M images, and our validation set contains 13K images.

YFCC and other datasets

In addition to specifying the training data via CSV files as mentioned above, our codebase also supports webdataset, which is recommended for larger scale datasets. The expected format is a series of .tar files. Each of these .tar files should contain two files for each training example, one for the image and one for the corresponding text. Both files should have the same name but different extensions. For instance, shard_001.tar could contain files such as abc.jpg and abc.txt. You can learn more about webdataset at https://github.com/webdataset/webdataset. We use .tar files with 1,000 data points each, which we create using tarp.
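
To make the expected shard layout concrete, here is a minimal sketch using only the Python standard library (the repository itself uses tarp for this; the keys, captions, and shard name below are placeholders):

import io
import tarfile

def write_shard(shard_path, samples):
    # samples: iterable of (key, image_bytes, caption) tuples.
    # Each sample becomes two files in the tar, e.g. abc.jpg and abc.txt.
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, caption in samples:
            for ext, payload in ((".jpg", image_bytes), (".txt", caption.encode("utf-8"))):
                info = tarfile.TarInfo(name=key + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# e.g. write_shard("shard_001.tar", samples) with up to 1,000 samples per shard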

You can download the YFCC dataset from Multimedia Commons. Similar to OpenAI, we used a subset of YFCC to reach the aforementioned accuracy numbers. The indices of images in this subset are in OpenAI's CLIP repository.

Training CLIP

Install dependencies

conda env create -f environment.yml
source activate open_clip

Add the src directory to your PYTHONPATH:

cd open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"

Sample running code:

nohup python -u src/training/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv"  \
    --val-data="/path/to/validation_data.csv"  \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50

Note: imagenet-val is the path to the ImageNet validation set, used for zero-shot evaluation (not the training set!). You can remove this argument if you do not want to perform zero-shot evaluation on ImageNet throughout training. Note that the val folder should contain subfolders. If it does not, please use this script.
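
The "subfolders" note refers to the usual class-per-subfolder layout expected by torchvision's ImageFolder. As a quick sanity check (an illustration, not code from this repository; adjust the path to your setup):

from torchvision import datasets

# Expected layout: /path/to/imagenet/root/val/<class_folder>/<image>.JPEG
val = datasets.ImageFolder("/path/to/imagenet/root/val/")
print(len(val.classes), "classes,", len(val), "images")  # ImageNet val: 1000 classes, 50000 images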

When run on a machine with 8 GPUs the command should produce the following training curve for Conceptual Captions:

CLIP zero shot training curve

More detailed curves for Conceptual Captions are given at /docs/clip_conceptual_captions.md.

When training an RN50 on YFCC, the same hyperparameters as above are used, with the exception of lr=5e-4 and epochs=32.

To use another model, such as ViT-B/32, RN50x4, RN50x16, or ViT-B/16, specify it with the --model flag, e.g. --model RN50x4.

Launch tensorboard:

tensorboard --logdir=logs/tensorboard/ --port=7777

Sample resuming from a checkpoint:

python src/training/main.py \
    --train-data="/path/to/train_data.csv" \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

Sample evaluation only:

python src/training/main.py \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

Trained models

You can find our ResNet-50 trained on YFCC-15M here.

Scaling trends

The plot below shows how zero-shot performance of CLIP models varies as we scale the number of samples used for training. Zero-shot performance increases steadily for both ImageNet and ImageNetV2, and is far from saturated at ~15M samples.

Why are low-accuracy CLIP models interesting?

TL;DR: CLIP models have high effective robustness, even at small scales.

CLIP models are particularly intriguing because they are more robust to natural distribution shifts (see Section 3.3 in the CLIP paper). This phenomenon is illustrated by the figure below, with ImageNet accuracy on the x-axis and ImageNetV2 (a reproduction of the ImageNet validation set with distribution shift) accuracy on the y-axis. Standard training denotes training on the ImageNet train set, and the CLIP zero-shot models are shown as stars.

CLIP scatter plot

As observed by Taori et al., 2020 and Miller et al., 2021, the in-distribution and out-of-distribution accuracies of models trained on ImageNet follow a predictable linear trend (the red line in the above plot). Effective robustness quantifies robustness as accuracy beyond this baseline, i.e., how far a model lies above the red line. Ideally a model would not suffer from distribution shift and fall on the y = x line (trained human labelers are within a percentage point of the y = x line).
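
To make "accuracy beyond this baseline" concrete, here is a small illustrative sketch (the accuracy numbers are made up, and the published analyses fit the trend on transformed accuracies rather than raw ones, which is omitted here for simplicity):

import numpy as np

# Hypothetical (ImageNet, ImageNetV2) accuracies for standard ImageNet-trained models.
imagenet_acc = np.array([0.60, 0.70, 0.76, 0.80])
imagenetv2_acc = np.array([0.47, 0.57, 0.64, 0.69])

# The "red line": a linear fit to the standard models.
slope, intercept = np.polyfit(imagenet_acc, imagenetv2_acc, deg=1)

def effective_robustness(in_dist_acc, out_dist_acc):
    # Accuracy above the baseline predicted from in-distribution accuracy.
    return out_dist_acc - (slope * in_dist_acc + intercept)

# A hypothetical zero-shot model with the same ImageNet accuracy but higher
# ImageNetV2 accuracy has positive effective robustness, i.e. it sits above the red line.
print(effective_robustness(0.60, 0.53))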

Even though the CLIP models trained with this codebase achieve much lower accuracy than those trained by OpenAI, our models still lie on the same trend of improved effective robustness (the purple line). Therefore, we can study what makes CLIP robust without requiring industrial-scale compute.

For more information on effective robustness, please see Taori et al., 2020 and Miller et al., 2021.

The Team

We are a group of researchers at UW, Google, Stanford, Amazon, Columbia, and Berkeley.

Gabriel Ilharco*, Mitchell Wortsman*, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi, Ludwig Schmidt

Special thanks to Jong Wook Kim and Alec Radford for help with reproducing CLIP!

Citing

If you found this repository useful, please consider citing:

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}
@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}


Issues
  • Possible to finetune?

    Is it possible to finetune from the existing Open AI checkpoints rather than train them from scratch with this codebase?

    opened by afiaka87 8
  • Generating prompts from an image

    so - I've been looking into some code for VQGAN https://github.com/mehdidc/feed_forward_vqgan_clip https://github.com/nerdyrodent/VQGAN-CLIP

    and they let the user pass a prompt to style / generate an image. Here are some examples using code from @nerdyrodent https://github.com/nerdyrodent/VQGAN-CLIP/issues/13

    Must see - https://twitter.com/e08477/status/1418440857578098691?s=21. Here there are only 4 images generated with a prompt, e.g. mushroom, spaceship, volcano, old English house on a hill (might be wrong). But then as you look down - these have predicate prompts that style / shape the image differently.

    Mushroom + marble sculpture.

    What I want is to give an image to CLIP and have it tell me what it thinks the words should be. Is this feasible / achievable? Does this repo provide any way into this? Does it need dimensionality reduction? It is like a t-SNE problem (showing word2vec in 2 dimensions?) - but under the hood it's 512 dimensions? I'm yet to look at the code - maybe it will become clearer.

    opened by johndpope 5
  • Usage of title and/or description column in YFCC100M

    Hello,

    In your training of CLIP, did you use only the description column as text input, or both the title and description columns?

    The reason I am asking is because in the github folder where OpenAI provide info on their YFCC100M subset, there is a sentence that I find quite ambiguous:

    [...] which have been filtered to only keep those with natural language titles and/or descriptions in English

    This seems to imply that it sufficed that only one of title and description was considered natural language for an observation (image) to be kept as part of the subset. However, they do not clarify whether they also proceeded to use the results of this natural language filter to choose whether to use only the title or only the description in the case that one of them was not deemed to be natural language. Alternatively, they may have concatenated the columns and used both of them in training.

    Anyway, what I'm interested in knowing here is what you guys decided to do in your training. Did you use both columns or just the description?

    Also, did you clean the text in any manner (e.g. remove html tags present in the text)?

    opened by Lauler 4
  • Add ROOT to files written in gather_cc

    Hi again,

    Would it make sense to prepend ROOT to the filepaths in the CSV file? After running gather_cc.py the files end up in the folder cc_data (e.g. cc_data/val/00/0123.jpg), but the path in the CSV file is only val/00/0123.jpg.

    BR Andreas

    opened by fuersta 3
  • _transform() got an unexpected keyword argument 'is_train'

    Hi, I was trying to train this model when I ran into this issue: "_transform() got an unexpected keyword argument 'is_train'".

    Any insight into what might be wrong? Thanks a lot!

    opened by cyy857 3
  • `logit_scale` in `CLIP`

    Thanks for preparing this repo. I was wondering how self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)) was decided? I mean, where is the value np.log(1 / 0.07) inspired from?
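
    (For illustration, not from this repository's docs: the parameter stores the log of the inverse temperature, and 0.07 is the initial temperature value used in the CLIP paper.)

    import numpy as np
    import torch
    import torch.nn as nn

    # Storing the log keeps exp(logit_scale) positive while the raw parameter is unconstrained.
    logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
    print(logit_scale.exp().item())  # ~14.29, i.e. cosine similarities are scaled by 1/0.07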

    opened by sarahESL 2
  • Conceptual Captions Faster R-CNN features

    Hi, a sincere request: since it is very time-consuming, could you kindly provide the extracted Faster R-CNN features for the Conceptual Captions dataset via Drive or Dropbox? Thanks :)

    opened by Abhiram4572 2
  • Massive GPU memory usage during evaluation

    Machine setup

    Google Cloud VM
    Debian 10
    16-core CPU, 60 GB of RAM
    4 NVIDIA T4 GPUs
    

    Error

    Traceback (most recent call last):
      File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/home/jupyter/open_clip/src/training/main.py", line 192, in main_worker
        evaluate(model, data, 0, args, writer, 0)
      File "/home/jupyter/open_clip/src/training/train.py", line 197, in evaluate
        torch.cat(all_image_features), torch.cat(all_text_features)
      File "/home/jupyter/open_clip/src/training/train.py", line 228, in get_metrics
        logits_per_image = image_features @ text_features.t()
    RuntimeError: CUDA out of memory. Tried to allocate 2269.88 GiB (GPU 0; 14.76 GiB total capacity; 7.11 GiB already allocated; 6.67 GiB free; 7.17 GiB reserved in total by PyTorch)
    

    The script I use :

    python -u src/training/main.py \
        --save-frequency 1 \
        --zeroshot-frequency 3 \
        --train-data "src/df_openclip_train.csv"  \
        --val-data "src/df_openclip_val.csv"  \
        --openai-pretrained \
        --csv-separator "," \
        --csv-img-key image_path \
        --csv-caption-key product_name \
        --warmup 10000 \
        --batch-size=128 \
        --lr=1e-3 \
        --wd=0.1 \
        --epochs=30 \
        --workers=4 \
        --model ViT-B/32
    

    Full setting

    2021-09-01,05:15:43 | INFO | Rank 0 | Params:
    2021-09-01,05:15:43 | INFO | Rank 0 |   C: 3.16
    2021-09-01,05:15:43 | INFO | Rank 0 |   aggregate: True
    2021-09-01,05:15:43 | INFO | Rank 0 |   batch_size: 128
    2021-09-01,05:15:43 | INFO | Rank 0 |   beta1: 0.9
    2021-09-01,05:15:43 | INFO | Rank 0 |   beta2: 0.98
    2021-09-01,05:15:43 | INFO | Rank 0 |   checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/checkpoints
    2021-09-01,05:15:43 | INFO | Rank 0 |   copy_codebase: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   csv_caption_key: product_name
    2021-09-01,05:15:43 | INFO | Rank 0 |   csv_img_key: image_path
    2021-09-01,05:15:43 | INFO | Rank 0 |   csv_separator: ,
    2021-09-01,05:15:43 | INFO | Rank 0 |   dataset_type: auto
    2021-09-01,05:15:43 | INFO | Rank 0 |   debug: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   dist_backend: nccl
    2021-09-01,05:15:43 | INFO | Rank 0 |   dist_url: tcp://127.0.0.1:6100
    2021-09-01,05:15:43 | INFO | Rank 0 |   distributed: True
    2021-09-01,05:15:43 | INFO | Rank 0 |   dp: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   epochs: 30
    2021-09-01,05:15:43 | INFO | Rank 0 |   eps: 1e-06
    2021-09-01,05:15:43 | INFO | Rank 0 |   gpu: 0
    2021-09-01,05:15:43 | INFO | Rank 0 |   imagenet_v2: None
    2021-09-01,05:15:43 | INFO | Rank 0 |   imagenet_val: None
    2021-09-01,05:15:43 | INFO | Rank 0 |   log_level: 20
    2021-09-01,05:15:43 | INFO | Rank 0 |   log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/out.log
    2021-09-01,05:15:43 | INFO | Rank 0 |   logs: ./logs/
    2021-09-01,05:15:43 | INFO | Rank 0 |   lr: 0.001
    2021-09-01,05:15:43 | INFO | Rank 0 |   model: ViT-B/32
    2021-09-01,05:15:43 | INFO | Rank 0 |   multigpu: None
    2021-09-01,05:15:43 | INFO | Rank 0 |   name: lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41
    2021-09-01,05:15:43 | INFO | Rank 0 |   ngpus_per_node: 4
    2021-09-01,05:15:43 | INFO | Rank 0 |   openai_pretrained: True
    2021-09-01,05:15:43 | INFO | Rank 0 |   precision: amp
    2021-09-01,05:15:43 | INFO | Rank 0 |   rank: 0
    2021-09-01,05:15:43 | INFO | Rank 0 |   regression_frequency: 2
    2021-09-01,05:15:43 | INFO | Rank 0 |   report_to: 
    2021-09-01,05:15:43 | INFO | Rank 0 |   resume: None
    2021-09-01,05:15:43 | INFO | Rank 0 |   save_frequency: 1
    2021-09-01,05:15:43 | INFO | Rank 0 |   skip_aggregate: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   skip_scheduler: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   tensorboard: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   tensorboard_path: 
    2021-09-01,05:15:43 | INFO | Rank 0 |   train_data: src/df_openclip_train.csv
    2021-09-01,05:15:43 | INFO | Rank 0 |   use_bn_sync: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   val_data: src/df_openclip_val.csv
    2021-09-01,05:15:43 | INFO | Rank 0 |   wandb: False
    2021-09-01,05:15:43 | INFO | Rank 0 |   wandb_notes: 
    2021-09-01,05:15:43 | INFO | Rank 0 |   warmup: 10000
    2021-09-01,05:15:43 | INFO | Rank 0 |   wd: 0.1
    2021-09-01,05:15:43 | INFO | Rank 0 |   workers: 4
    2021-09-01,05:15:43 | INFO | Rank 0 |   world_size: 4
    2021-09-01,05:15:43 | INFO | Rank 0 |   zeroshot_frequency: 3
    2021-09-01,05:15:47 | INFO | Rank 0 | Use GPU: 0 for training
    2021-09-01,05:15:47 | INFO | Rank 1 | Use GPU: 1 for training
    2021-09-01,05:15:47 | INFO | Rank 2 | Use GPU: 2 for training
    2021-09-01,05:15:47 | INFO | Rank 3 | Use GPU: 3 for training
    

    Info about the data:

    The training data consists of 2.9 million text-image pairs; the validation data consists of 780k text-image pairs.

    Potential cause of the error

    The get_metrics function is called on the whole evaluation-set embeddings at once, which is massive. In my case the matrix multiplication involves two matrices of size 780k x 512, so the resulting 780k x 780k logit matrix requires over 2,000 GiB of GPU memory.
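
    One way around this (a rough sketch, not the repository's get_metrics) is to compute the retrieval ranks in row chunks so the full N x N logit matrix is never materialized:

    import torch

    def chunked_ranks(image_features, text_features, chunk_size=1024):
        # Rank of the matching caption for each image, computed chunk by chunk.
        ranks = []
        n = image_features.shape[0]
        for start in range(0, n, chunk_size):
            end = min(start + chunk_size, n)
            logits = image_features[start:end] @ text_features.t()  # (chunk, N) instead of (N, N)
            rows = torch.arange(end - start, device=logits.device)
            cols = torch.arange(start, end, device=logits.device)
            gt_scores = logits[rows, cols]                           # score of the matching caption
            ranks.append((logits > gt_scores[:, None]).sum(dim=1))   # 0 means rank 1
        return torch.cat(ranks)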

    opened by vinson2233 2
  • Results of using different learning rates and more training epochs

    Very nice code!

    I'm able to reproduce the zero-shot results on imagenet using cc3m (2,862,387 images in total for me) and the provided sample code.

    I'd like to ask if you have tried learning rates other than 1e-3 for batch size 128? Would you be able to give more insight on how you ended up using lr=1e-3?

    Also, I'd like to know if you have tried more training epochs, i.e. larger than 30. I'm curious if training with more epochs would help improve the zero-shot accuracy.

    opened by KaiyangZhou 2
  • Avenue for exploration - augmenting training set with colour palettes / texture names /more meta data

    so part of the fun with clip is using it in conjunction with VQGAN. This allows the prompts to generate images.

    There's something lost in this translation, though. They say a picture is worth a thousand words - but what if some extra data was injected into the training?

    It could be, say, textures / maybe even geometric descriptions / metadata.

    opened by johndpope 1
  • Passing --imagenet-val (or --imagenet-v2) without --val crashes unnecessarily

    In the current repository, you can evaluate a pretrained model by running

    python src/training/main.py \
        --val-data="/path/to/validation_data.csv"  \
        --resume /path/to/checkpoints/epoch_K.pt
    

    However, if you try to do the same thing to get just the imagenet-val (or imagenet-v2) accuracy

    python src/training/main.py \
        --imagenet-val="/path/to/imagenet/val"  \
        --resume /path/to/checkpoints/epoch_K.pt
    

    then it crashes:

    Traceback (most recent call last):
      File "src/training/main.py", line 307, in <module>
        main()
      File "src/training/main.py", line 296, in main
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, log_queue, args))
      File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception: 
    
    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/home/ncarlini/open_clip/src/training/main.py", line 189, in main_worker
        evaluate(model, data, start_epoch, args, writer, 0)
      File "/home/ncarlini/open_clip/src/training/train.py", line 159, in evaluate
        dataloader = data['val'].dataloader
    KeyError: 'val'
    

    It should be possible to get ImageNet accuracy without using a val dataset.

    enhancement good first issue 
    opened by carlini 0
  • Add option for zero-shot on ImageNetR, Sketch, etc...

    enhancement good first issue 
    opened by mitchellnw 1
  • CLIP training in Jax.

    Would be nice if we could add a jax_src folder which supported training CLIP models in Jax.

    This would also help with https://github.com/mlfoundations/open_clip/issues/20.

    enhancement good first issue 
    opened by mitchellnw 0
  • TPU support.

    Would be nice if this repo supported training on TPUs.

    enhancement good first issue 
    opened by mitchellnw 0
  • Loss is constant

    I'm using CLIP to train on my custom dataset with the following params:

    Dataset size: 50k image-text pairs, batch size: 128, image size: 224, GPUs: 1, epochs: 500

    It's been running for a while now, I'm on my 15th epoch, and the loss hasn't changed at all. It isn't a constant number, but it's constantly at 4.8xxx. Should I be concerned? I'm not sure why this is happening.


    opened by tarunn2799 13
  • Expected time/epoch for conceptual captions (R50)

    How long is a reasonable time for an epoch using 8 workers? I'm seeing about 8 hours/epoch for the ResNet-50. Launch command from the README:

    nohup python -u src/training/main.py \
        --save-frequency 1 \
        --zeroshot-frequency 1 \
        --report-to tensorboard \
        --train-data="/path/to/train_data.csv"  \
        --val-data="/path/to/validation_data.csv"  \
        --csv-img-key filepath \
        --csv-caption-key title \
        --imagenet-val=/path/to/imagenet/root/val/ \
        --warmup 10000 \
        --batch-size=128 \
        --lr=1e-3 \
        --wd=0.1 \
        --epochs=30 \
        --workers=8 \
        --model RN50
    

    Thank you!

    opened by piotr-teterwak 1
  • Performance of VIT-B/32 is worse than RN50 on CC3M

    Here are my curves. RN50 roughly matches the one shown in the repo, but the ViT-B/32 is worse. I am using the hyperparams from the README. I am wondering, could you also share the performance curves of ViT-B/32 on CC?

    opened by JACKHAHA363 1
  • training perf for single GPU is not good

    Hi, I was training CLIP using a single GPU. After profiling, I noticed that the performance of CLIP training was not good: GPU idle time is almost twice the GPU active time due to sem_timedwait blocking on the CPU. Any idea how we can solve this unnecessary blocking? Thanks!

    opened by cyy857 4
  • scripts of training on multiple nodes

    Hi, is there an easy-to-use script for training CLIP on multiple nodes? I can set up training on one node (8 GPUs) now, but I need to test the scaling efficiency. Thanks for any insight!

    enhancement good first issue 
    opened by cyy857 1
Releases
  • v0.1 (Jul 28, 2021)

    Welcome to the initial release of open_clip, an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

    The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset.

    Source code(tar.gz)
    Source code(zip)