CLIP (Contrastive Language–Image Pre-training) for Italian

Overview

Italian CLIP

Links: YouTube Video · HuggingFace Spaces · Open In Colab · Medium Blog Post

CLIP (Radford et al., 2021) is a multimodal model that can learn to represent images and text jointly in the same space.

In this project, we propose the first CLIP model trained on Italian data; in this context, Italian can be considered a low-resource language. Using a few targeted techniques, we were able to fine-tune a SOTA Italian CLIP model with only 1.4 million training samples. Our Italian CLIP model is built upon the pre-trained Italian BERT model provided by dbmdz and the OpenAI vision transformer.

In building this project we kept in mind the following principles:

  • Novel Contributions: We created an impressive dataset of ~1.4 million Italian image-text pairs (that we will share with the community) and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
  • Scientific Validity: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on two tasks and made the validation reproducible for everybody.
  • Broader Outlook: We always kept in mind the possible uses and limitations of this model.

We put our hearts and souls into the project during this week! Not only did we work on a cool project, but we were able to make new friends and learn a lot from each other to work towards a common goal! Thank you for this amazing opportunity, we hope you will like the results! ❤️

Pre-print available here

@article{bianchi2021contrastive,
  title={Contrastive Language-Image Pre-training for the Italian Language},
  author={Bianchi, Federico and Attanasio, Giuseppe and Pisoni, Raphael and Terragni, Silvia and Sarti, Gabriele and Lakshmi, Sri},
  journal={arXiv preprint arXiv:2108.08688},
  year={2021}
}

HuggingFace Spaces demo available here.

What you will find in the demo:

  • Text to Image: This task is essentially an image retrieval task. The user inputs a string of text and CLIP computes the similarity between this text and each image in a set of images. The webapp then displays the images with the highest similarity to the text query.

text2image

  • Image to Text: This task is essentially a zero-shot image classification task. The user provides an image and a set of captions/labels, and CLIP computes the similarity between the image and each label. The webapp then displays a probability distribution over the captions.

image2text

  • Localization: This is a very cool feature 😎 and, to the best of our knowledge, a novel contribution. We can use CLIP to find where "something" (like a "cat") is in an image. The location of the object is computed by masking different areas of the image and looking at how the similarity to the image description changes (a rough sketch of this idea is shown below).

localization localization2
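
To make the masking idea concrete, here is a minimal sketch (not the demo's actual code): `encode_image` and `encode_text` are hypothetical helpers assumed to return L2-normalized CLIP embeddings, and the mask size and stride are illustrative.

```python
import numpy as np

def localization_heatmap(image, caption, encode_image, encode_text,
                         patch=32, stride=32):
    """Score each region by how much masking it reduces image-caption similarity."""
    text_emb = encode_text(caption)                      # (d,), L2-normalized
    base_sim = float(encode_image(image) @ text_emb)     # similarity of the full image

    h, w = image.shape[:2]
    heatmap = np.zeros((h, w))
    counts = np.zeros((h, w))

    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            masked = image.copy()
            masked[y:y + patch, x:x + patch] = 0          # black out one area
            sim = float(encode_image(masked) @ text_emb)
            # A large drop in similarity means the masked area matters for the caption.
            heatmap[y:y + patch, x:x + patch] += base_sim - sim
            counts[y:y + patch, x:x + patch] += 1

    return heatmap / np.maximum(counts, 1)
```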

Novel Contributions

The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian. We indeed worked in a low-resource setting. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT. To get competitive results, we followed three strategies:

  1. more and better data;
  2. better augmentations;
  3. better training strategies.

For those interested, we have a ☄️ Comet report that shows a subset of the experiments we ran. Different hyper-parameters played a role in reducing the validation loss. The optimizer we used gave us great performance and fast convergence; more data and augmentations helped a lot with generalization; and the work on the training procedure and on the loss gave us the final improvement that you can see in the results.

More and Better Data

We eventually had to deal with the fact that we do not have the same data that OpenAI had during the training of CLIP. Thus, we tried to add as much data as possible while keeping its quality as high as possible.

We considered four main sources of data:

  • WIT is an image-caption dataset collected from Wikipedia (see Srinivasan et al., 2021). We focused on the Reference Description captions described in the paper, as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994), and this kind of text, without more context, is not useful for learning a good mapping between images and captions. To avoid polluting the data with captions that are not meaningful, we ran POS tagging on the text and removed all the captions composed of 80% or more proper nouns (PROPN), around ~10% of the data. This simple filter allowed us to retain most of the dataset without introducing noise (see the sketch after this list).

    Captions like 'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey' have been removed.

  • MSCOCO-IT. This image-caption dataset comes from the work by Scaiella et al., 2019. The captions come from the original MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than 100K images, and more than one caption is available for each image.

  • Conceptual Captions. This image-caption dataset comes from the work by Sharma et al., 2018. It contains more than 3 million image-caption pairs collected from the web. We downloaded the images using the URLs provided with the dataset, but we could not retrieve all of them. We then had to translate the captions to Italian; in the end, we were able to collect a dataset of 700K translated captions.

  • La Foto del Giorno. This image-caption dataset is collected from Il Post, a prominent Italian online newspaper. The collection contains almost 30K pairs: starting from early 2011, for each day the editors at Il Post pick several images picturing the most salient events in the world. Each photo comes with an Italian caption.
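
As a concrete illustration of the PROPN filter mentioned in the WIT item above, here is a minimal sketch; it assumes spaCy's Italian pipeline as the POS tagger, since the write-up does not specify which tagger was used.

```python
import spacy

# Assumption: spaCy's Italian pipeline is used as the POS tagger.
nlp = spacy.load("it_core_news_sm")

def is_mostly_proper_nouns(caption: str, threshold: float = 0.8) -> bool:
    """Return True if at least `threshold` of the non-punctuation tokens are PROPN."""
    tokens = [t for t in nlp(caption) if not t.is_punct and not t.is_space]
    if not tokens:
        return True
    propn = sum(t.pos_ == "PROPN" for t in tokens)
    return propn / len(tokens) >= threshold

captions = ["Dora Riparia", "Un gatto dorme su una sedia rossa"]
kept = [c for c in captions if not is_mostly_proper_nouns(c)]  # drops the first caption
```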

A Note on Translations

Instead of relying on open-source translators, we decided to use DeepL. Translation quality was the main reason for this choice. With far fewer images than OpenAI had, we cannot risk polluting our own data. Conceptual Captions is a great resource, but its captions have to be handled accordingly. We translated 700K captions and evaluated their quality.

Three of us looked at a sample of 100 translations and rated them with scores from 1 to 4. The meaning of each value is as follows: 1, the sentence has lost its meaning, or it is not possible to understand it; 2, it is possible to get the idea, but there is something wrong; 3, good, although a native speaker might complain about some choices; 4, good translation.

The average score was 3.78, and the three annotators had an inter-rater agreement - computed with Gwet's AC1 using ordinal weighting (Gwet, 2008) - of 0.858 (great agreement!).

| English Captions | Italian Captions |
| --- | --- |
| an endless cargo of tanks on a train pulled down tracks in an empty dry landscape | un carico infinito di carri armati su un treno trascinato lungo i binari in un paesaggio secco e vuoto |
| person walking down the aisle | persona che cammina lungo la navata |
| popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |

If the table above doesn't show, you can have a look at it here.

We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so that those interested can check the quality. The Google Sheet is here.

Better Augmentations

We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data-efficient. They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. However, we kept the hue augmentations limited, so the model could still learn color definitions.
While we would have liked to augment the captions as well, after some experimentation we settled on randomly sampling one of the five captions available in MSCOCO and leaving the rest of the captions unmodified.
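
A sketch of what such an image augmentation pipeline can look like with torchvision is shown below; the specific magnitudes are illustrative placeholders, not the values used in training (those live in the repository configuration).

```python
from torchvision import transforms

# Illustrative values only; note the deliberately small hue jitter so that
# colour words remain learnable.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.3),
    transforms.RandomEqualize(p=0.2),  # occasional histogram equalization
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.02),
    transforms.ToTensor(),
])
```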

Better Training

After different trials, we realized that the usual way of training this model was not good enough to get good results. We thus modified three different parts of the training pipeline: the optimizer, the training with frozen components, and the fixed logit_scale parameter.

Optimizer

While the initial code used AdamW as an optimizer, we soon noticed that it introduced some bad properties into the training. The model started to overfit relatively quickly, and the weight decay made this effect worse. We eventually decided to use an optimization strategy that had worked well for us in similar cases: AdaBelief with Adaptive Gradient Clipping (AGC) and a cosine annealing schedule. Together with a slight tuning of the learning rate, this helped us reduce the validation loss by more than 25%. Our implementation is available online here.
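
As a rough approximation of this recipe, the same ingredients can be put together with optax; the clipping threshold and the number of decay steps below are placeholders, not our exact settings.

```python
import optax

# Placeholder: in practice set from dataset size, batch size and epochs.
total_steps = 10_000

# Cosine annealing of the learning rate, starting from the value used in training.
schedule = optax.cosine_decay_schedule(init_value=1e-5, decay_steps=total_steps)

# AGC clips each gradient relative to the norm of its parameter,
# then AdaBelief performs the actual update.
optimizer = optax.chain(
    optax.adaptive_grad_clip(0.01),
    optax.adabelief(learning_rate=schedule),
)
```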

Backbone Freezing

The ViT used by OpenAI was already trained on 400 million images, and it is the element of our architecture that probably requires the least amount of training. The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without disturbing the tuned weights of the backbones, we first trained with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss. A sketch of this two-stage setup follows the figure below.

backbone_freezing
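
A minimal sketch of the two-stage setup with optax is shown here; the top-level parameter names ("text_model", "vision_model") are assumptions, and the real names depend on the FlaxHybridCLIP parameter tree.

```python
import flax
import optax

def label_params(params):
    """Label backbone parameters as 'frozen' and everything else as 'trainable'."""
    flat = flax.traverse_util.flatten_dict(params)
    labels = {
        path: "frozen" if path[0] in ("text_model", "vision_model") else "trainable"
        for path in flat
    }
    return flax.traverse_util.unflatten_dict(labels)

# Stage 1: backbones frozen (their updates are set to zero), projections train.
stage1_tx = optax.multi_transform(
    {"trainable": optax.adabelief(1e-5), "frozen": optax.set_to_zero()},
    label_params,  # callable mapping the param pytree to labels
)
# Stage 2: once the projection layers have converged, rebuild the optimizer with
# everything labelled "trainable" to fine-tune all the components.
```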

Logit Scale

We tried to improve the loss function in different ways: for example, we tried something similar to a margin-based loss, but those experiments did not yield the results we hoped for. Eventually, what worked best was fixing the logit_scale value to 20. This value is used after the computation of the similarity between the images and the texts in CLIP (see the code here). We got this idea from Nils' video on sentence embeddings.
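
In code, this amounts to using a constant temperature in the contrastive loss. The following is a minimal sketch of a CLIP-style symmetric loss with logit_scale fixed to 20, not the exact loss implementation in our repository.

```python
import jax
import jax.numpy as jnp

def clip_loss(image_emb, text_emb, logit_scale=20.0):
    """Symmetric contrastive loss with a fixed (non-learned) temperature."""
    image_emb = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)

    logits = logit_scale * image_emb @ text_emb.T      # (batch, batch)
    labels = jnp.arange(logits.shape[0])               # matching pairs on the diagonal

    def xent(lg):
        # Cross-entropy with integer labels, implemented with log-softmax.
        logp = jax.nn.log_softmax(lg, axis=-1)
        return -jnp.take_along_axis(logp, labels[:, None], axis=-1).mean()

    return (xent(logits) + xent(logits.T)) / 2.0       # image-to-text + text-to-image
```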

Effect of Our Edits

The following picture showcases the effect that these edits have had on our evaluation loss:

effects_edits

The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down. The yellow line is the loss with the new optimizer: it is striking how much time this addition saves. Not only does the loss improve, it also converges significantly faster! The blue line shows the results when fixed scaling is used in addition to the new optimizer. Finally, we added the backbone-freezing strategy, and you can see the results in the light blue curve. Nonetheless, as is common in deep learning, having more data played a big role and was another key element in reducing the loss.

Scientific Validity

We split this section in two: we first provide a quantitative evaluation to ensure that what we are learning is in fact good. We then show some qualitative examples of images found by the model. All the code we have written to run our validation experiments (in combination with code made available by Nils Reimers and by the authors of the original CLIP) is available.

Training Details

Datasets Splits

We tried different combinations of split sizes for training and validation. Eventually, we settled on a 95% training / 5% validation split: each dataset is split into training and validation data, and then the files are concatenated. Note that this 5% amounts to 70K validation samples, making the set almost as big as the MSCOCO dataset.

Hyper-parameters

The hyper-parameters can be found in the repository. We use a maximum sequence length of 95 tokens: we looked at the distribution of caption lengths in the various datasets and found that 95 was an excellent compromise between training speed and data coverage. We use a batch size of 128 and a learning rate of 0.00001.
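
As an illustration of how such a cut-off can be chosen, one can tokenize the captions and inspect the length percentiles; in this sketch, the toy captions list stands in for the real training captions.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")

# Stand-in for the full list of training captions.
captions = ["un gatto su una sedia", "due cavalli marroni in un campo"]

lengths = [len(tokenizer.encode(c)) for c in captions]
# Percentiles show how much of the data a given max_seq_length would cover.
print(np.percentile(lengths, [50, 90, 95, 99]))
```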

Training

We usually train until we see the validation loss going up, and we then pick the model with the best validation loss. We adjusted the number of training epochs as the project progressed: at first we ran 100 epochs, but after we replaced the optimizer we were able to reduce this number.

Quantitative Evaluation

Showing great images is definitely cool and interesting, but a model is nothing without validation. Since this is the first CLIP-based model for Italian, we decided to use the multilingual CLIP model as a comparison baseline.

mCLIP

The multilingual CLIP (henceforth, mCLIP) is a model introduced by Nils Reimers in his sentence-transformers library. mCLIP is based on a multilingual encoder created through multilingual knowledge distillation (see Reimers et al., 2020). It shows great capabilities in representing multilingual text in the same space as the images.

Tasks

We selected two different tasks:

  • image-retrieval, in which given a caption the model finds the most semantically similar image
  • zero-shot classification, in which given an image and a set of captions (or labels), the model finds the best matching caption for the image

Reproducibility

In order to make both experiments very easy to replicate, we share the colab notebooks we used to compute the results.

Image Retrieval

This experiment is run on the MSCOCO-IT validation set (which we did not use during training). Given an input caption from the dataset, we search for the most similar image in the MSCOCO-IT validation set and check whether it is the one originally described by that caption. As evaluation metric we use MRR@K (a sketch of its computation follows the table below).

| MRR | CLIP-Italian | mCLIP |
| --- | --- | --- |
| MRR@1 | 0.3797 | 0.2874 |
| MRR@5 | 0.5039 | 0.3957 |
| MRR@10 | 0.5204 | 0.4129 |

If the table above doesn't show, you can have a look at it here.
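
For reference, MRR@K can be computed as in the following sketch, where sims[i, j] is the similarity between caption i and image j, and image i is assumed to be the correct match for caption i.

```python
import numpy as np

def mrr_at_k(sims: np.ndarray, k: int) -> float:
    """Mean reciprocal rank, counting only matches found within the top K."""
    reciprocal_ranks = []
    for i in range(sims.shape[0]):
        order = np.argsort(-sims[i])                 # images sorted by similarity
        rank = int(np.where(order == i)[0][0]) + 1   # 1-based rank of the correct image
        reciprocal_ranks.append(1.0 / rank if rank <= k else 0.0)
    return float(np.mean(reciprocal_ranks))
```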

It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained on 400 million images (and some of them might have been from MSCOCO).

Zero-shot image classification

This experiment replicates the zero-shot ImageNet classification experiment run by OpenAI. To do this, we used DeepL to automatically translate the ImageNet labels into Italian. No manual engineering of the labels or prompts was done. We evaluate the models by computing the accuracy at different values of K (a sketch of this evaluation follows the table below).

| Accuracy | CLIP-Italian | mCLIP |
| --- | --- | --- |
| Accuracy@1 | 22.11 | 20.15 |
| Accuracy@5 | 43.69 | 36.57 |
| Accuracy@10 | 52.55 | 42.91 |
| Accuracy@100 | 81.08 | 67.11 |

If the table above doesn't show, you can have a look at it here.
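
For reference, the zero-shot evaluation can be sketched as follows: each image is compared against the embeddings of the (translated) labels, and accuracy@K checks whether the true label appears among the top K. Embeddings are assumed to be L2-normalized.

```python
import numpy as np

def accuracy_at_k(image_embs, label_embs, true_labels, k):
    """Fraction of images whose true label index is among the K most similar labels."""
    sims = image_embs @ label_embs.T               # (n_images, n_labels)
    topk = np.argsort(-sims, axis=1)[:, :k]        # indices of the K best labels per image
    hits = [true_labels[i] in topk[i] for i in range(len(true_labels))]
    return float(np.mean(hits))
```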

Discussion

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two tasks we tested. Note, however, that our results are lower than those reported in the original OpenAI paper (see Radford et al., 2021), which was trained and evaluated on English data. Considering that our results are in line with those obtained by mCLIP, we think that the automatically translated image labels most probably had an impact on the final scores.

Qualitative Evaluation

We hereby show some interesting properties of the model: its ability to detect colors, its (partial) counting ability, and its ability to understand more complex queries. You can find more examples in the "Gallery" section of the demo.

To our own surprise, many of the answers the model gives make a lot of sense! Note that, in this case, the model is searching for the right image within a set of 25K images from an Unsplash dataset.

Look at the following - slightly cherry picked - examples:

Colors

Here's "a yellow flower"

yellow flower

And here's "a blue flower"

blue flower

Counting

What about "one cat"?

one cat

And what about "two cats"?

two cats

Complex Queries

Have you ever seen "two brown horses"?

two brown horses

And finally, here's a very nice "cat on a chair"

cat on a chair

Broader Outlook

We believe that this model can be useful for many different applications. From image classification to clustering, a model like our Italian CLIP can support researchers and practitioners in many different tasks, and it can be useful not only in research but also in industry. A very interesting use case comes from e-commerce platforms: these websites mainly deal with text through their query engines, alongside large collections of product images. CLIP Italian could be a killer app in this context, providing a way to search products using both images and text. Moreover, Italy has many collections of photos in digital format that are difficult to categorize efficiently. For example, the Istituto Luce Cinecittà is an Italian government entity that has been collecting photos of Italy since the early 1900s and is part of the largest movie studio in Europe (Cinecittà). A semantic way of finding images in their catalog could be an amazing use case.

Limitations and Bias

Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but in our experiments the model finds it difficult to count beyond three; this is a general limitation common to many models of this type.

There are even more evident issues that we found in our model. Due to the unfiltered nature of our training data, the model is exposed to many biases such as sexism, racism, stereotypes, slurs, and gore that it might replicate without awareness of their hurtful and harmful nature. Indeed, different BERT models - Italian ones included - are prone to creating stereotyped sentences that are hurtful (Nozza et al., 2021). While this is not something we intended, it certainly is something that we share the blame for, since we were not able to avoid it.

Unfortunately, these kinds of issues are common to many machine learning systems (see Abid et al., 2021 for an example of bias in GPT-3). This suggests we need to find better approaches to counteract this problem, which affects our whole society.

Useful Links

References

Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-muslim bias in large language models. arXiv preprint arXiv:2101.05783.

Gwet, K. L. (2008). Computing inter‐rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29-48.

Nozza, D., Bianchi, F., & Hovy, D. (2021, June). HONEST: Measuring hurtful sentence completion in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2398-2406).

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.

Reimers, N., & Gurevych, I. (2020, November). Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).

Scaiella, A., Croce, D., & Basili, R. (2019). Large scale datasets for Image and Video Captioning in Italian. IJCoL. Italian Journal of Computational Linguistics, 5(5-2), 49-60.

Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556-2565).

Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. arXiv preprint arXiv:2103.01913.

Other Notes

This readme has been designed using resources from Flaticon.com

Comments
  • Suggestion: Product Embedding Creation for e-commerce domain.

    Hi, I need a suggestion to build product embeddings for the e-commerce domain. I am working on product embedding creation and the requirement is to create a single embedding for each product using the product's image, description, price, and brand as features. If you can provide some direction, it would help me process and understand things properly.

    Thanks.

    opened by karndeepsingh 11
  • RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter

    I am facing the warmup_ratio and warmup_steps error even though I have mentioned in the CLI parameter.

    !python run_hybrid_clip.py \
        --output_dir ${MODEL_DIR} \
        --text_model_name_or_path="bertin-project/bertin-roberta-base-spanish" \
        --vision_model_name_or_path="openai/clip-vit-base-patch32" \
        --tokenizer_name="bertin-project/bertin-roberta-base-spanish" \
        --train_file="/content/drive/MyDrive/train_dataset.json" \
        --validation_file="/content/drive/MyDrive/valid_dataset.json" \
        --do_train --do_eval \
        --num_train_epochs="40" --max_seq_length 96 \
        --per_device_train_batch_size="64" \
        --per_device_eval_batch_size="64" \
        --learning_rate="5e-5" \
        --warmup_steps "0" \
        --warmup_ratio 0.0 \
        --weight_decay 0.1 \
        --overwrite_output_dir \
        --preprocessing_num_workers 32 \
    
    
    loading weights file https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8a82711445c5200c2b4fd30739df371f5b3ce2d7e316418d58dd290bae1c1cc8.dabcc684421296ebcdafd583a4415c1757ae007787f2d0e17b87482d9b8cf197
    Loading PyTorch weights from /root/.cache/huggingface/transformers/8a82711445c5200c2b4fd30739df371f5b3ce2d7e316418d58dd290bae1c1cc8.dabcc684421296ebcdafd583a4415c1757ae007787f2d0e17b87482d9b8cf197
    PyTorch checkpoint contains 151,277,440 parameters.
    Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing FlaxCLIPModel: {('text_model', 'embeddings', 'position_ids'), ('vision_model', 'embeddings', 'position_ids')}
    - This IS expected if you are initializing FlaxCLIPModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing FlaxCLIPModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    All the weights of FlaxCLIPModel were initialized from the model checkpoint at openai/clip-vit-base-patch32.
    If your task is similar to the task the model of the checkpoint was trained on, you can already use FlaxCLIPModel for predictions without further training.
    text_config_dict is None. Initializing the CLIPTextConfig with default values.
    vision_config_dict is None. initializing the CLIPVisionConfig with default values.
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:490: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
      cpuset_checked))
    2022-06-30 11:27:58.559519: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
    Traceback (most recent call last):
      File "run_hybrid_clip.py", line 975, in <module>
        main()
      File "run_hybrid_clip.py", line 716, in main
        "You have to specify either the warmup_steps or warmup_ratio CLI parameter"
    RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter
    
    help wanted question 
    opened by karndeepsingh 9
  • Length of the text to be embedded

    Hi,

    great work! Just a question: Is there a maximum length for the text we want to embed? I know that CLIP takes a text with a maximum number of 76 tokens.

    Thanks, Enrico

    opened by enrico310786 8
  • ValueError: Unrecognized model identifier: clip_vision_model.

    using instructions at: https://github.com/clip-italian/clip-italian/tree/master/hybrid_clip https://github.com/clip-italian/clip-italian/blob/master/evaluation/CLIP_Image_Retrieval_and_MRR.ipynb

    i get the error ValueError: Unrecognized model identifier: clip_vision_model. when calling code with my trained model

    model = FlaxHybridCLIP.from_pretrained("TheLitttleThings/DiffusionTest1")
    

    or the clip-italian model

    model = FlaxHybridCLIP.from_pretrained("clip-italian/clip-italian")
    

    this happens when loading a model i trained or clip-italian/clip-italian

    full stack

    ValueError                                Traceback (most recent call last)
    [<ipython-input-15-8371e62a0ea2>](https://sr6ftx9ndfs-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220801-060051-RC01_464575252#) in <module>()
         12     TOKENIZER_NAME = "dbmdz/bert-base-italian-xxl-uncased"
         13     tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, cache_dir=None, use_fast=True)
    ---> 14     model = FlaxHybridCLIP.from_pretrained("clip-italian/clip-italian")
         15     def tokenize(texts):
         16         inputs = tokenizer(texts, max_length=96, padding="max_length", return_tensors="np")
    
    4 frames
    [/usr/local/lib/python3.7/dist-packages/transformers/modeling_flax_utils.py](https://sr6ftx9ndfs-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220801-060051-RC01_464575252#) in from_pretrained(cls, pretrained_model_name_or_path, dtype, *model_args, **kwargs)
        622                 _from_auto=from_auto_class,
        623                 _from_pipeline=from_pipeline,
    --> 624                 **kwargs,
        625             )
        626         else:
    
    [/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py](https://sr6ftx9ndfs-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220801-060051-RC01_464575252#) in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
        539             )
        540 
    --> 541         return cls.from_dict(config_dict, **kwargs)
        542 
        543     @classmethod
    
    [/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py](https://sr6ftx9ndfs-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220801-060051-RC01_464575252#) in from_dict(cls, config_dict, **kwargs)
        697         kwargs.pop("_from_pipeline", None)
        698 
    --> 699         config = cls(**config_dict)
        700 
        701         if hasattr(config, "pruned_heads"):
    
    [/content/clip-italian/hybrid_clip/configuration_hybrid_clip.py](https://sr6ftx9ndfs-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220801-060051-RC01_464575252#) in __init__(self, projection_dim, **kwargs)
         77             self.vision_config = AutoConfig.for_model(vision_model_type, **vision_config).vision_config
         78         else:
    ---> 79             self.vision_config = AutoConfig.for_model(vision_model_type, **vision_config)
         80 
         81         self.projection_dim = projection_dim
    
    [/usr/local/lib/python3.7/dist-packages/transformers/models/auto/configuration_auto.py](https://sr6ftx9ndfs-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220801-060051-RC01_464575252#) in for_model(cls, model_type, *args, **kwargs)
        632             return config_class(*args, **kwargs)
        633         raise ValueError(
    --> 634             f"Unrecognized model identifier: {model_type}. Should contain one of {', '.join(CONFIG_MAPPING.keys())}"
        635         )
        636 
    
    ValueError: Unrecognized model identifier: clip_vision_model. Should contain one of albert, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, blenderbot, blenderbot-small, bloom, camembert, canine, clip, codegen, convbert, convnext, ctrl, cvt, data2vec-audio, data2vec-text, data2vec-vision, deberta, deberta-v2, decision_transformer, deit, detr, distilbert, dpr, dpt, electra, encoder-decoder, flaubert, flava, fnet, fsmt, funnel, glpn, gpt2, gpt_neo, gpt_neox, gptj, groupvit, hubert, ibert, imagegpt, layoutlm, layoutlmv2, layoutlmv3, led, levit, longformer, longt5, luke, lxmert, m2m_100, marian, maskformer, mbart, mctct, megatron-bert, mobilebert, mobilevit, mpnet, mt5, mvp, nezha, nystromformer, openai-gpt, opt, owlvit, pegasus, perceiver, plbart, poolformer, prophetnet, qdqbert, rag, realm, reformer, regnet, rembert, resnet, retribert, roberta, roformer, segformer, sew, sew-d, speech-encoder-decoder, speech_to_text, speech_to_text_2, splinter, squeezebert, swin, t5, tapas, trajectory_transformer, transfo-xl, trocr, unispeech, unispeech-sat, van, vilt, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_mae, wav2vec2, wav2vec2-conformer, wavlm, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, yolos, yoso
    
    opened by antitheos 5
  • Training Script not Generating Tokenizer Files

    opened by antitheos 4
  • How to pass test images for sentence retrieval?

    Hi there, I am trying to use the colab notebook for text retrieval using image. I want to pass my own set of images and text to the model. How can i do that? Thanks in advance

    opened by architlatkar27 4
  • ERROR: "Missing XLA Configuration" while running the script?

    Hi, I was trying to train the clip model on the images and text and used run_hybrid_clip.py script to train but got following error. Please help me to remove the following error. I am trying to train it on GPU device, it seems that error is due to torch_xla which is on TPU. Please help me to train it on GPU.

    !python run_hybrid_clip.py \
        --output_dir ${MODEL_DIR} \
        --text_model_name_or_path="/home/jupyter/HUSE/sentence_similarity_spanish_es" \
        --vision_model_name_or_path="openai/clip-vit-base-patch32" \
        --tokenizer_name="sentence_similarity_spanish_es" \
        --train_file="./train_dataset.json" \
        --validation_file="./valid_dataset.json" \
        --do_train --do_eval \
        --num_train_epochs="40" --max_seq_length 512 \
        --per_device_train_batch_size="64" \
        --per_device_eval_batch_size="64" \
        --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
        --overwrite_output_dir \
        --preprocessing_num_workers 32 \
    
    
    comet_ml is installed but `COMET_API_KEY` is not set.
    Traceback (most recent call last):
      File "run_hybrid_clip.py", line 832, in <module>
        main()
      File "run_hybrid_clip.py", line 472, in main
        ) = parser.parse_args_into_dataclasses()
      File "/opt/conda/lib/python3.7/site-packages/transformers/hf_argparser.py", line 214, in parse_args_into_dataclasses
        obj = dtype(**inputs)
      File "<string>", line 101, in __init__
      File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 1066, in __post_init__
        and (self.device.type != "cuda")
      File "/opt/conda/lib/python3.7/site-packages/transformers/utils/import_utils.py", line 829, in wrapper
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 1357, in device
        return self._setup_devices
      File "/opt/conda/lib/python3.7/site-packages/transformers/utils/generic.py", line 49, in __get__
        cached = self.fget(obj)
      File "/opt/conda/lib/python3.7/site-packages/transformers/utils/import_utils.py", line 829, in wrapper
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 1299, in _setup_devices
        device = xm.xla_device()
      File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 232, in xla_device
        devkind=devkind if devkind is not None else None)
      File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
        xla_devices = _DEVICES.value
      File "/opt/conda/lib/python3.7/site-packages/torch_xla/utils/utils.py", line 32, in value
        self._value = self._gen_fn()
      File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 19, in <lambda>
        _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
    RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:273 : Missing XLA configuration
    
    
    opened by karndeepsingh 3
  • exporting finetuned model to onnx?

    Hi @vinid @g8a9 , thanks for sharing such a great explanation. I was wondering if there's any way to export this finetuned model into onnx? Would really appreciate your effort.

    opened by RaiAmanRai 1
  • Differences from Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

    Awesome Work! I have read your blog, but the translation part confused me a little bit. The paper "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation", adopts distillation to enforce the translated textual embedding for the produced by "student model" to be same with the embedding produced by "teacher model" for source English texts. As you did not introduce new (image text) pair, but only translations of original English text, may be this work will end up the same place as the beforementioned method?

    opened by ZiboZ 1
  • Can't include model as per HuggingFace instructions

    Hello, just trying to follow instructions posted on HuggingFace:

    from transformers import AutoTokenizer, HybridCLIP
    tokenizer = AutoTokenizer.from_pretrained("clip-italian/clip-italian")
    model = HybridCLIP.from_pretrained("clip-italian/clip-italian")

    on a fresh installation of transformers library.

    Here the result:


    ImportError                               Traceback (most recent call last)
    in
    ----> 1 from transformers import AutoTokenizer, HybridCLIP
          2 tokenizer = AutoTokenizer.from_pretrained("clip-italian/clip-italian")
          3 model = HybridCLIP.from_pretrained("clip-italian/clip-italian")

    ImportError: cannot import name 'HybridCLIP'

    opened by amessina71 1
  • All the captions + New training setting + Fixed scaling factor

    This adds the support to random sample captions, the new optimizer and overall training setting, and the use of a fixed scaling factor (C=20) for cosine similarities.

    opened by g8a9 0