KoCLIP: Korean port of OpenAI CLIP, in Flax

Jake Tae

Last update: Jan 2, 2023

Overview

KoCLIP

This repository contains code for KoCLIP, a Korean port of OpenAI's CLIP. This project was conducted as part of Hugging Face's Flax/JAX community week co-organized with Google's Flax, JAX, and Cloud teams (announcement).

Demo

Check out our Streamlit app here. The demo illustrates three potential use cases of KoCLIP on different downstream tasks:

Image to Text: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided.
Text to Image: This is essentially an image retrieval task. Given a text, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the given text.
Text to Patch: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
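
The retrieval and patch-ranking ideas above can be sketched with plain NumPy. This is only an illustration, not the code behind the Streamlit demo: the function names `retrieve_top_k` and `split_into_patches`, the embedding size, and the random toy data are all assumptions made for the example. In practice the embeddings would come from the KoCLIP text and image encoders.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_top_k(text_emb: np.ndarray, image_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k stored image embeddings most similar to the text."""
    scores = l2_normalize(image_embs) @ l2_normalize(text_emb)
    return np.argsort(-scores)[:k]


def split_into_patches(image: np.ndarray, rows: int, cols: int) -> list:
    """Partition an (H, W, C) image into rows * cols patches, row by row."""
    h, w = image.shape[0] // rows, image.shape[1] // cols
    return [
        image[r * h:(r + 1) * h, c * w:(c + 1) * w]
        for r in range(rows)
        for c in range(cols)
    ]


# Toy data standing in for real model outputs (hypothetical values).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))                    # pre-computed "database"
text_emb = image_embs[2] + 0.01 * rng.normal(size=8)    # query close to image 2
top = retrieve_top_k(text_emb, image_embs, k=2)

# Each patch could then be embedded and ranked against the text like any image.
patches = split_into_patches(np.zeros((224, 224, 3)), rows=2, cols=2)
```

For Text to Patch, each patch would be passed through the image encoder and scored against the text embedding in the same way as the retrieval step.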

Quickstart

To follow along with the code snippets below, we recommend that you refer to the Colab notebook.

Import dependencies and initialize a KoCLIP model along with its processor.

import requests
import jax
from PIL import Image

from koclip import load_koclip

model, processor = load_koclip("koclip-base")

Prepare image and text captions.

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["소파 위에 고양이", "강아지와 강아지 주인", "쳇바퀴를 달리는 햄스터", "자동차"]
image

Run inference.

inputs = processor(
    text=text,
    images=image, 
    return_tensors="jax", # could also be "pt" 
    padding=True
)

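outputs = model(**inputs)
# logits_per_image has shape (num_images, num_texts); softmax over the text axis
probs = jax.nn.softmax(outputs.logits_per_image, axis=1)

# Rank the captions for the single input image, most likely first
for idx, prob in sorted(enumerate(probs[0]), key=lambda x: x[1], reverse=True):
    print(text[idx], prob)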

Models

We trained a total of two models, koclip-base and koclip-large. Both models use RoBERTa-large. The decision to use a somewhat large language model was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to a good multimodal pipeline given limited data.

KoCLIP	LM	ViT
`koclip-base`	`klue/roberta-large`	`openai/clip-vit-base-patch32`
`koclip-large`	`klue/roberta-large`	`google/vit-large-patch16-224`

Training

KoCLIP was fine-tuned using 82,783 images from the MSCOCO 2014 image captioning dataset. Korean translations of image captions were obtained from AI Hub, an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.

KoCLIP was trained on a TPU v3-8 VM. Both text and image encoder backbones were loaded from their pretrained checkpoints. KoCLIP was trained to maximize the similarity score between matching pairs of images and captions.
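
To make the training objective concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. The text above only says that similarity between matching pairs is maximized, so the exact loss form, the temperature value of 0.07, and the function name `contrastive_loss` are assumptions for illustration and may differ from the actual training script.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric cross-entropy where row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)


# Toy embeddings (hypothetical): identical inputs simulate perfectly matched pairs.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 16))
aligned = contrastive_loss(images, images)           # captions match their images
shuffled = contrastive_loss(images, images[::-1])    # captions paired with wrong images
```

The loss is low when each image is most similar to its own caption and high when pairs are mismatched, which is the behavior the training description implies.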

Findings

In this section, we detail some interesting findings we made throughout the project.

Prompting

We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting the query into a template such as

이것은 {{}} 이다.

noticeably helped the model produce more reliable results. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.
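
As a rough sketch of how such a template might be applied before calling the processor, the snippet below wraps bare labels in the sentence pattern. The helper name `apply_prompt` is hypothetical, and the single-brace placeholder assumes that the doubled braces in the template above are an escaped format field.

```python
# Assumed single-placeholder form of the template quoted above.
TEMPLATE = "이것은 {} 이다."


def apply_prompt(labels: list, template: str = TEMPLATE) -> list:
    """Wrap each bare class label in the sentence template."""
    return [template.format(label) for label in labels]


labels = ["고양이", "강아지", "자동차"]
prompts = apply_prompt(labels)
print(prompts)
```

The resulting list could then be passed as the `text` argument in the Quickstart example in place of the raw labels.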

Multilinguality

Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog", "car"). This could be due to one of two reasons, or a combination thereof:

ViT Pretraining: The ViT backbone for koclip-base, openai/clip-vit-base-patch32, was already pretrained on an English dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is that koclip-large also demonstrates similar multilingual behavior.
LM Knowledge Bleed: klue/roberta-large was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both koclip-base and koclip-large. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."

At the end of the day, we still found it intriguing that a model that was fine-tuned exclusively on Korean managed to produce semantic embeddings from English queries that work well with ViT.

Team

Acknowledgement

The FlaxHybridCLIP model was adapted from the Hugging Face transformers repository, under jax-projects. We also express gratitude to the teams at Google for generously offering TPU VMs for this project. Last but not least, we thank the KLUE team for making pretrained Korean RoBERTa-large weights publicly available.

References

@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{lin2015microsoft,
      title={Microsoft COCO: Common Objects in Context}, 
      author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár},
      year={2015},
      eprint={1405.0312},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{srinivasan2021wit,
      title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning}, 
      author={Krishna Srinivasan and Karthik Raman and Jiecao Chen and Michael Bendersky and Marc Najork},
      year={2021},
      eprint={2103.01913},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Comments

TypeError: __init__() got an unexpected keyword argument '_do_init'

Running this in a Colab environment produces the error above. Could you check what is causing it?

TypeError                                 Traceback (most recent call last)
[<ipython-input-8-20ea54d41cb6>](https://localhost:8080/#) in <module>
      5 from koclip import load_koclip
      6 
----> 7 model, processor = load_koclip("koclip-base")
      8 

2 frames
[/content/koclip/koclip/model.py](https://localhost:8080/#) in __init__(self, config, input_shape, seed, dtype, **kwargs)
    159             )
    160 
--> 161         module = self.module_class(config=config, dtype=dtype, **kwargs)
    162         super().__init__(
    163             config, module, input_shape=input_shape, seed=seed, dtype=dtype

TypeError: __init__() got an unexpected keyword argument '_do_init'

opened by seoujn 3

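Last cell of inference.ipynb

Hello. Thank you for making such a great model.

The last cell of inference.ipynb in this repository appears to need an object named "text".

Adding it as the very first line of the last cell, as shown below, should fix this.

Current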

inputs = processor(
    text=["소파 위에 고양이", "강아지와 강아지 주인", "쳇바퀴를 달리는 햄스터", "자동차"],
    images=image, 
    return_tensors="jax", # could also be "pt" 
    padding=True
)

...(omitted)...

Revised (proposed)

text = ["소파 위에 고양이", "강아지와 강아지 주인", "쳇바퀴를 달리는 햄스터", "자동차"]
inputs = processor(
    text=text,
    images=image, 
    return_tensors="jax", # could also be "pt" 
    padding=True
)

...(omitted)...

opened by ByungSun12 1

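Added an argument that allows training to continue from a pretrained checkpoint.

Hello. I have been trying out various things with koclip.

I wanted to train further starting from the existing pretrained model, but the current koclip code has no argument for that. I added one by referring to another CLIP project and tried training with it, and it seems to work well, so I worked up the courage to open this PR.

Following the clip-italian version, I made it so that passing the --run_from_checkpoint argument resumes training from a previously trained checkpoint.

I know this comes out of the blue, but I would appreciate it if you could take a look. Thank you.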

opened by adventure2165 0
