ruCLIP-SB
ruCLIP-SB (Russian Contrastive Language–Image Pretraining SWIN-BERT) is a multimodal model for computing similarities between images and texts and for ranking captions and pictures. Unlike other versions of the model, it uses BERT as the text encoder and a Swin Transformer as the image encoder.
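To illustrate what computing similarities and ranking captions means for a CLIP-style model, here is a minimal sketch. It is not the ruCLIP-SB API: the function names `cosine_similarity_matrix` and `rank_captions` are hypothetical, and the toy arrays stand in for embeddings that the image and text encoders would produce.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Return an (n_images, n_texts) matrix of cosine similarities."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    # L2-normalize each row so the dot product equals the cosine similarity.
    image_norm = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_norm = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_norm @ text_norm.T

def rank_captions(image_emb, text_emb):
    """For each image, return caption indices sorted from most to least similar."""
    sims = cosine_similarity_matrix(image_emb, text_emb)
    return np.argsort(-sims, axis=1)

# Toy embeddings standing in for encoder outputs (made-up values, assumption).
image_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
text_emb = np.array([[0.0, 1.0], [3.0, 0.1], [1.0, 1.0]])

ranking = rank_captions(image_emb, text_emb)
print(ranking)
```

Ranking pictures for a given caption works the same way with the roles swapped, that is, by sorting along the other axis of the similarity matrix.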
Our model achieves 37.02% zero-shot accuracy on CIFAR100 and has 39,543,907 parameters.
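For readers unfamiliar with the metric, zero-shot accuracy is typically measured by embedding one text prompt per class, assigning each image to the class whose prompt embedding is most similar, and counting the fraction of correct assignments. The sketch below shows that calculation under stated assumptions: `zero_shot_accuracy` is a hypothetical helper, the embeddings are made-up toy values, and this is not the evaluation script used to obtain the reported number.

```python
import numpy as np

def zero_shot_accuracy(image_emb, class_emb, labels):
    """Fraction of images whose most similar class embedding matches the label."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    class_emb = np.asarray(class_emb, dtype=np.float64)
    # Normalize rows so that the dot product is the cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    class_emb = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    # Predict the class whose prompt embedding is closest to each image.
    predictions = np.argmax(image_emb @ class_emb.T, axis=1)
    return float(np.mean(predictions == np.asarray(labels)))

# Toy data: four image embeddings, two class-prompt embeddings (assumption).
image_emb = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.8], [0.1, 0.7]])
class_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = [0, 1, 1, 1]

accuracy = zero_shot_accuracy(image_emb, class_emb, labels)
print(accuracy)
```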
Download URL:
Example usage:
Finetuning:
ONNX example:
We trained the model on 2 million images.