Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm


DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Our paper is available on arXiv.

Updates

**Our code, dataset, and models will be released soon.**

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision signals, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1Ɨ less data. Our DeCLIP-ResNet50 also outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and the compute also works well in our framework.
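
As a rough illustration of the extra supervision signals above (this is not the released implementation; the function names, the 0.07 temperature, and the text-feature queue are assumptions), the image-text and nearest-neighbor contrastive terms could be sketched in PyTorch as follows. The full method additionally adds multi-view terms from augmented images/texts and per-modality self-supervised losses.

    import torch
    import torch.nn.functional as F

    def info_nce(img_emb, txt_emb, temperature=0.07):
        # Symmetric image-text contrastive loss, as in CLIP.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarity matrix
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def nearest_neighbor_loss(img_emb, txt_emb, txt_queue, temperature=0.07):
        # Nearest-neighbor supervision (sketch): treat the closest text embedding
        # from a queue of past text features as an extra positive for each image.
        txt_emb = F.normalize(txt_emb, dim=-1)
        txt_queue = F.normalize(txt_queue, dim=-1)
        nn_idx = (txt_emb @ txt_queue.t()).argmax(dim=-1)         # (B,) nearest-neighbor indices
        return info_nce(img_emb, txt_queue[nn_idx], temperature)

    # Toy usage with random features (batch of 8, embedding dim 512, queue of 1024).
    img, txt, queue = torch.randn(8, 512), torch.randn(8, 512), torch.randn(1024, 512)
    loss = info_nce(img, txt) + nearest_neighbor_loss(img, txt, queue)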

(Main figure: overview of the DeCLIP pre-training paradigm.)

Model

Our pre-trained visual backbone models (without the text encoder); see the loading sketch after the links:

DeCLIP_r50: Google Drive.
DeCLIP_vitb32: Google Drive.
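
A minimal, unofficial sketch of loading such a visual-backbone checkpoint into a torchvision ResNet-50 is shown below; the checkpoint filename, the "state_dict" nesting, and the "module." prefix handling are assumptions and may not match the released files (DeCLIP's ResNet-50 is CLIP-style, so some keys may not map onto the torchvision model).

    import torch
    from torchvision.models import resnet50

    # Hypothetical filename; the real checkpoint name and layout may differ.
    ckpt = torch.load("DeCLIP_r50.pth.tar", map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)                     # unwrap if nested
    # Strip a possible DistributedDataParallel "module." prefix.
    state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}

    model = resnet50()
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", len(missing), "unexpected keys:", len(unexpected))
    model.eval()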

Citing DeCLIP

@misc{li2021supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm}, 
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      year={2021},
      eprint={2110.05208},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Comments
  • Fused AdamW_SGD optimizer issues

    Hi, authors! Thanks for your awesome work! I'm confused about the use of the fused AdamW_SGD optimizer described in the paper's Appendix C (implementation details). It says you use AdamW with lr 1e-3 and weight decay 0.05 for the ViT vision encoder, and SGD with lr 0.02 and weight decay 1e-4 for the text transformer. However, in your configuration, ViT-B/32 is also optimized by SGD instead of the fused AdamW_SGD. So which optimizer did you actually use in your experiments? And if you did use the fused AdamW_SGD optimizer as stated in the paper, why? CLIP only uses the AdamW optimizer; is this change beneficial compared to CLIP? Looking forward to your reply! šŸ˜

    opened by vealocia 4
  • About the BPE file

    Hi~ @zlccccc @SlotherCui I notice that there isn't a BPE file here. In your token embedding weight, the shape is [49409, 512], but the shape in CLIP is [49408, 512]. Is your BPE file consistent with CLIP's? If I missed something, please comment~ Thanks a lot!

    opened by kugwzk 2
  • worked (simple) example of loading model and transforms?

    Thank you for this exciting repository. Can you provide a simple example of how I might be able to load the models you provide in your model zoo?

    Something along the lines of what is provided by the timm (pytorch-image-models) model repository:

    import timm
    from timm.data import resolve_data_config
    from timm.data.transforms_factory import create_transform

    # Create a pretrained model by name.
    model_name = 'ghostnet_100'
    model = timm.create_model(model_name, pretrained=True)
    model.eval()

    # Resolve the model's data config and build the matching eval transform.
    config = resolve_data_config({}, model=model)
    transform = create_transform(**config)

    Ideally, this would allow us to use the models in a Jupyter notebook or other interactive context.

    Thanks in advance!

    opened by ColinConwell 1
  • KeyError: 'SLURM_PROCID'

    I used the following command to run zero-shot evaluation:

        python -m prototype.solver.clip_solver --config ./experiments/declip_experiments/declip88m/declip88m_r50_declip/config.yaml --evaluate

    and it reports this error:

        import FusedFP16SGD failed, FusedFP16AdamW replace slurm
        Traceback (most recent call last):
          File "/opt/conda/envs/openmmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
            "__main__", mod_spec)
          File "/opt/conda/envs/openmmlab/lib/python3.7/runpy.py", line 85, in _run_code
            exec(code, run_globals)
          File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/solver/clip_solver.py", line 769, in <module>
            main()
          File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/utils/dist.py", line 11, in wrapper
            dist_init()
          File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/utils/dist.py", line 21, in dist_init
            proc_id = int(os.environ['SLURM_PROCID'])
          File "/opt/conda/envs/openmmlab/lib/python3.7/os.py", line 681, in __getitem__
            raise KeyError(key) from None
        KeyError: 'SLURM_PROCID'

    How can I fix it? Thanks!

    opened by mZhenz 0
  • Performance of Declip-88M checkpoint

    Hi, I want to reproduce the zero-shot result of DeCLIP-88M with ResNet-50 on ImageNet-1K (62.5 in the table), but the evaluation result I get is 7.264, which is far too low. The ViT-B/32 result is correct, though. I also found a problem while loading the ResNet-50 checkpoint:

    size mismatch for module.logit_scale: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([1]).

    I didn't change any code of the model.

    Another question: why does the run.sh of declip-88m-resnet50 use clip_solver while the other run.sh files use declip_solver? I used declip_solver to evaluate DeCLIP-88M-ResNet50 by swapping in the corresponding yaml file. The results reproduced on my own compute resources are shown in the attached screenshot.

    Do you have any ideas? Thanks!

    opened by Hcyang-NULL 4
  • module 'nvidia.dali.ops' has no attribute 'McReader'

    https://github.com/Sense-GVT/DeCLIP/blob/e47a5ff99ddbd635b2b7b4a7c6490e1d9e03821d/prototype/data/pipelines/imagenet_pipeline_v2.py#L42

    I am using nvidia-dali-cuda110 version 1.14.0 and get the error: module 'nvidia.dali.ops' has no attribute 'McReader'.

    The requirements ask for nvidia-dali 0.14, but no nvidia-dali==0.14 package is available.

    opened by PanXiebit 1
  • Filter YFCC data

    Hi, thanks for the great work. After downloading the provided YFCC15M label file, I see that each label has three keys: caption, filename, and url. How should we find the corresponding YFCC image for each label, i.e., which key should we use to align with the YFCC data?

    opened by Hxyou 3
Owner
Sense-GVT
Related projects

  • CLIP-Indonesian: CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data (Galuh, 17 stars, Mar 10, 2022)
  • TREBA: learning trajectory representations using self-supervision and programmatic supervision (58 stars, Jan 6, 2023)
  • OpenSelfSup: mixup for supervised, semi- and self-supervised learning toolbox and benchmark (AI Lab, Westlake University, 332 stars, Jan 3, 2023)
  • Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery, ICCV 2021 (Zhuo Zheng, 125 stars, Dec 13, 2022)
  • PlaidML: a framework for making deep learning work everywhere (PlaidML, 4.5k stars, Jan 2, 2023)
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, PyTorch code (Salesforce, 1.3k stars, Dec 31, 2022)
  • SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training, official PyTorch implementation (Gowthami Somepalli, 284 stars, Dec 21, 2022)
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training (Multimedia Computing Group, Nanjing University, 697 stars, Jan 7, 2023)
  • ActionCLIP: A New Paradigm for Video Action Recognition, official PyTorch implementation (268 stars, Jan 9, 2023)
  • ActionCLIP: A New Paradigm for Video Action Recognition, official PyTorch implementation (32 stars, Sep 25, 2021)
  • Propose-Reduce VIS: Video Instance Segmentation with a Propose-Reduce Paradigm, ICCV 2021 (DV Lab, 39 stars, Nov 23, 2022)
  • Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm (3 stars, Dec 5, 2022)
  • ByteTrack_ReID: ByteTrack with a ReID module following the FairMOT paradigm; tracking strategy borrowed from FairMOT/JDE (Han GuangXin, 46 stars, Dec 29, 2022)
  • VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning (Hao Tan, 74 stars, Dec 3, 2022)
  • CCOP: Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning (Chenhongyi Yang, 21 stars, Dec 13, 2022)
  • CLIP: Connecting Text and Image (Learning Transferable Visual Models From Natural Language Supervision) (Myeongjun Kim, 52 stars, Jan 7, 2023)
  • ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning, ACL 2021 (THUNLP, 75 stars, Nov 2, 2022)
  • Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly (VITA, 77 stars, Oct 5, 2022)