Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Sense-GVT

Last update: Dec 30, 2022

Overview

DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Our paper is available on arXiv.

Updates

**Our code, dataset and models will be released soon.**

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the potential of the data through (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from this intrinsic supervision, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and the compute also works well in our framework.
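The single image-text contrastive supervision that the paragraph above takes as its starting point can be sketched as a symmetric InfoNCE loss. This is a minimal pure-Python illustration, not the authors' implementation: the function name info_nce and the default temperature of 0.07 are assumptions made here, and the sketch covers only the baseline image-text term, not the three additional DeCLIP supervision signals.

```python
import math


def info_nce(similarity, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix.

    similarity[i][j] is the similarity between image i and text j;
    matching pairs sit on the diagonal. The temperature default is an
    assumption for illustration only.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            row_max = max(row)  # subtract the max for numerical stability
            log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
            total += log_z - row[i]
        return total / n

    image_to_text = cross_entropy_on_diagonal(logits)
    # Transposing the matrix gives the text-to-image direction.
    text_to_image = cross_entropy_on_diagonal([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)
```

The loss is small when each image is most similar to its own caption and grows when the matching pairs are not the highest-scoring entries in their row and column.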

Model

Our pre-trained visual backbone models (without text encoder)

DeCLIP_r50 Google Drive
DeCLIP_vitb32 Google Drive

Citing DeCLIP

@misc{li2021supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm}, 
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      year={2021},
      eprint={2110.05208},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Comments

Fused AdamW_SGD optimizer issues

Hi, authors! Thanks for your awesome work! I'm confused about the usage of the fused AdamW_SGD optimizer described in Appendix C of the paper, in the implementation details paragraph. It says you use AdamW with a learning rate of 1e-3 and weight decay of 0.05 for the ViT vision encoder, and SGD with a learning rate of 0.02 and weight decay of 1e-4 for the text transformer. However, in your configuration, ViT-B/32 is also optimized by SGD instead of the fused AdamW_SGD. So which optimizer did you actually use in the experiments? And if you did use the fused AdamW_SGD optimizer as the paper says, why? CLIP uses only the AdamW optimizer. Is this beneficial to CLIP? Looking forward to your reply!😁

opened by vealocia 4

About the BPE file

Hi~ @zlccccc @SlotherCui I notice that there is no BPE file here. In your token embedding weight, the shape is [49409, 512], but the shape in CLIP is [49408, 512]. Is your BPE file consistent with CLIP's? If I missed something, please comment~ Thanks a lot!

opened by kugwzk 2

worked (simple) example of loading model and transforms?

Thank you for this exciting repository. Can you provide a simple example of how I might be able to load the models you provide in your model zoo?

Something along the lines of what is provided by the timm (pytorch-image-models) model repository:

    import timm
    from timm.data import resolve_data_config
    from timm.data.transforms_factory import create_transform

    model_name = 'ghostnet_100'
    model = timm.create_model(model_name, pretrained=True)
    model.eval()

    # Pass the model object (not its name) so its default config is used.
    config = resolve_data_config({}, model=model)
    transform = create_transform(**config)

Ideally, this would allow us to use the models in a jupyter notebook or other interactive context.

Thanks in advance!
opened by ColinConwell 1

KeyError: 'SLURM_PROCID'

I used the following command to run zero-shot evaluation:

    python -m prototype.solver.clip_solver --config ./experiments/declip_experiments/declip88m/declip88m_r50_declip/config.yaml --evaluate

It then reports this error:

    import FusedFP16SGD failed, FusedFP16AdamW replace slurm
    Traceback (most recent call last):
      File "/opt/conda/envs/openmmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/opt/conda/envs/openmmlab/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/solver/clip_solver.py", line 769, in <module>
        main()
      File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/utils/dist.py", line 11, in wrapper
        dist_init()
      File "/apdcephfs/share_1227775/mingzhenzhu/DeCLIP/prototype/utils/dist.py", line 21, in dist_init
        proc_id = int(os.environ['SLURM_PROCID'])
      File "/opt/conda/envs/openmmlab/lib/python3.7/os.py", line 681, in __getitem__
        raise KeyError(key) from None
    KeyError: 'SLURM_PROCID'

How can I fix it? Thanks!

opened by mZhenz 0
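The traceback in the issue above shows dist_init reading os.environ['SLURM_PROCID'] unconditionally, which raises KeyError when the script is launched outside a SLURM job. The following is a minimal sketch of one possible workaround, not the repository's own fix. The helper name resolve_rank_and_world_size is hypothetical; SLURM_PROCID and SLURM_NTASKS are set by SLURM, while RANK and WORLD_SIZE are the variables set by launchers such as torchrun. Falling back to a single process (rank 0, world size 1) is an assumption made here.

```python
import os


def resolve_rank_and_world_size(env=None):
    """Return (rank, world_size) from SLURM variables, else from generic ones.

    Hypothetical helper for illustration; it does not initialize any
    distributed backend, it only reads environment variables.
    """
    env = os.environ if env is None else env
    if "SLURM_PROCID" in env:
        # Launched by SLURM (for example via srun).
        return int(env["SLURM_PROCID"]), int(env.get("SLURM_NTASKS", 1))
    # Not under SLURM: use launcher-provided values, or a single process.
    return int(env.get("RANK", 0)), int(env.get("WORLD_SIZE", 1))
```

A call site could use these values in place of the direct os.environ lookup, so that a plain python invocation runs as a single process instead of failing.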
Performance of Declip-88M checkpoint

Hi, I want to reproduce the zero-shot result of DeCLIP-88M with ResNet50 on ImageNet-1K (reported as 62.5 in the table). However, the evaluation result I got is 7.264, which is far too low, while the result for ViT-B32 is correct. I also found a problem when loading the ResNet50 checkpoint:

size mismatch for module.logit_scale: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([1]).

I didn't change any code of the model.

Another question: why does the run.sh of declip-88m-resnet50 use clip_solver while the other run.sh files use declip_solver? I used declip_solver to run the evaluation for DeCLIP-88M-ResNet50 by replacing the yaml file. The following figure shows the results reproduced on my own compute resources:

Do you have any ideas? Thanks!

opened by Hcyang-NULL 4
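The size mismatch reported in the issue above concerns a scalar parameter stored with shape [] in the checkpoint and shape [1] in the model. The following is a minimal sketch of one way to reconcile such entries before loading, under the assumption that the two shapes hold the same number of elements. The helper name reshape_mismatched_params is hypothetical, and whether this mismatch explains the low accuracy is not established here. The function only relies on the shape attribute and reshape method, which PyTorch tensors provide.

```python
import math


def reshape_mismatched_params(checkpoint_state, model_state):
    """Return a copy of checkpoint_state with same-size entries reshaped.

    An entry is reshaped only when the model has a parameter of the same
    name whose shape differs but whose element count is identical
    (for example [] versus [1]). Everything else is passed through as-is.
    """
    fixed = {}
    for name, tensor in checkpoint_state.items():
        target = model_state.get(name)
        if target is not None:
            src_shape, dst_shape = tuple(tensor.shape), tuple(target.shape)
            same_count = math.prod(src_shape) == math.prod(dst_shape)
            if src_shape != dst_shape and same_count:
                tensor = tensor.reshape(dst_shape)
        fixed[name] = tensor
    return fixed
```

With PyTorch, the result could be passed to load_state_dict in place of the raw checkpoint dictionary, using model.state_dict() as the second argument.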
module 'nvidia.dali.ops' has no attribute 'McReader'

https://github.com/Sense-GVT/DeCLIP/blob/e47a5ff99ddbd635b2b7b4a7c6490e1d9e03821d/prototype/data/pipelines/imagenet_pipeline_v2.py#L42

I am using nvidia-dali-cuda110 version 1.14.0 and get the error: module 'nvidia.dali.ops' has no attribute 'McReader'

The requirements specify nvidia-dali 0.14, but there is no nvidia-dali=0.14 available.

opened by PanXiebit 1

Filter YFCC data

Hi, thanks for the great work. After downloading the provided YFCC15M label file, I can see that each label has three keys: caption, filename, and url. How should we find the corresponding YFCC image for a given label? That is, which key should we use to align with the YFCC data?

opened by Hxyou 3
