Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

Overview

This is the official code release for the paper 'Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling'.

Introduction

Tip-Adapter provides faster convergence and better performance than CLIP-Adapter by initializing the adapter with a cache model built from the few-shot training set, instead of learning the adapter weights from scratch.
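
A minimal sketch of this idea (assumed variable names, not the repo's exact code): the cache keys are the L2-normalized CLIP features of the few-shot training images, the cache values are their one-hot labels, and the cache prediction is added as a residual on top of the zero-shot CLIP logits.

import torch

def tip_adapter_logits(test_feat, clip_weights, cache_keys, cache_values, alpha=1.0, beta=5.5):
    # test_feat:     [N, D] L2-normalized CLIP image features of the test images
    # clip_weights:  [D, C] L2-normalized CLIP text embeddings of the class prompts
    # cache_keys:    [D, M] features of the M few-shot training images (cache "keys")
    # cache_values:  [M, C] one-hot labels of those images (cache "values")
    clip_logits = 100.0 * test_feat @ clip_weights               # zero-shot CLIP prediction
    affinity = test_feat @ cache_keys                            # similarity to the cached few-shot features
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    return clip_logits + alpha * cache_logits                    # training-free residual blend

Because the keys and values are built directly from the few-shot set, no gradient step is required; fine-tuning the cached keys from this initialization (Tip-Adapter-F in the paper) is what yields the faster convergence compared to CLIP-Adapter.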

Implementation

Put tip_adapter_ImageNet.py into the CLIP codebase's folder and run:

python tip_adapter_ImageNet.py

This reproduces 65.51% accuracy on the ImageNet validation set.
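
For reference, a minimal sketch of the CLIP feature extraction the script builds on, assuming OpenAI's clip package (the backbone name and image path here are placeholders, not necessarily the settings behind the 65.51% result):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)            # placeholder backbone
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    feat = model.encode_image(image)
    feat = feat / feat.norm(dim=-1, keepdim=True)                # L2-normalize, as in CLIP zero-shot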

This repo will be completed in a few days.

Contributors

Peng Gao, Renrui Zhang

Acknowledgment

CLIP, CoOp and CLIP-Adapter

Comments
  • Are CLIP/TIP-Adapter only designed for the few-shot setting?

    Sorry, I've got another question. I did not find experiments under the base-to-new / domain generalization setting or the cross-dataset transfer setting, which are conducted in CoCoOp. Are CLIP/TIP-Adapter only designed for the few-shot setting? I wonder how the generalization abilities are. Could you give me any intuition?

    opened by machengcheng2016 4
  • Details of data augmentation

    In the paper, it says "the CLIP-style pre-processing resizes the cropped image’s short side to 224 while keeping its original aspect", and you said that you use the CLIP-style RandomResizedCrop.

    However, I found that the standard RandomResizedCrop is used in the code.

    I wonder whether this setting is important to the final performance, or whether I have misunderstood something here?
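
    For comparison, a sketch of the two pre-processing pipelines being discussed, written with standard torchvision transforms (the exact parameters are assumptions, not the repo's settings; CLIP's normalization is omitted for brevity):

    from torchvision import transforms

    # "CLIP-style": resize the short side to 224 while keeping the aspect ratio, then crop a 224x224 patch.
    clip_style = transforms.Compose([
        transforms.Resize(224),
        transforms.RandomCrop(224),
        transforms.ToTensor(),
    ])

    # Standard RandomResizedCrop: take a random-scale, random-aspect crop and resize it to 224x224.
    standard = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ])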

    opened by SY-Xuan 3
  • replicate your results on food101 dataset

    Would you consider providing a script to replicate your results on the Food101 dataset? If someone were to adapt your ImageNet script, do you have suggestions on what to make sure to adjust?

    opened by yinyinl 3
  • Adaptor used in vision encoder or text encoder?

    Hey, thanks for the nice work. I have some confusion as follows. First, why is the adapter used only in the vision encoder? Did the authors try using the adapter in the text encoder? Second, I don't understand why using an adapter performs better than using learnable prompts. In addition, the "adapter" used in this paper is different from the adapter in NLP tasks, and the position of the insertion is also different; which one is better?

    opened by jingzhengli 2
  • "Tip-Adapter/main.py" use test features to eval

    It seems odd to use test features for evaluation; see https://github.com/gaopengcuhk/Tip-Adapter/blob/fcb06059457a3b74e44ddb0d5c96d2ea7e4c5957/main.py#L111. Could the authors give some explanation?

    opened by fikry102 1
  • How to extend to base-to-novel classes task?

    Hi, this method modifies the parameters of the text encoder, so it cannot be extended to base-to-new classes tasks. I would like to know how to address this problem.

    opened by jingzhengli 3
  • Run TIP-adapter on text2img retrieval instead

    Hi, thanks for the amazing work on adapters for CLIP. Currently the framework computes the affinities between the test query image and the cache keys before obtaining the corresponding few-shot label, and this works well. I would just like your advice on how I can extend this to text2img retrieval, where I would like to query with a text search term and utilise the cache key-value adapter to return the corresponding images. Would it be as naive as doing text-to-text embedding affinity matching of the query text against the cache VALUES (instead of the keys), since they contain the ground-truth labels for the few-shot learning?

    opened by adrielkuek 3
  • The "alpha" and "beta" in the paper are the opposite of the "alpha" and "beta" in the code of Tip-Adapter

    In the code:

        alpha_list = [i * (6.0 - 1.0) / 20 + 1 for i in range(20)]
        beta_list = [i * (7 - 0.1) / 200 + 0.1 for i in range(200)]

    In the paper: [image of the corresponding equations]
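
    For context, a sketch of the grid search those two lists imply (hypothetical names; val_feat and val_labels stand for validation features and labels, and tip_adapter_logits is the sketch from the Overview above). Whichever symbol plays which role, the search simply scores every (alpha, beta) pair on the validation set and keeps the best one:

    def search_hyperparams(val_feat, val_labels, clip_weights, cache_keys, cache_values):
        alpha_list = [i * (6.0 - 1.0) / 20 + 1 for i in range(20)]
        beta_list = [i * (7 - 0.1) / 200 + 0.1 for i in range(200)]
        best_acc, best_pair = 0.0, None
        for alpha in alpha_list:
            for beta in beta_list:
                logits = tip_adapter_logits(val_feat, clip_weights, cache_keys, cache_values,
                                            alpha=alpha, beta=beta)
                acc = (logits.argmax(dim=-1) == val_labels).float().mean().item()
                if acc > best_acc:
                    best_acc, best_pair = acc, (alpha, beta)
        return best_pair, best_acc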

    opened by euminds 3
  • Bug when I try cifar100

    Thanks for your work. When I tried your code on CIFAR100, I got this error and I don't know how to solve it. Because of ImageNet's huge number of images, I can only experiment with CIFAR100. Please help.

    Torch version: 1.7.1
    Namespace(alpha=1, augment_epoch=10, beta=1.17, lr=0.001, train_epoch=20)
    Model parameters: 151,277,313
    Input resolution: 224
    Context length: 77
    Vocab size: 49408
    Load data finished.
    start getting text features.
    finish getting text features.
    start getting image features
    start saving training image features
    Augment time: 0 / 10
    3%|▉ | 6/196 [00:03<01:45, 1.81it/s]
    Traceback (most recent call last):
      File "main.py", line 487, in <module>
        main()
      File "main.py", line 244, in main
        for i, (images, target) in enumerate(tqdm(train_loader)):
      File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/tqdm/std.py", line 1180, in __iter__
        for obj in iterable:
      File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/yh/.conda/envs/prompt/lib/python3.6/site-packages/torchvision/datasets/cifar.py", line 113, in __getitem__
        img, target = self.data[index], self.targets[index]
    IndexError: list index out of range
    [1]+ Killed python main.py

    opened by heng-yin 2
Owner

Peng Gao, Young Scientist at Shanghai AI Lab