Learning to Prompt for Vision-Language Models.

Overview

CoOp

Paper: Learning to Prompt for Vision-Language Models

Authors: Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

CoOp (Context Optimization) is a differentiable approach to continuous prompt learning that facilitates the deployment of pre-trained vision-language models (such as CLIP) on downstream datasets.
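
To make the mechanism concrete, below is a minimal PyTorch sketch of the idea, not the repository's exact code: a shared set of learnable context vectors is prepended to the frozen token embeddings of each class name, the resulting prompts are passed through CLIP's frozen text encoder, and classification uses cosine similarity between image and text features. Names such as PromptLearner and clip_like_logits, and the toy shapes, are illustrative; see trainers/coop.py for the real implementation, which also handles CLIP's start/end tokens.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """Minimal sketch of CoOp-style context optimization (illustrative, simplified)."""

    def __init__(self, classname_embeddings, n_ctx=16):
        super().__init__()
        # classname_embeddings: (n_classes, n_name_tokens, ctx_dim), taken from the
        # frozen token embedding table of CLIP.
        self.register_buffer("name_embs", classname_embeddings)
        ctx_dim = classname_embeddings.size(-1)
        ctx = torch.empty(n_ctx, ctx_dim)
        nn.init.normal_(ctx, std=0.02)  # random initialization (CTX_INIT words can replace this)
        self.ctx = nn.Parameter(ctx)    # the only trainable parameters

    def forward(self):
        n_cls = self.name_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # "end" class-token position: [context][class name]; CoOp also supports "middle"
        return torch.cat([ctx, self.name_embs], dim=1)

def clip_like_logits(image_features, text_features, logit_scale=100.0):
    """Cosine-similarity classification head, as used by CLIP and CoOp."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    return logit_scale * image_features @ text_features.t()

# Toy usage: 10 classes whose names embed to 4 tokens of dimension 512.
name_embs = torch.randn(10, 4, 512)
prompts = PromptLearner(name_embs)()       # (10, 16 + 4, 512), fed to the frozen text encoder in practice
image_feats = torch.randn(2, 512)          # stand-in for CLIP image features
text_feats = prompts.mean(dim=1)           # stand-in for the frozen text encoder's output
print(clip_like_logits(image_feats, text_feats).shape)   # torch.Size([2, 10])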

Updates

  • 15.10.2021: We find that the best_val model and the last_step model achieve similar performance, so we set TEST.FINAL_MODEL = "last_step" for all datasets to save training time. Why we used best_val: the (tiny) validation set was designed for the linear probe approach, which requires extensive tuning for its hyperparameters, so we used the best_val model for CoOp as well for fair comparison (in this way, both approaches have access to the validation set).

  • 09.10.2021: Important changes have been made to Dassl's transforms.py. Please pull the latest commits from https://github.com/KaiyangZhou/Dassl.pytorch and this repo to make sure the code works properly. In particular, 1) center_crop is now a default transform at test time (applied after resizing the smaller edge to a certain size to keep the image aspect ratio), and 2) for training, Resize(cfg.INPUT.SIZE) is deactivated when random_crop or random_resized_crop is used. Please read this issue on how these changes might affect the performance.

  • 18.09.2021: We have fixed an error in Dassl which could cause a training data loader to have zero length (so no training would be performed) when the dataset size is smaller than the batch size (due to drop_last=True). Please pull the latest commit for Dassl (>= 8eecc3c). This error led to lower results for CoOp in EuroSAT's 1- and 2-shot settings (all others are correct). We will update the paper on arXiv to fix this error.

How to Install

This code is built on top of the awesome toolbox Dassl.pytorch, so you need to install the dassl environment first. Simply follow the instructions described here to install dassl as well as PyTorch. After that, run pip install -r requirements.txt under CoOp/ to install a few more packages required by CLIP (this should be done with the dassl environment activated). Then, you are ready to go.

Follow DATASETS.md to install the datasets.

How to Run

We provide the running scripts in scripts/. Make sure you change the path in DATA and run the commands under CoOp/scripts/.

Few-Shot Learning

All you need is CoOp/scripts/main.sh, which contains six input arguments.

DATASET takes as input a dataset name, like imagenet or caltech101. The valid names are the names of the files in CoOp/configs/datasets/ (without the .yaml extension).

CFG means which config file to use, such as rn50, rn101 or vit_b32 (see CoOp/configs/trainers/CoOp/). Note that for ImageNet, we use CoOp/configs/trainers/CoOp/*_ep50.yaml for all settings (please follow the implementation details shown in the paper).

Below we provide examples on how to run CoOp on Caltech101.

CLIP + CoOp (M=16, end):

  • 1 shot: bash main.sh caltech101 rn50_ep50 end 16 1 False
  • 2 shots: bash main.sh caltech101 rn50_ep100 end 16 2 False
  • 4 shots: bash main.sh caltech101 rn50_ep100 end 16 4 False
  • 8 shots: bash main.sh caltech101 rn50 end 16 8 False
  • 16 shots: bash main.sh caltech101 rn50 end 16 16 False

CLIP + CoOp (M=16, mid):

  • 1 shot: bash main.sh caltech101 rn50_ep50 middle 16 1 False
  • 2 shots: bash main.sh caltech101 rn50_ep100 middle 16 2 False
  • 4 shots: bash main.sh caltech101 rn50_ep100 middle 16 4 False
  • 8 shots: bash main.sh caltech101 rn50 middle 16 8 False
  • 16 shots: bash main.sh caltech101 rn50 middle 16 16 False

CLIP + CoOp (M=16, end, CSC):

  • 1 shot: bash main.sh caltech101 rn50_ep50 end 16 1 True
  • 2 shots: bash main.sh caltech101 rn50_ep100 end 16 2 True
  • 4 shots: bash main.sh caltech101 rn50_ep100 end 16 4 True
  • 8 shots: bash main.sh caltech101 rn50 end 16 8 True
  • 16 shots: bash main.sh caltech101 rn50 end 16 16 True

CLIP + CoOp (M=16, mid, CSC):

  • 1 shot: bash main.sh caltech101 rn50_ep50 middle 16 1 True
  • 2 shots: bash main.sh caltech101 rn50_ep100 middle 16 2 True
  • 4 shots: bash main.sh caltech101 rn50_ep100 middle 16 4 True
  • 8 shots: bash main.sh caltech101 rn50 middle 16 8 True
  • 16 shots: bash main.sh caltech101 rn50 middle 16 16 True

After the experiments are finished, you can use parse_test_res.py to calculate the average results instead of manually looking into the log files. Say the structure of output/ is

output
|–– caltech101/
|   |–– CoOp/
|   |   |–– rn50_16shots/
|   |   |   |–– nctx16_cscFalse_ctpend/
|   |   |   |   |–– seed1/
|   |   |   |   |–– seed2/
|   |   |   |   |–– seed3/
|   |   |–– rn50_8shots/
|   |   |   |–– nctx16_cscFalse_ctpend/
|   |   |   |   |–– seed1/
|   |   |   |   |–– seed2/
|   |   |   |   |–– seed3/

To calculate the average results for the folder rn50_16shots/nctx16_cscFalse_ctpend/, you can run

python parse_test_res.py output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend

Then, you will see something like this in your terminal

Parsing files in output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend
file: output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend/seed1/log.txt. accuracy: 91.81%. error: 8.19%.
file: output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend/seed2/log.txt. accuracy: 92.01%. error: 7.99%.
file: output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend/seed3/log.txt. accuracy: 92.17%. error: 7.83%.
===
Summary of directory: output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend
* accuracy: 92.00% +- 0.15%
* error: 8.00% +- 0.15%
===
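
If you just want a quick sanity check without parse_test_res.py, the mean and standard deviation can also be computed with a few lines of plain Python. This is only a rough sketch and assumes each seed*/log.txt ends with an accuracy line in the format shown above; parse_test_res.py remains the reference tool.

import glob
import re
import statistics

def summarize(directory):
    accs = []
    for log_path in sorted(glob.glob(f"{directory}/seed*/log.txt")):
        with open(log_path) as f:
            matches = re.findall(r"accuracy: ([\d.]+)%", f.read())
        if matches:
            accs.append(float(matches[-1]))  # use the last reported accuracy in each log
            print(f"file: {log_path}. accuracy: {matches[-1]}%")
    mean = statistics.mean(accs)
    std = statistics.pstdev(accs)  # note: parse_test_res.py may use a different spread estimator
    print(f"* accuracy: {mean:.2f}% +- {std:.2f}%")

summarize("output/caltech101/CoOp/rn50_16shots/nctx16_cscFalse_ctpend")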

How to initialize the context tokens with pre-trained word vectors? Specify the words for the parameter TRAINER.COOP.CTX_INIT in your config file. In our paper, we use configs/trainers/rn50_ctxv1.yaml (give this file to --config-file, see scripts/main.sh), which uses "a photo of a" as the initialization words.
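
Under the hood, word-based initialization amounts to tokenizing the given phrase with CLIP's tokenizer and copying the corresponding rows of the frozen token-embedding table into the learnable context. The sketch below uses the openai/CLIP package and is a simplified illustration of that step; variable names are ours, and the real code lives in trainers/coop.py.

import torch
import clip

# "RN50" matches the rn50 configs used above; other CLIP backbones work the same way.
model, _ = clip.load("RN50", device="cpu")

init_words = "a photo of a"
tokens = clip.tokenize(init_words)                      # (1, 77) token ids, incl. SOS/EOS
with torch.no_grad():
    embedding = model.token_embedding(tokens).float()   # (1, 77, ctx_dim)

# Simplification: assume each word maps to a single BPE token (true for "a photo of a").
n_ctx = len(init_words.split())
ctx_init = embedding[0, 1:1 + n_ctx, :]                 # skip SOS, keep the word embeddings
ctx = torch.nn.Parameter(ctx_init.clone())              # these become the trainable context vectors
print(ctx.shape)                                        # (n_ctx, ctx_dim)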

How to visualize nearest words for the learned context tokens? All you need is interpret_prompt.py. Say the learned tokens are saved in a/b/c/prompt_learner/model.pth.tar and you would like to see the top-3 nearest words for each token. In this case, run python interpret_prompt.py a/b/c/prompt_learner/model.pth.tar 3
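
Conceptually, this interpretation step searches CLIP's token-embedding table for the vocabulary entries closest to each learned context vector. The snippet below sketches that idea with random vectors standing in for the learned context (the distance metric and decoding details are assumptions; interpret_prompt.py is the authoritative script).

import torch
import clip
from clip.simple_tokenizer import SimpleTokenizer

model, _ = clip.load("RN50", device="cpu")
token_embedding = model.token_embedding.weight.float()    # (vocab_size, ctx_dim)

# Stand-in for the learned context; in practice load it from .../prompt_learner/model.pth.tar.
ctx = torch.randn(16, token_embedding.size(1))

dists = torch.cdist(ctx, token_embedding)                 # Euclidean distance to every vocab entry
topk = dists.topk(3, largest=False).indices               # 3 nearest tokens per context vector

tokenizer = SimpleTokenizer()
for i, idxs in enumerate(topk):
    words = [tokenizer.decoder[idx.item()] for idx in idxs]
    print(f"context token {i}: {words}")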

Robustness to Distribution Shift

To reproduce the robustness experiments, you can simply load the models learned on ImageNet and evaluate them on the following datasets: imagenetv2, imagenet-sketch, imagenet-a and imagenet-r.

The command is provided in CoOp/scripts/eval.sh. The key arguments are --model-dir, --load-epoch and --eval-only. --model-dir indicates the directory where the models are saved (i.e. the entire folder containing log.txt, the tensorboard file and prompt_learner/). --load-epoch tells the code to load the model saved at a specific epoch, like --load-epoch 50 for ImageNet (see the source code for more details).

For example, to evaluate CLIP + CoOp (M=16, end) on ImageNetV2, you can do

# No need to use rn50_ep50 here as no training is performed
bash eval.sh imagenetv2 rn50

The default setting is SHOTS=16. Feel free to modify the script.

Again, you can use parse_test_res.py to automate the calculation of average performance. This time you should append --test-log, e.g., python parse_test_res.py directory --test-log.

Zero-Shot CLIP

See CoOp/scripts/zeroshot.sh.

Linear Probe CLIP

Please move to lpclip/.

How to Cite CoOp

If you use this code in your research, please kindly cite the following paper

@article{zhou2021coop,
    title={Learning to Prompt for Vision-Language Models},
    author={Zhou, Kaiyang and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei},
    journal={arXiv preprint arXiv:2109.01134},
    year={2021}
}
Comments
  • Cannot reproduce the accuracy on StanfordCars dataset

    Greetings! I can only get 46.81% accuracy and 47.05% per-class accuracy after running bash zeroshot.sh stanford_cars rn50. However, the reported accuracy on the StanfordCars dataset is ~55%. What's wrong?

    opened by machengcheng2016 8
  • Different random seeds lead to highly variable results.

    First of all, thank you for open-sourcing such easy-to-use code :) I reproduced your reported CoOp results on two datasets, DTD and Flower101. I ran the code with three random seeds, 1, 2 and 3, for both datasets, as in your default setting in ./scripts/main.sh. On DTD, the model's performance matches the result in the paper (acc: 63.46) when trained with seed=3, but the results for seeds 1 and 2 are poor (acc: ~15). As for Flower101, the results for seeds 2 and 3 are ~94, but seed 1's result is 44.50.

    I wonder if this is a normal situation for this few-shot training setting? Thanks for any suggestion :)

    opened by guozix 4
  • Why do token prefixes have to be a buffer type

    If the token prefix and suffix are just slices of the embedding (for example, replacing self.register_buffer("token_prefix", embedding[:, :1, :]) with self.token_prefix = embedding[:, :1, :] in this line), we would not have to ignore them when loading. So why do the token prefixes have to be buffers? Thanks a lot!

    opened by nbl97 4
  • Few-shot setting in CoCoOp Experiments

    Hello, Thank you for sharing your great work.

    I had a question regarding the few-shot setting in the CoCoOp experiments. In the paper, it is mentioned that CoCoOp follows a zero-shot evaluation (from base to novel classes) but for training the base classes, it uses a few-shot setting. However, generally for zero-shot evaluation, models are trained on the complete base classes.

    Does this mean that CoCoOp and CoOp require only a few-shot setting to perform well on novel categories? Can the same training recipe of CoCoOp or CoOp be used when training on all examples of the base classes?

    Thank you and kind regards.

    opened by muzairkhattak 2
  • GPU Memory Consumption of CoCoOp

    Hi, thanks a lot for the excellent work and the easy-to-use code! Recently I've been trying to use CoOp and CoCoOp in my research. However, I encounter a small problem: the GPU memory consumption of CoCoOp seems to be much larger (about 64x under my setting) than CoOp's, resulting in a small batch size and a very long training time. Based on my understanding, the reason is that the prompts in CoCoOp are conditioned on each instance instead of being shared across a batch. I've seen the same problem reported in the paper. May I ask whether there are any tricks to accelerate the training process? Thanks so much!

    opened by Dou-Yiming 2
  • About the configuration of "classnames"

    Thanks for your contributions!

    I have a question about "classnames = self.dm.dataset.classnames" (line 224 of "CoOp.build_model" in coop.py). What is the value of "classnames"? I checked the configuration files but didn't find it.

    opened by xujinglin 2
  • question about gradients on text encoder

    Hi, may I ask whether the original CLIP text encoder is frozen or not? The paper mentions that the text encoder is frozen, but I couldn't find that part in the code... Thanks a lot for your help!

    opened by vincentlux 2
  • torresyu_pr

    Dear Kaiyang, I found a bug in the shell scripts for CoCoOp. I think the correct fix is to remove the top line "cd ../../" in the .sh scripts; otherwise it raises a "train.py no such file" error. Thanks.

    opened by geekyutao 1
  • zero-shot or fine-tune?

    1. To my knowledge, CLIP can be directly applied to zero-shot learning (i.e., unseen/novel classes). CoOp and CoCoOp don't appear to be zero-shot learning, but require fine-tuning. However, I don't see the details about how to fine-tune in the paper. Am I misunderstanding it? In the meantime, I would like to know how CLIP is fine-tuned.
    2. I cannot understand Figure 1 in the paper: why can the performance of CoOp and CoCoOp be compared with zero-shot learning?
    opened by jingzhengli 1
  • Much better CoOp performance

    Thanks for your great work! I tried to use your code to reproduce some results of CoOp reported in your CoCoOp paper. I ran this model on the dtd dataset with bash main.sh dtd vit_b16_ep50 end 4 16 False, which is exactly the same setting as in the paper. I got a much higher performance, accuracy: 67.38% +- 0.51%, but the paper reports the CoOp performance as 54.24.

    opened by fanq15 1
  • Inferencing on single image

    I have successfully developed the train and test pipeline for my custom dataset. Can you help me out with making inference on a single image? I am using the trainer.model_inference(image) function. Is there a particular format this image needs to be in? I am using PIL to read the image.

    Error:

    File "/ContextOptimization/CoOp/trainers/coop.py", line 196, in forward
        image_features = self.image_encoder(image.type(self.dtype))
    File "/home/chandan/anaconda3/envs/coop/lib/python3.8/site-packages/PIL/Image.py", line 519, in __getattr__
        raise AttributeError(name)
    AttributeError: type

    Main function used:

    def main(args):
        cfg = setup_cfg(args)
        if cfg.SEED >= 0:
            print("Setting fixed seed: {}".format(cfg.SEED))
            set_random_seed(cfg.SEED)
        setup_logger(cfg.OUTPUT_DIR)

        if torch.cuda.is_available() and cfg.USE_CUDA:
            torch.backends.cudnn.benchmark = True

        print_args(args, cfg)
        print("Collecting env info ...")
        print("** System info **\n{}\n".format(collect_env_info()))

        trainer = build_trainer(cfg)

        trainer.load_model(args.model_dir, epoch=args.load_epoch)
        image = Image.open('/ContextOptimization/CoOp/data/0cd2ed50.png')
        result = trainer.model_inference(image)
        print(result)
        return result
    

    I am looking for the predicted class and predicted probabilities as output.

    Any direction would be appreciative.

    Thanks

    opened by ChandanVerma 1
  • If CoCoOp can use ResNet as backbone

    Thanks for your great work. I would like to ask whether you have considered using a CNN such as ResNet as the backbone in CoCoOp, and whether it is possible to do so.

    opened by RuoyuChen10 0
  • When I change the code, the result will drop considerably!

    Thanks for your great work! The previous issue has been solved, but I found a new one. If I change the code (https://github.com/KaiyangZhou/CoOp/blob/ff61507c790454bce7c5052c3ac39e60772f1f89/trainers/coop.py#L248) to self.register_model("model", self.model, self.optim, self.sched), the result drops considerably (by ~20%)! Can you give me some advice?

    opened by Zhangwenyao1 0
  • The performance of full fine-tuning on ResNet

    Hi, thanks for the nice code. I found that the performance is poor when fully fine-tuning the ResNet-based CLIP on ImageNet, while for the ViT-based CLIP the performance is good. Do you have any insight into why fully fine-tuning or linear-probing the ResNet-based CLIP makes the performance worse?

    opened by jingzhengli 0
  • About input of text

    Thanks for your great work! I want to ask why the input to the forward function is not (image, text), e.g., output = self.model(image, text). And what is the scheme for matching the text logits with the image logits?

    opened by TitaniumOne 0
  • When I train my network on oxford_flower (epoch=200), it gets a different result.

    Dear Zhou, when I train my network on oxford_flower (epoch=200), it gets a great result, as follows:

    => result
    • total: 2,463
    • correct: 2,268
    • accuracy: 92.1%
    • error: 7.9%
    • macro_f1: 91.6%
    Elapsed: 0:14:32

    But if I run it again (as your code shows, it will use the model I trained last time, which got good results), it gets a bad result:

    => result
    • total: 2,463
    • correct: 876
    • accuracy: 35.6%
    • error: 64.4%
    • macro_f1: 30.1%

    I am not sure why it produces a bad result; can you give me some advice? (Maybe it does not use the BN statistics of the trained model?)
    opened by Zhangwenyao1 0