This repository contains the dataset and code used in the paper Embedding Arithmetic for Text-driven Image Transformation (Guillaume Couairon, Holger Schwenk, Matthijs Douze, Matthieu Cord).
This work is inspired by the geometric properties of word embeddings, such as Queen ~ Woman + (King - Man). We extend this idea to multimodal embedding spaces (such as CLIP), which allows us to semantically edit images via "delta vectors".
Transformed images can then be retrieved from a dataset of images.
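As a rough sketch of the idea, the transformation simply moves a normalized image embedding along a text direction in the shared embedding space. The snippet below uses random placeholder embeddings; the names img_emb, txt_src, txt_tgt and lbd are illustrative and not taken from this repository.

import torch
import torch.nn.functional as F

# placeholder unit-norm embeddings standing in for CLIP outputs (illustrative only)
img_emb = F.normalize(torch.randn(512), dim=-1)   # embedding of the input image
txt_src = F.normalize(torch.randn(512), dim=-1)   # embedding of e.g. "cat"
txt_tgt = F.normalize(torch.randn(512), dim=-1)   # embedding of e.g. "dog"

lbd = 1.0                          # transformation strength
delta = txt_tgt - txt_src          # text direction encoding the edit
target = img_emb + lbd * delta     # embedding of the transformed image
# the edited image is then the database image whose embedding is most similar to target

A full, runnable version with real CLIP embeddings is given in the inference section below.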
The SIMAT Dataset
We build SIMAT, a dataset to evaluate the task of text-driven image transformation, for simple images that can be characterized by a single subject-relation-object annotation. A transformation query is a pair (image, text query) where the query asks to change either the subject, the relation, or the object in the input image. SIMAT contains ~6k images with an average of 3 transformation queries per image.
The goal is to retrieve an image in the dataset that corresponds to the query specifications. We use OSCAR as an oracle to check whether retrieved images are correct with respect to the expected modifications.
Examples
Below are a few examples from the dataset, along with the images retrieved by our best-performing algorithm.
Download dataset
The SIMAT dataset is composed of image crops from Visual Genome. You first need to download the Visual Genome images and then run the following command:
python prepare_dataset.py --VG_PATH=/path/to/visual/genome
Perform inference with CLIP ViT-B/32
In this example, we use the CLIP ViT-B/32 model to edit an image. Note that the dataset of CLIP embeddings is pre-computed.
import torch
import clip
from torchvision import datasets
from PIL import Image
from IPython.display import display

# hack to normalize tensors easily
torch.Tensor.normalize = lambda x: x / x.norm(dim=-1, keepdim=True)
# image database used for the retrieval step, with pre-computed CLIP embeddings
dataset = datasets.ImageFolder('simat_db/images/')
db = torch.load('data/clip_simat.pt').float()

# CLIP ViT-B/32 image and text encoders
model, prep = clip.load('ViT-B/32', device='cuda:0', jit=False)
# encode the input image and the source/target words of the transformation query
image = Image.open('simat_db/images/A cat sitting on a grass/98316.jpg')
img_enc = model.encode_image(prep(image).unsqueeze(0).to('cuda:0')).float().cpu().detach().normalize()

txt = ['cat', 'dog']
txt_enc = model.encode_text(clip.tokenize(txt).to('cuda:0')).float().cpu().detach().normalize()
# optionally, apply the adaptation layers (linear heads) on top of the embeddings
heads = torch.load('data/head_clip_t=0.1.pt')
img_enc = heads['img_head'](img_enc).normalize()
txt_enc = heads['txt_head'](txt_enc).normalize()
db = heads['img_head'](db).normalize()
# transformation step: move the image embedding along the text direction
lbd = 1
target_enc = img_enc + lbd * (txt_enc[1] - txt_enc[0])

# retrieval step: find the database image closest to the target embedding
retrieved_idx = (db @ target_enc.float().T).argmax(0).item()
display(dataset[retrieved_idx][0])
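If you want to look at several candidate images instead of only the best match, you can replace the argmax with a top-k lookup (a small variation on the snippet above; the value 5 is arbitrary):

# optional: display the 5 most similar images instead of only the best one
topk_idx = (db @ target_enc.float().T).squeeze().topk(5).indices.tolist()
for idx in topk_idx:
    display(dataset[idx][0])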
Compute SIMAT scores with CLIP
You can run the evaluation script with the following command:
python eval.py --backbone clip --domain dev --tau 0.01 --lbd 1 2
It automatically loads the adaptation layer corresponding to the value of tau.
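For example, assuming --domain test selects the test split (our reading of the --domain flag above), a sweep over several transformation strengths with the tau=0.1 adaptation layer could look like:

python eval.py --backbone clip --domain test --tau 0.1 --lbd 0.5 1 1.5 2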
Train adaptation layers on COCO
In this part, you can train linear adaptation layers on top of the CLIP encoders on the COCO dataset, to get a better image-text alignment. Here is an example:
python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512
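For reference, the adaptation layers are linear heads applied on top of the CLIP embeddings. The sketch below shows one plausible training step with an InfoNCE-style contrastive loss at temperature tau on a batch of matching image/caption embeddings; it is a simplified illustration under these assumptions, not the exact code of adaptation.py, and img_emb / txt_emb are placeholders for pre-computed CLIP embeddings of COCO pairs.

import torch
import torch.nn.functional as F

d = 512                                  # CLIP ViT-B/32 embedding dimension
img_head = torch.nn.Linear(d, d)         # adaptation layer for image embeddings
txt_head = torch.nn.Linear(d, d)         # adaptation layer for text embeddings
opt = torch.optim.Adam(list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-3)
tau = 0.1                                # contrastive temperature

# placeholder batch of matching CLIP embeddings (in practice: COCO image/caption pairs)
img_emb, txt_emb = torch.randn(512, d), torch.randn(512, d)

x = F.normalize(img_head(img_emb), dim=-1)
y = F.normalize(txt_head(txt_emb), dim=-1)
logits = x @ y.t() / tau                 # pairwise similarities
labels = torch.arange(len(x))            # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
opt.zero_grad(); loss.backward(); opt.step()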
Citation
If you find this paper or dataset useful for your research, please use the following citation.
@article{gco1embedding,
title={Embedding Arithmetic for Text-driven Image Transformation},
author={Guillaume Couairon and Matthieu Cord and Matthijs Douze and Holger Schwenk},
journal={arXiv preprint arXiv:2112.03162},
year={2021}
}
References
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, OpenAI 2021
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, IJCV 2017
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020
License
The SIMAT dataset is released under the MIT license. See LICENSE for details.