Overview

This repository contains the database and code used in the paper Embedding Arithmetic for Text-driven Image Transformation (Guillaume Couairon, Holger Schwenk, Matthijs Douze, Matthieu Cord)

This work is inspired by the geometric properties of word embeddings, such as Queen ~ Woman + (King - Man). We extend this idea to multimodal embedding spaces (such as CLIP), which lets us semantically edit images via "delta vectors".

The transformed embedding is then used to retrieve a matching image from a dataset of images.
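As a minimal sketch of this idea (hypothetical helper names; it assumes L2-normalized image and text embeddings from the same multimodal encoder):

import torch

# Delta-vector edit: move the image embedding in the direction that
# takes the source word (e.g. "cat") to the target word (e.g. "dog").
def edit(img_emb, src_emb, tgt_emb, lbd=1.0):
    delta = tgt_emb - src_emb                      # the "delta vector"
    target = img_emb + lbd * delta
    return target / target.norm(dim=-1, keepdim=True)

# Retrieval: index of the closest database embedding (cosine similarity).
def retrieve(db, target_emb):
    return (db @ target_emb).argmax().item()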

The SIMAT Dataset

We build SIMAT, a dataset for evaluating the task of text-driven image transformation on simple images that can be characterized by a single subject-relation-object annotation. A transformation query is a pair (image, query), where the query asks to change the subject, the relation, or the object in the input image. SIMAT contains ~6k images and an average of 3 transformation queries per image.

The goal is to retrieve an image in the dataset that corresponds to the query specifications. We use OSCAR as an oracle to check whether retrieved images are correct with respect to the expected modifications.
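For illustration, a single transformation query can be pictured as a record like the one below (field names follow the columns of simat_db/transfos.csv as used in eval.py; the concrete values are made up for the example):

# One hypothetical row of simat_db/transfos.csv
query = {
    "dataset_id": 98316,   # index of the source image in the database
    "value": "cat",        # word to be changed (here, the subject)
    "target": "dog",       # word it should be changed to
    "is_test": False,      # dev vs. test split
}
# The query asks: starting from image #98316 (a cat sitting on grass),
# retrieve an image in which the subject "cat" is replaced by "dog".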

Examples

Below are a few examples from the dataset, together with the images retrieved by our best-performing algorithm.

Download dataset

The SIMAT database is composed of crops of images from Visual Genome. You first need to download Visual Genome and then run the following command:

python prepare_dataset.py --VG_PATH=/path/to/visual/genome
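After the script runs, the crops are arranged in one folder per caption, so a single image can be addressed by a path such as the following (layout inferred from the inference example below):

simat_db/images/A cat sitting on a grass/98316.jpg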

Perform inference with CLIP ViT-B/32

In this example, we use the CLIP ViT-B/32 model to edit an image. Note that the database of CLIP embeddings is precomputed.

import torch
import clip
from torchvision import datasets
from PIL import Image
from IPython.display import display

# convenience: L2-normalize tensors along the last dimension
torch.Tensor.normalize = lambda x: x / x.norm(dim=-1, keepdim=True)

# database to perform the retrieval step
dataset = datasets.ImageFolder('simat_db/images/')
db = torch.load('data/clip_simat.pt').float()

model, prep = clip.load('ViT-B/32', device='cuda:0', jit=False)

image = Image.open('simat_db/images/A cat sitting on a grass/98316.jpg')
img_enc = model.encode_image(prep(image).unsqueeze(0).to('cuda:0')).float().cpu().detach().normalize()

txt = ['cat', 'dog']
txt_enc = model.encode_text(clip.tokenize(txt).to('cuda:0')).float().cpu().detach().normalize()

# optionally, we can apply a linear layer on top of the embeddings
heads = torch.load('data/head_clip_t=0.1.pt')
img_enc = heads['img_head'](img_enc).normalize()
txt_enc = heads['txt_head'](txt_enc).normalize()
db = heads['img_head'](db).normalize()


# now we perform the transformation step
lbd = 1
target_enc = img_enc + lbd * (txt_enc[1] - txt_enc[0])


retrieved_idx = (db @ target_enc.float().T).argmax(0).item()


display(dataset[retrieved_idx][0])
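The database data/clip_simat.pt loaded above is assumed to be precomputed; a minimal sketch of how such a tensor of normalized image embeddings could be built (the repository's encode.py may differ in batching and details):

import torch
import clip
from torchvision import datasets
from torch.utils.data import DataLoader

model, prep = clip.load('ViT-B/32', device='cuda:0', jit=False)
folder = datasets.ImageFolder('simat_db/images/', transform=prep)
loader = DataLoader(folder, batch_size=256, num_workers=4)

embs = []
with torch.no_grad():
    for imgs, _ in loader:
        e = model.encode_image(imgs.to('cuda:0')).float().cpu()
        embs.append(e / e.norm(dim=-1, keepdim=True))

torch.save(torch.cat(embs), 'data/clip_simat.pt')  # [num_images, 512] for ViT-B/32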

Compute SIMAT scores with CLIP

You can run the evaluation script with the following command:

python eval.py --backbone clip --domain dev --tau 0.01 --lbd 1 2

It automatically loads the adaptation layer corresponding to the value of tau.
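Concretely, the checkpoint is resolved from tau by filename; the same pattern appears in the eval.py excerpt quoted in the comments below:

import torch

tau = 0.01  # the value passed via --tau
heads = torch.load(f'data/head_clip_t={tau}.pt')  # loads data/head_clip_t=0.01.pt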

Train adaptation layers on COCO

In this part, you can train linear adaptation layers on top of the CLIP encoders on the COCO dataset, to obtain better-aligned image and text embeddings. Here is an example:

python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512
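For intuition, here is a rough sketch of what such an adaptation stage can look like. This is a hypothetical illustration rather than the repository's adaptation.py: two linear heads on top of frozen CLIP embeddings, trained with a symmetric InfoNCE-style contrastive loss at temperature tau on matched COCO image/caption pairs.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim, tau = 512, 0.1  # CLIP ViT-B/32 embedding size; tau as in --tau
img_head, txt_head = nn.Linear(dim, dim), nn.Linear(dim, dim)
opt = torch.optim.Adam([*img_head.parameters(), *txt_head.parameters()], lr=1e-3)

def train_step(img_embs, txt_embs):
    # img_embs, txt_embs: precomputed CLIP embeddings of matched pairs, [B, dim]
    i = F.normalize(img_head(img_embs), dim=-1)
    t = F.normalize(txt_head(txt_embs), dim=-1)
    logits = i @ t.T / tau                 # pairwise similarities
    labels = torch.arange(len(i))          # matching pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()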

Citation

If you find this paper or dataset useful for your research, please cite the following.

@article{gco1embedding,
  title={Embedding Arithmetic for Text-driven Image Transformation},
  author={Couairon, Guillaume and Cord, Matthieu and Douze, Matthijs and Schwenk, Holger},
  journal={arXiv preprint arXiv:2112.03162},
  year={2021}
}

References

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. OpenAI, 2021.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV, 2017.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. ECCV, 2020.

License

The SIMAT dataset is released under the MIT license. See LICENSE for details.

Comments
  • SIMAT score of zero-shot CLIP not reproduced

Hi! I'm trying to reproduce the SIMAT score of zero-shot CLIP reported in the paper (Table 1, Fig. 5). I used eval.py to reproduce this score, and changed it slightly to remove the adaptation layer.

import numpy as np
import pandas as pd
import torch
# .normalize() below is the repo's L2-normalization helper on torch.Tensor

def simat_eval(args):
    # heads are loaded but unused: the adaptation layer is disabled below
    emb_key = 'clip'
    heads = torch.load(f'data/head_{emb_key}_t={args.tau}.pt')
    output = {}
    transfos = pd.read_csv('simat_db/transfos.csv', index_col=0)
    transfos = transfos[transfos.is_test == (args.domain == 'test')]
    img_embs = torch.load('data/clip_simat_2.pt').float()

    img_embs = img_embs.normalize()
    #img_embs = heads['img_head'](clip_simat).normalize()
    value_embs = torch.stack([img_embs[did] for did in transfos.dataset_id])

    word_embs = dict(torch.load(f'data/simat_words_{emb_key}_2.ptd'))
    #w2v = {k:heads['txt_head'](v.float()).normalize() for k, v in word_embs.items()}
    w2v = {k: v.float().normalize() for k, v in word_embs.items()}
    delta_vectors = torch.stack([w2v[x.target] - w2v[x.value] for i, x in transfos.iterrows()])

    oscar_scores = torch.load('simat_db/oscar_similarity_matrix.pt')
    weights = 1 / np.array(transfos.norm2) ** .5
    weights = weights / sum(weights)

    for lbd in args.lbds:
        target_embs = value_embs + lbd * delta_vectors

        # top-5 nearest neighbours, excluding the query image itself
        nnb = (target_embs @ img_embs.T).topk(5).indices
        nnb_notself = [r[0] if r[0].item() != t else r[1] for r, t in zip(nnb, transfos.dataset_id)]

        # a retrieval counts as correct if OSCAR scores it above 0.5
        scores = np.array([oscar_scores[ri, tc] for ri, tc in zip(nnb_notself, transfos.target_ids)]) > .5
        output[lbd] = 100 * np.average(scores, weights=weights)
    return output

When I use your embedding files (clip_simat.pt, simat_words_clip.ptd) it works well, but I cannot reach full performance with the embeddings I generated with encode.py. (I used the original CLIP model from the OpenAI repository, so it is probably not a problem with the CLIP model.)

I would appreciate it if you could check this. Thank you!

    bug 
    opened by junhyukso 1
  • Adding Code of Conduct file

This pull request was created automatically because we noticed your project was missing a Code of Conduct file.

    Code of Conduct files facilitate respectful and constructive communities by establishing expected behaviors for project contributors.

    This PR was crafted with love by Facebook's Open Source Team.

    CLA Signed 
    opened by facebook-github-bot 0
  • Adding Contributing file

This pull request was created automatically because we noticed your project was missing a Contributing file.

    CONTRIBUTING files explain how a developer can contribute to the project - which you should actively encourage.

    This PR was crafted with love by Facebook's Open Source Team.

    CLA Signed 
    opened by facebook-github-bot 0
  • RuntimeError when finetuning CLIP adaptation layers

    Hi,

When I ran the command python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512 --gpus 2, I encountered the following error:

      ...
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 93, in forward
        return self.module.validation_step(*inputs, **kwargs)
      File "adaptation.py", line 99, in validation_step
        img_ = self.core.encode_image(img).detach()
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/clip/model.py", line 341, in encode_image
        return self.visual(image.type(self.dtype))
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/clip/model.py", line 224, in forward
        x = self.conv1(x)  # shape = [*, width, grid, grid]
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "/usr/.conda/envs/simat/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
        return F.conv2d(input, weight, bias, self.stride,
    RuntimeError: Given groups=1, weight of size [768, 3, 32, 32], expected input[1, 1536, 224, 224] to have 3 channels, but got 1536 channels instead
    

I think the error occurred because the batch and channel dimensions were not decoupled. To resolve it, I inserted a line of code, img = img.reshape(-1, 3, img.shape[-1], img.shape[-1]), which separates the batch and channel dimensions, at Line 77 and Line 90 in adaptation.py. After that, I could run the code fully without the error.
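For context, the workaround amounts to something like this (a sketch based on the description above; it assumes the collated batch arrives as [1, B*3, H, W]):

# adaptation.py, validation_step (sketch of the reported workaround)
img = img.reshape(-1, 3, img.shape[-1], img.shape[-1])  # [1, B*3, H, W] -> [B, 3, H, W]
img_ = self.core.encode_image(img).detach()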

Is my modification the right solution? I had another reproduction problem with this code, and I wonder whether this might be the cause.

    Thanks, Janet

    opened by parang99 0
  • Problem on reproducing adaptation layer

    Hi @PhazCode,

    Thanks for this great work.

I tried to reproduce the training of new CLIP adaptation layers on the COCO dataset. After training, I computed SIMAT scores with my trained weights; the results are as follows:

    lbd=1.0: 13.09
    lbd=2.0: 25.78
    lbd=3.0: 29.94
    lbd=4.0: 28.73
    lbd=5.0: 26.21
    

When I compute SIMAT scores with the provided weights, the results are as follows:

    lbd=1.0: 47.59
    lbd=2.0: 35.78
    lbd=3.0: 29.04
    lbd=4.0: 26.36
    lbd=5.0: 24.52
    

For training, I followed the hyperparameters mentioned in the paper. The other parameters were initialized with the default values in adaptation.py. The hyperparameters are as follows:

• max_epochs: 30
• batch_size: 4096
• lr: 1e-3
• tau: 0.1
• sched_step_size: 25
• sched_gamma: 0.1

The command I used was python adaptation.py --backbone ViT-B/32 --lr 1e-3 --tau 0.1 --batch_size 4096 --wandb --max_epochs 30. I also ran the command given in README.md, python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512.

Unfortunately, I couldn't obtain the same or even similar results to those of the provided weights with either command. Did I make a mistake somewhere, or is there a way to resolve this problem?

    My settings are as follows:

• PyTorch: 1.12.1
• CUDA: 11.3
• Python: 3.8.13
• GPUs: 8× Tesla V100

    Thanks, Janet

    opened by parang99 0
  • Unable to reproduce adaptation results

    Hi,

    Congrats on the amazing work!

I am trying to reproduce the results of your experiment fine-tuning CLIP on MS-COCO (Figure 5, Section 5.2 in the paper); however, I am running into issues while doing the exact fine-tuning and am getting lower SIMAT scores than reported in your paper.

These are the training and validation loss plots from the fine-tuning:

Adaptation at tau=0.01: [training/validation loss plot]

Adaptation at tau=0.1: [training/validation loss plot]

I used the exact same hyperparameter settings as in your adaptation script:

    • learning rate=1e-3
    • lr decay schedule with step_size=25 and gamma=0.1
    • num_epochs=50
    • gradient clipping at norm=1

    Do you have any insights into where the training script might be going wrong and why the loss seems to be stagnating as we step through training? After adaptation (with these training plots), I get a SIMAT score of 37.10 compared to your 47.5 (at tau=0.1, lambda=1). Similarly, I get a SIMAT score of around 16.61 compared to your 17.10 (at tau=0.01, lambda=1).

Note: I had to reimplement chunks of your code in plain PyTorch, as I believe parts of your adaptation script were incomplete in PyTorch Lightning. I would be happy to share my adaptation script with you if that would help!

    Hoping for a prompt response.

    opened by vishaal27 0
Owner
Meta Research