Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method for visualizing any Transformer-based network, including examples for DETR and VQA.

Overview

PyTorch Implementation of Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

1   Using Colab

  • Please note that the notebooks assume you are using a GPU. To switch the runtime, go to Runtime -> Change runtime type and select GPU.
  • Installing all the requirements may take some time. After installation, please restart the runtime.

2   Running Examples

We provide two Jupyter notebooks that run the examples presented in the paper.

  • The notebook for LXMERT contains both the examples from the paper and examples with images from the internet and free-form questions. To use your own input, simply change the URL variable to your image and the question variable to your free-form question (see the example cell after this list).

    (example visualizations: LXMERT.PNG, LXMERT-web.PNG)
  • The notebook for DETR contains the examples from the paper. To use your own input, simply change the URL variable to your image.

    (example visualization: DETR.PNG)
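
For reference, the notebook cell you edit looks roughly like the sketch below. The variable names URL and question come from the notebooks themselves; the values are arbitrary examples, so substitute your own. URL is used by both notebooks, while question applies only to the LXMERT notebook.

    # Hypothetical example values: replace with your own image URL and question.
    URL = "https://example.com/my_image.jpg"          # any publicly reachable image
    question = "what color is the car on the left?"   # free-form question (LXMERT notebook only)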

3   Reproduction of results

3.1   VisualBERT

Run the run.py script as follows:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd` python VisualBERT/run.py --method=<method_name> --is-text-pert=<true/false> --is-positive-pert=<true/false> --num-samples=10000 config=projects/visual_bert/configs/vqa2/defaults.yaml model=visual_bert dataset=vqa2 run_type=val checkpoint.resume_zoo=visual_bert.finetuned.vqa2.from_coco_train env.data_dir=/path/to/data_dir training.num_workers=0 training.batch_size=1 training.trainer=mmf_pert training.seed=1234

Note

If the datasets are not already in env.data_dir, the script will download them automatically to that path.

3.2   LXMERT

  1. Download valid.json:

    mkdir -p data/vqa  # create the directory first if it does not already exist
    pushd data/vqa
    wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/valid.json
    popd
  2. Download the COCO_val2014 set to your local machine.

    Note

    If you already downloaded COCO_val2014 for the VisualBERT tests, you can simply use the same path you used for VisualBERT.

  3. Run the perturbation.py script as follows:

    CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd` python lxmert/lxmert/perturbation.py  --COCO_path /path/to/COCO_val2014 --method <method_name> --is-text-pert <true/false> --is-positive-pert <true/false>

3.3   DETR

  1. Download the COCO dataset as described in the DETR repository. Note that you only need the validation set.

  2. Lower the IoU minimum threshold from 0.5 to 0.2 using the following steps:

    • Locate the cocoeval.py script in your Python library path:

      Find the library path:

      import sys
      print(sys.path)

      Find cocoeval.py:

      cd /path/to/lib
      find . -name cocoeval.py
    • Change the self.iouThrs value in the setDetParams function (which sets the parameters for the COCO detection evaluation) in the Params class as follows (a runtime alternative that avoids editing the installed file is sketched after this list):

      Instead of:

      self.iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)

      use:

      self.iouThrs = np.linspace(.2, 0.95, int(np.round((0.95 - .2) / .05)) + 1, endpoint=True)
  3. To run the segmentation experiment, use the following command:

    CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd`  python DETR/main.py --coco_path /path/to/coco/dataset  --eval --masks --resume https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth --batch_size 1 --method <method_name>
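
As an alternative to editing the installed file by hand, the sketch below locates cocoeval.py programmatically and patches the IoU thresholds at runtime. This is a minimal sketch, not part of the repository: it assumes pycocotools is installed in the active environment, the helper names (_orig_setDetParams, _patched_setDetParams) are hypothetical, and the patch must run before any COCO evaluator is constructed (e.g., near the top of DETR/main.py).

    import numpy as np
    import pycocotools.cocoeval as cocoeval

    # Print the path of the installed cocoeval.py (instead of searching sys.path by hand).
    print(cocoeval.__file__)

    # Runtime alternative to the manual edit above: wrap Params.setDetParams so the
    # IoU grid starts at 0.2 instead of 0.5.
    _orig_setDetParams = cocoeval.Params.setDetParams

    def _patched_setDetParams(self):
        _orig_setDetParams(self)
        self.iouThrs = np.linspace(.2, 0.95, int(np.round((0.95 - .2) / .05)) + 1, endpoint=True)

    cocoeval.Params.setDetParams = _patched_setDetParams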

4   Credits

The reproduction code builds on the VisualBERT, LXMERT, DETR, and CLIP codebases, which are bundled as subfolders of this repository.

Comments
  • CLIP ViT-B/16

    Hi Hila, thanks for your great work. I am trying to run the CLIP code from the notebook on the ViT-B/16 model, but I am getting attention maps that don't make any sense (I am not able to get results similar to those in the notebook). For the ViT-B/32 model I am able to reproduce the results, but for some reason the ViT-B/16 model causes an issue. Do you know why this is? The only things in the code I needed to change are:

    • Add the link to ViT-B/16 in the _MODELS dictionary in CLIP/clip/clip.py
    • Change the reshape in interpret from (1, 1, 7, 7) to (1, 1, dim, dim), where dim = int(image_relevance.numel() ** 0.5). Thanks!
    opened by sanjayss34 11
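
    A minimal, self-contained illustration of the resolution-agnostic reshape described in this issue; the random tensor below is only a stand-in for image_relevance:

      import torch

      # For ViT-B/16 at 224x224 there are 14*14 = 196 patch tokens (vs. 7*7 = 49 for ViT-B/32),
      # so the spatial grid size is inferred from the token count instead of being hard-coded.
      image_relevance = torch.rand(196)
      dim = int(image_relevance.numel() ** 0.5)          # 14 here, 7 for ViT-B/32
      image_relevance = image_relevance.reshape(1, 1, dim, dim)
      print(image_relevance.shape)                       # torch.Size([1, 1, 14, 14])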
  • Swin Transformer

    Hello Hila, thank you for your great work. It is impressive. Right now, I am working on visualizing attention maps with the Swin Transformer. Your work brings me some interesting insights. In your code CLIP-explainability.ipynb:

    for blk in image_attn_blocks:
        grad = blk.attn_grad                  # gradient hooked on this block's attention map
        cam = blk.attn_probs                  # attention weights saved during the forward pass
        cam = cam.reshape(-1, cam.shape[-1], cam.shape[-1])
        grad = grad.reshape(-1, grad.shape[-1], grad.shape[-1])
        cam = grad * cam                      # gradient-weighted attention
        cam = cam.clamp(min=0).mean(dim=0)    # keep positive contributions, average over heads
        R += torch.matmul(cam, R)             # accumulate relevance (R is initialized to the identity in the notebook)

    The shapes of grad and cam are supposed to be consistent across attention blocks. However, in the Swin Transformer the patch size changes across blocks, which results in different attention sizes. Can you give me some advice on how I can apply your work to generate relevance for the Swin Transformer? Thank you for your time. Best wishes, Kevin

    opened by KP-Zhang 7
  • Request for vanilla example notebook

    Wonderful paper. To check whether I'm getting something very wrong, this is my understanding of the differences between the two papers.

    Transformer-Explainability: You generate the building blocks with relprop as a companion function to propagate the relevancies backward.

    This is LRP, working backwards from the CAMs (Class Activation Mappings). So you propagate from outputs to inputs. To backpropagate you have to hard-code all the flow, i.e. all the concats and splits of data, e.g. when you have to diverge to cam1, cam2, then use /=2 for the matrix multiplication, then rejoin them in the clone during self-attention. This is awkward, and is what you're referring to when you say 'LRP requires a custom implementation of all network layers.' in the MM paper.

    Transformer-MM-Explainability: You work forwards and simply add the methods

    get_attn
    save_attn
    save_attn_gradients
    get_attn_gradients
    

    and the hooks in the forward pass to save them. This makes tracking things far easier, as you don't need to reverse-engineer the flow.

    Request: You have great alterations of the DETR, CLIP, LXMERT and VisualBERT repos that allow all the interaction coupling scores for the baselines and your method to be calculated and that plug smoothly into each repo.

    Could you provide an example using the forward-pass formulation of Transformer-MM-Explainability on a vanilla Transformer model (just your interaction scores, not all the other baseline methods) to act as a very simple, single-modality demonstrative example, ideally a Jupyter notebook with comments?

    opened by oliverdutton 7
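
    A minimal, hypothetical sketch of the forward-pass hook pattern described in this issue, on a toy single-modality self-attention block. Only the four method names come from the issue; ToyAttention, its arguments, and the usage lines are illustrative, not the repository's implementation.

      import torch
      import torch.nn as nn

      class ToyAttention(nn.Module):
          """Toy self-attention block that saves its attention map and the gradient flowing through it."""

          def __init__(self, dim, num_heads=4):
              super().__init__()
              self.num_heads = num_heads
              self.scale = (dim // num_heads) ** -0.5
              self.qkv = nn.Linear(dim, dim * 3)
              self.proj = nn.Linear(dim, dim)
              self.attn = None
              self.attn_gradients = None

          def save_attn(self, attn):
              self.attn = attn

          def get_attn(self):
              return self.attn

          def save_attn_gradients(self, attn_gradients):
              self.attn_gradients = attn_gradients

          def get_attn_gradients(self):
              return self.attn_gradients

          def forward(self, x):
              B, N, C = x.shape
              qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
              q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (B, heads, N, head_dim)
              attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
              self.save_attn(attn)                                 # keep the attention map
              if attn.requires_grad:
                  attn.register_hook(self.save_attn_gradients)     # keep the gradient w.r.t. the attention map
              out = (attn @ v).transpose(1, 2).reshape(B, N, C)
              return self.proj(out)

      # Usage: after a backward pass, get_attn() and get_attn_gradients() provide the
      # ingredients for the relevance update, e.g. (grad * attn).clamp(min=0).mean(dim=1).
      layer = ToyAttention(dim=64)
      x = torch.rand(1, 10, 64, requires_grad=True)
      layer(x).sum().backward()
      print(layer.get_attn().shape, layer.get_attn_gradients().shape)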
  • Question about Vit

    Thanks for the great work. I want to apply your ViT work (both the CVPR 2021 and the ICCV 2021 work) from the base 224 model to vit_base_patch16_384, because I think it will give a better relevancy map. Can I directly modify the config here to the 384 x 384 config and download the pre-trained weights for the 384 version, or do I need to make other changes?

    Thank you in advance for your help.

    opened by scott870430 6
  • Questions about CLIP visualization.

    I do not understand why the visualization of CLIP only uses the last two layers, due to the num_layers variable. Could you share some insights with us?

    The num_layers variable makes the heat map of CLIP clearer, but I think the visualizations of CLIP and ViT should be similar, or at least comparable, because they share the same architecture.

    opened by tingxueronghua 4
  • COCO 2014 or 2017?

    Dear @hila-chefer,

    Thank you for releasing this repo of your fascinating work!!

    Would you mind clarifying these two questions about your results for me? :)

    1. Were you doing detection or segmentation (i.e., evaluated on bounding boxes or polygons)? I see these two words used interchangeably in Table 1 and Fig. 6.
    2. Were your COCO results evaluated on COCO 2014 or COCO 2017? (In the README I see "COCO_val2014", but coco.py reads "val2017.json".) I could not find this detail in the paper.

    Thank you so much!

    Anh

    opened by anguyen8 4
  • Using the methods for a custom architecture

    Hello! Thank you for the excellent work and examples; I also really appreciate the time you've taken to respond to queries.

    In this spirit, I attempted to implement the mechanism to understand the multi-modal architecture of https://github.com/facebookresearch/av_hubert, which is essentially frames of a video -> text (with an encoder that maps video frames -> tokens and a decoder that takes this as input and produces words as output).

    To summarize what I attempted: I took the CLIP example as a reference, saved the attention weights from the output projection, and then used the gradients to compute the Grad-CAM and work further.

    But it did not work out as expected in the context of this problem. I have attached an example visualization of the output that I got.

    So, to TL;DR my questions: (1) If I have a custom architecture that is not a straightforward classification task, what modifications do I need to make to incorporate the rules presented in the paper? (2) Are the 'attn_probs' the softmax attention outputs of each layer in the CLIP example? (I assumed so, but was not sure based on the code.)

    Example of the output I am getting: [attached image]

    Thank you in advance!

    opened by SreeHarshaNelaturu 3
  • Problems with running it in Google colab

    The first cell went through, but with the following error message: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. xarray-einstats 0.2.2 requires numpy>=1.21, but you have numpy 1.19.2 which is incompatible. torchtext 0.13.0 requires torch==1.12.0, but you have torch 1.7.0 which is incompatible. torchaudio 0.12.0+cu113 requires torch==1.12.0, but you have torch 1.7.0 which is incompatible. tensorflow 2.8.2+zzzcolab20220527125636 requires numpy>=1.20, but you have numpy 1.19.2 which is incompatible. pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.5.2 which is incompatible. fastai 2.7.6 requires torchvision>=0.8.2, but you have torchvision 0.8.1 which is incompatible. datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible. albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.

    Then it stopped in the second cell. First, with from lxmert.lxmert.src.modeling_frcnn import GeneralizedRCNN, I got AttributeError: module 'PIL.Image' has no attribute 'Resampling'. I installed Pillow==9.0.0 following https://stackoverflow.com/questions/71738218/module-pil-has-not-attribute-resampling. Afterwards, I got AttributeError: module 'matplotlib.cbook' has no attribute '_deprecate_privatize_attribute' with from captum.attr import visualization. I cannot find a solution to get past this error.

    I would be most grateful if you could help.

    opened by songhuadan 2
  • Question about the visualization of CLIP‘s text token

    Excellent work. I notice you only provided an example of visualizing the image tokens in CLIP's image encoder. Is it possible to visualize the text tokens in CLIP? If so, how can I do this?

    opened by Kihensarn 2
  • How can I choose the method when I run the script?

    CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd` python VisualBERT/run.py --method=<method_name> --is-text-pert=<true/false> --is-positive-pert=<true/false> --num-samples=10000 config=projects/visual_bert/configs/vqa2/defaults.yaml model=visual_bert dataset=vqa2 run_type=val checkpoint.resume_zoo=visual_bert.finetuned.vqa2.from_coco_train env.data_dir=/path/to/data_dir training.num_workers=0 training.batch_size=1 training.trainer=mmf_pert training.seed=1234

    opened by Shuai-Lv 2
  • Generate relevance matrix in ViT of Hugging Face

    Hi, thank you for this great work!

    I have trained a Transformer model with ViT - HuggingFace. When I tried to visualise the attention maps I found your work. I am quite interested, but I find your code and HuggingFace's are different. I tried to modify the source code like this:

    class ViTLayer(nn.Module):
    
        def save_attn_gradients(self, attn_gradients):
            self.attn_gradients = attn_gradients
    
        def forward(self, hidden_states, head_mask=None, output_attentions=False):
            self_attention_outputs = self.attention(
                self.layernorm_before(hidden_states),  # in ViT, layernorm is applied before self-attention
                head_mask,
                output_attentions=output_attentions,
            )
    
            self_attention_outputs.register_hook(self.save_attn_gradients)
    

    I am new to Transformers. I am not sure whether I registered the hook on the right tensor. Can you help me check it?

    Thank you very much!

    opened by SketchX-QZY 2
  • Application to Sparse/Low-Rank Attention Matrices

    Hello, Excellent work!

    I was wondering if this explanation method is applicable to efficient transformers (such as those summarized in https://arxiv.org/abs/2009.06732) that use lower-rank or sparse attention matrices. In its current form, you would need the full, square (n x n) attention matrix to generate explanations. How can one adapt your method to an efficient transformer, such as the Reformer (https://arxiv.org/abs/2009.06732)?

    opened by FarzanT 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
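
    For context, the check the patch describes typically looks like the sketch below: verify that every member's resolved path stays inside the target directory before calling extractall(). This is an illustrative sketch of the general pattern, not the exact pull-request diff; is_within_directory and safe_extract are hypothetical helper names.

      import os
      import tarfile

      def is_within_directory(directory, target):
          # Resolve both paths and require that the member stays inside `directory`.
          abs_directory = os.path.abspath(directory)
          abs_target = os.path.abspath(target)
          return os.path.commonprefix([abs_directory, abs_target]) == abs_directory

      def safe_extract(tar, path="."):
          # Refuse to extract archives whose members would escape `path` (CVE-2007-4559).
          for member in tar.getmembers():
              member_path = os.path.join(path, member.name)
              if not is_within_directory(path, member_path):
                  raise Exception("Attempted path traversal in tar file")
          tar.extractall(path)

      # Usage: with tarfile.open("archive.tar") as tar: safe_extract(tar, path="data/")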
  • Details about the changes in the code of base models

    I am trying to study the code in this repository. However, it is difficult to figure out the changes that have been made in the subfolders of the base models (VisualBERT, LXMERT, DETR, etc.) for this project.

    Since the original repository of the Base Models may have changes after the code has been copied to this repository (i.e. their histories may not align), it becomes difficult to compare the Git diff.

    It would be helpful to attach the Git commit tag/ID of each base model repository corresponding to the latest commit at the time it was cloned. Using the commit tag, it will be convenient to align the original code with the code in this repository and compare the changes to the models.

    Additionally, documenting those changes may be helpful for future research, though it is probably time-consuming.

    opened by NikhilM98 1
Owner: Hila Chefer (MSc Student @ Tel Aviv University & Intern @ Microsoft's Innovation Labs)