Measuring if attention is explanation with ROAR

Overview

NLP ROAR Interpretability

Official code for: Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

[Figure: ROAR and Recursive ROAR faithfulness curves]

Install

git clone https://github.com/AndreasMadsen/nlp-roar-interpretability.git
cd nlp-roar-interpretability
python -m pip install -e .

Experiments

Tasks

There are scripts for each dataset. Note that some tasks share a dataset. Use this list to identify how to train a model for each task.

  • SST: python experiments/stanford_sentiment.py
  • SNLI: python experiments/stanford_nli.py
  • IMDB: python experiments/imdb.py
  • MIMIC (Diabetes): python experiments/mimic.py --subset diabetes
  • MIMIC (Anemia): python experiments/mimic.py --subset anemia
  • bAbI-1: python experiments/babi.py --task 1
  • bAbI-2: python experiments/babi.py --task 2
  • bAbI-3: python experiments/babi.py --task 3

Parameters

Each of the above scripts (stanford_sentiment, stanford_nli, imdb, mimic, and babi) takes the same set of CLI arguments. You can learn about each argument with --help. The most important arguments, which allow you to run the experiments presented in the paper, are:

  • --importance-measure: specifies which importance measure is used. It can be one of random, mutual-information, attention, gradient, or integrated-gradient.
  • --seed: specifies the seed used to initialize the model.
  • --roar-strategy: whether ROAR masking should be done in absolute terms (count) or relative terms (quantile).
  • --k: the proportion of tokens (in %) to mask if --roar-strategy quantile is used, or the number of tokens to mask if --roar-strategy count is used.
  • --recursive: indicates that the model used for computing the importance measure was trained with --k set to --k minus --recursive-step-size, instead of 0 as in classic ROAR.

Note that for --k > 0, the reference model must already be trained. For example, in the non-recursive case this means that a model trained with --k 0 must already be available.
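
For example, a sequence of runs for SST could look like the following (the flag values here are only illustrative; consult --help for defaults such as --recursive-step-size):

# 1. Baseline model, used as the reference for any --k > 0 run.
python experiments/stanford_sentiment.py --seed 0 --k 0

# 2. Classic ROAR: mask the 10% allegedly most important tokens according to attention.
python experiments/stanford_sentiment.py --seed 0 --importance-measure attention --roar-strategy quantile --k 10

# 3. Recursive ROAR: the importance measure comes from the model trained at the previous --k step.
python experiments/stanford_sentiment.py --seed 0 --importance-measure attention --roar-strategy quantile --k 20 --recursive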

Running on a HPC setup

For downloading dataset dependencies we provide a download.sh script.

Additionally, we provide scripts in batch_jobs/ for submitting all jobs to a Slurm queue. Note again that the ROAR scripts assume there are checkpoints for the baseline --k 0 models.

The jobs automatically use $SCRATCH/nlproar as the persistent directory.

MIMIC

See https://mimic.physionet.org/gettingstarted/access/ for how to get access to MIMIC. You will need to download DIAGNOSES_ICD.csv.gz and NOTEEVENTS.csv.gz and place them in mimic/ relative to your persistent directory.
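
For example, assuming the default $SCRATCH/nlproar persistent directory mentioned above, the expected layout would be:

$SCRATCH/nlproar/mimic/DIAGNOSES_ICD.csv.gz
$SCRATCH/nlproar/mimic/NOTEEVENTS.csv.gz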

Comments
  • Run multiple seeds per job

    Just an idea, because I don't think we are supposed to run a ton of 2-5 minute jobs. But it might be too complex to set up.

    ~~work in progress: I didn't test it at all. It is kind of a lot of bash magic, and I'm not strong in bash. I would probably do the same for babi and imdb.~~

    Based on https://github.com/AndreasMadsen/python-comp550-interpretability/pull/33

    opened by AndreasMadsen 9
  • implement importance-measure cache and set riemann_samples=50

    Given that we will be setting riemann_samples=50, which can take 1.5h just to compute, we could cache the importance measure, which will benefit the non-recursive case.

    Still need to set up the slurm job dependencies and make sure it actually works.

    opened by AndreasMadsen 8
  • Abstract datasets into base-classes

    I need the datasets to be subclasses for TorchScript support, and we will need it anyway if we add more datasets. Unfortunately the cache will have to be rebuilt, because this changes some of the .pkl file formats and filenames.

    • This fixes #14.
    • This sets the seed for some of the split generation. That could probably be done better, but at least it is set.
    • For babi it just hardcodes the labels, instead of computing them dynamically. That simplifies things a lot. I noticed there were some odd labels; I don't know if they are actually used, but it may be worth looking into.
    • If the encoded dataset and vocab exist, it will not build any of the intermediate files. Should make syncing less expensive.
    • The test split now uses the test dataset instead of the validation dataset (fixes a recently introduced bug).

    ~~work in progress: Everything should work. I ran a few iterations locally and rebuilt the cache. I need to rerun experiments on compute-canada to make sure it works completely.~~

    opened by AndreasMadsen 7
  • Attention sparsity

    Scripts for quantitatively evaluating the sparsity of attention.

    The general idea is to compute the number of tokens that make up the largest 95% of the attention mass (i.e. the smallest number of tokens that have a cumulative mass of 0.95).
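
    For illustration, a minimal sketch of that computation (the helper name is made up, and it assumes a 1D attention vector per example):

    import numpy as np

    def tokens_for_mass(attention, mass=0.95):
        """Smallest number of tokens whose normalized attention sums to at least `mass`."""
        weights = np.sort(np.asarray(attention, dtype=np.float64))[::-1]  # largest first
        cumulative = np.cumsum(weights / weights.sum())                   # cumulative attention mass
        return int(np.searchsorted(cumulative, mass) + 1)

    # A peaked distribution needs few tokens; a uniform one needs nearly all of them.
    print(tokens_for_mass([0.25, 0.25, 0.25, 0.25]))  # -> 4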

    Other ideas for quantifying the sparsity of attention:

    • Compute Gini index for each attention distribution (from each example) in each dataset and aggregate.
    opened by ncmeade 5
  • Compute importance measure in mini-batches

    This should work. I've tested batch_size=8 against batch_size=1 and got identical results.

    ~~What is left is to run this on compute-canada and make sure there are no out-of-memory issues. Also, maybe compute_batch_size should just be base_dataset.batch_size. I will test that too.~~

    opened by AndreasMadsen 5
  • Riemann samples

    Jobs to pick a suitable Riemann sample size for integrated gradients.

    ~~Branched from #17. I believe the merge order will now be #18 -> #17 -> #24.~~

    wip 
    opened by ncmeade 4
  • Ready for compute

    ~~Based on https://github.com/AndreasMadsen/python-comp550-interpretability/pull/22, which will need to land first.~~

    Probably the last fixes I will make before submitting to compute-canada.

    Finally managed to make some significant progress on the performance issues:

    • Use TorchScript for ROAR computation. Only makes a small difference.
    • Load and save lists instead of torch.tensor in the ROAR .pkl. This makes a large difference for the garbage collector; for some reason, torch.tensor objects loaded via pickle do not seem to be GC'ed correctly.
    • Use pin_memory=True in dataloader. Appears to make training a bit faster and is generally recommended for performance on CUDA devices.
    • Compute gradients wrt. the embedding instead of the one_hot encoding. This makes a huge difference in both memory and performance. Constructing the one_hot encoding took up an enormous amount of memory and had to be done on the CPU, so transferring it to the GPU was also slow. The new version gives identical gradients, but computes gradients over the embedding layer manually (see the sketch below).
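
      A rough sketch of that trick (not the actual code in this PR; module names and shapes here are made up):

      import torch
      import torch.nn as nn

      # For e = onehot(x) @ E, the chain rule gives dL/d(onehot) = (dL/de) @ E.T,
      # so the huge one-hot tensor never has to be materialized or moved to the GPU.
      vocab_size, embed_dim = 100, 16
      embedding = nn.Embedding(vocab_size, embed_dim)
      head = nn.Linear(embed_dim, 1)

      x = torch.randint(0, vocab_size, (4, 7))   # (batch, seq) token ids
      e = embedding(x)                            # (batch, seq, embed_dim)
      e.retain_grad()                             # keep the gradient w.r.t. the embedded input
      head(e).sum().backward()

      grad_onehot = e.grad @ embedding.weight.T   # (batch, seq, vocab_size), one_hot never built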

    I've updated the batch_jobs allocated time and also annotated how much time each job actually took. Looks like we could reduce the memory allocation too, but it shouldn't really matter since 1GPU=32GB RAM on Beluga anyway.

    opened by AndreasMadsen 4
  • upgrade and restrict modules

    ~~I sent a support request to make torchtext=0.9.1 available and fix sklearn for python 3.8; this needs to happen before this can land.~~

    This removes 'en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz' from setup.py. It is still a dependency, but it's such a pain to deal with that it's probably best to just install it locally with python -m spacy download en_core_web_sm.

    opened by AndreasMadsen 3
  • Cedar Setup

    Creates a script for each cluster and cpu/gpu combination. There is then a bash function that detects which one to use. There is very little difference between them, so maybe there is a better way, but this was the simplest I could think of for now.

    Note there is no python_cedar_cpu_job.sh because we don't have an allocation on cedar. We could add a script that just uses the default allocation.

    opened by AndreasMadsen 2
  • ROAR v2: Gradient, Recursive ROAR [WIP]

    Features:

    • [x] Implemented Gradient Importance Measure
    • [x] Implemented recursive ROAR
    • [x] Merged ROAR and non-roar experiment scripts
    • [x] Compute Canada integration

    Issues:

    • ~~Times too short for gradient measure (IMDB and Babi-3).~~
    • Mimic vocab size and embedding size don't match (diabetes: 18881 vs 18883, anemia: 16261 vs 16263). This directly prevents gradient calculations. The underlying bug probably affects the MIMIC results a bit.
    • Checkpoints are not found in the recursive case, even though the files are there (Mimic-d, IMDB, Babi). Kind of confused why only some job categories suffer.
    opened by AndreasMadsen 2
  • Issue with gzip and uncompressing

    This is a minor issue and can potentially be dealt with later; this is mostly for documentation.

    When uncompressing some gzipped CSVs (as done in export/riemann_samples.py), the gzip module sometimes throws a BadGzipFile exception. This looks to be caused by "trailing garbage" appended to the end of some gzip files. Additional information on this issue can be found here:

    • https://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data
    • https://bugs.python.org/issue24301#msg245369

    To reproduce this issue, execute the following on Beluga:

    import pandas as pd
    
    pd.read_csv("/scratch/anmadc/shared/babi-2-pre_s-0_m-i_rs-50.csv.gz", compression="infer")
    

    which should output:

    BadGzipFile: Not a gzipped file (b'\x17\xc2')
    

    For now, I've avoided this issue by manually uncompressing /scratch/anmadc/shared/babi-2-pre_s-0_m-i_rs-50.csv.gz using gunzip. Some of the MIMIC files may have this issue.
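
    An alternative that stays in Python (untested, and it assumes the file is a single valid gzip member followed by junk bytes) would be to decompress only the first member with zlib and ignore the rest:

    import io
    import zlib

    import pandas as pd

    path = "/scratch/anmadc/shared/babi-2-pre_s-0_m-i_rs-50.csv.gz"
    with open(path, "rb") as f:
        raw = f.read()

    # zlib stops after the first gzip member; trailing bytes end up in .unused_data
    # instead of raising BadGzipFile.
    data = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS).decompress(raw)
    df = pd.read_csv(io.BytesIO(data))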

    opened by ncmeade 1
  • TODO before submitting to conference

    must do

    • [ ] Remove references to compute canada, i.e. delete batch_jobs, *_job.sh, and Makefile from the zip.
    • [x] Rename comp550 to nlproar or something similar.
    • [ ] Anonymize setup.py in the zip.

    should do

    • [x] update doc-strings
    opened by AndreasMadsen 0