Measuring if attention is explanation with ROAR

Overview

NLP ROAR Interpretability

Official code for: Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

[Figure: ROAR and Recursive ROAR faithfulness curves]

Install

git clone https://github.com/AndreasMadsen/nlp-roar-interpretability.git
cd nlp-roar-interpretability
python -m pip install -e .

Experiments

Tasks

There are scripts for each dataset. Note that some tasks share a dataset. Use this list to identify how to train a model for each task.

  • SST: python experiments/stanford_sentiment.py
  • SNLI: python experiments/stanford_nli.py
  • IMDB: python experiments/imdb.py
  • MIMIC (Diabetes): python experiments/mimic.py --subset diabetes
  • MIMIC (Anemia): python experiments/mimic.py --subset anemia
  • bAbI-1: python experiments/babi.py --task 1
  • bAbI-2: python experiments/babi.py --task 2
  • bAbI-3: python experiments/babi.py --task 3

Parameters

Each of the above scripts (stanford_sentiment, stanford_nli, imdb, mimic, and babi) takes the same set of CLI arguments. You can learn about each argument with --help. The most important arguments, which allow you to run the experiments presented in the paper, are:

  • --importance-measure: specifies which importance measure is used. It can be one of random, mutual-information, attention, gradient, or integrated-gradient.
  • --seed: specifies the seed used to initialize the model.
  • --roar-strategy: whether ROAR masking should be done in absolute terms (count) or relative terms (quantile).
  • --k: the proportion of tokens (in %) to mask if --roar-strategy quantile is used, or the number of tokens to mask if --roar-strategy count is used.
  • --recursive: indicates that the model used for computing the importance measure was trained with --k set to --k minus --recursive-step-size, instead of 0 as in classic ROAR.

Note that for --k > 0, the reference model must already be trained. For example, in the non-recursive case this means that a model trained with --k 0 must already be available.
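
For example, a sequence of runs for SST could look like the following (the flag values here are only illustrative; consult --help for defaults such as --recursive-step-size):

# 1. Baseline model, used as the reference for any --k > 0 run.
python experiments/stanford_sentiment.py --seed 0 --k 0

# 2. Classic ROAR: mask the 10% allegedly most important tokens according to attention.
python experiments/stanford_sentiment.py --seed 0 --importance-measure attention --roar-strategy quantile --k 10

# 3. Recursive ROAR: the importance measure comes from the model trained at the previous --k step.
python experiments/stanford_sentiment.py --seed 0 --importance-measure attention --roar-strategy quantile --k 20 --recursive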

Running on a HPC setup

For downloading dataset dependencies we provide a download.sh script.

Additionally, we provide scripts in batch_jobs/ for submitting all jobs to a Slurm queue. Note again that the ROAR scripts assume there are checkpoints for the baseline --k 0 models.

The jobs automatically use $SCRATCH/nlproar as the persistent directory.

MIMIC

See https://mimic.physionet.org/gettingstarted/access/ for how to get access to MIMIC. You will need to download DIAGNOSES_ICD.csv.gz and NOTEEVENTS.csv.gz and place them in mimic/ relative to your persistent directory.
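
For example, assuming the default $SCRATCH/nlproar persistent directory mentioned above, the expected layout would be:

$SCRATCH/nlproar/mimic/DIAGNOSES_ICD.csv.gz
$SCRATCH/nlproar/mimic/NOTEEVENTS.csv.gz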

Comments
  • Run multiple seeds per job

    Just an idea, because I don't think we are supposed to run a ton of 2-5 minute jobs. But it might be too complex to set up.

    ~~work in progress: I didn't test it at all. It is kind of a lot of bash magic, and I'm not strong in bash. I would probably do the same for babi and imdb.~~

    Based on https://github.com/AndreasMadsen/python-comp550-interpretability/pull/33

    opened by AndreasMadsen 9
  • implement importance-measure cache and set riemann_samples=50

    Given that we will be setting riemann_samples=50, which can take 1.5h just to compute, we could cache the importance measure, which will benefit the non-recursive case.

    Still need to set up the slurm job dependencies and make sure it actually works.

    opened by AndreasMadsen 8
  • Abstract datasets into base-classes

    I need the datasets to be subclasses for TorchScript support, and we will need it anyway if we add more datasets. Unfortunately the cache will have to be rebuilt, because this changes some of the .pkl file formats and filenames.

    • This fixes #14.
    • This sets the seed for some of the split generation. That could probably be done better, but at least it is set.
    • For babi it just hardcodes the labels, instead of computing them dynamically. That simplifies things a lot. I noticed there were some odd labels; I don't know if they are actually used, but it may be worth looking into.
    • If the encoded dataset and vocab exist, it will not build any of the intermediate files. Should make syncing less expensive.
    • The test split now uses the test dataset instead of the validation dataset (fixes a recently introduced bug).

    ~~work in progress: Everything should work. I ran a few iterations locally and rebuilt the cache. I need to rerun experiments on compute-canada to make sure it works completely.~~

    opened by AndreasMadsen 7
  • Attention sparsity

    Scripts for quantitatively evaluating the sparsity of attention.

    The general idea is to compute the number of tokens that make up the largest 95% of the attention mass (i.e. the smallest number of tokens that have a cumulative mass of 0.95).
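
    For illustration, a minimal sketch of that computation (the helper name is made up, and it assumes a 1D attention vector per example):

    import numpy as np

    def tokens_for_mass(attention, mass=0.95):
        """Smallest number of tokens whose normalized attention sums to at least `mass`."""
        weights = np.sort(np.asarray(attention, dtype=np.float64))[::-1]  # largest first
        cumulative = np.cumsum(weights / weights.sum())                   # cumulative attention mass
        return int(np.searchsorted(cumulative, mass) + 1)

    # A peaked distribution needs few tokens; a uniform one needs nearly all of them.
    print(tokens_for_mass([0.25, 0.25, 0.25, 0.25]))  # -> 4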

    Other ideas for quantifying the sparsity of attention:

    • Compute Gini index for each attention distribution (from each example) in each dataset and aggregate.
    opened by ncmeade 5
  • Compute importance measure in mini-batches

    This should work. I've tested batch_size=8 against batch_size=1 and got identical results.

    ~~What is left is to run this on compute-canada and make sure there are no out-of-memory issues. Also, maybe compute_batch_size should just be base_dataset.batch_size. I will test that too.~~

    opened by AndreasMadsen 5
  • Riemann samples

    Jobs to pick a suitable Riemann sample size for integrated gradients.

    ~~Branched from #17. I believe the merge order will now be #18 -> #17 -> #24.~~

    wip 
    opened by ncmeade 4
  • Ready for compute

    ~~Based on https://github.com/AndreasMadsen/python-comp550-interpretability/pull/22, which will need to land first.~~

    Probably the last fixes I will make before submitting to compute-canada.

    Finally managed to make some significant progress on the performance issues:

    • Use TorchScript for ROAR computation. Only makes a small difference.
    • Load and save lists instead of torch.tensor in the ROAR .pkl. This makes a large difference for the garbage collector; for some reason, torch.tensor objects loaded via pickle do not seem to be GC'ed correctly.
    • Use pin_memory=True in dataloader. Appears to make training a bit faster and is generally recommended for performance on CUDA devices.
    • Compute gradients wrt. the embedding instead of the one_hot encoding. This makes a huge difference in both memory and performance. Constructing the one_hot encoding took up an enormous amount of memory and had to be done on the CPU, so transferring it to the GPU was also slow. The new version gives identical gradients, but computes gradients over the embedding layer manually (see the sketch below).
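
      A rough sketch of that trick (not the actual code in this PR; module names and shapes here are made up):

      import torch
      import torch.nn as nn

      # For e = onehot(x) @ E, the chain rule gives dL/d(onehot) = (dL/de) @ E.T,
      # so the huge one-hot tensor never has to be materialized or moved to the GPU.
      vocab_size, embed_dim = 100, 16
      embedding = nn.Embedding(vocab_size, embed_dim)
      head = nn.Linear(embed_dim, 1)

      x = torch.randint(0, vocab_size, (4, 7))   # (batch, seq) token ids
      e = embedding(x)                            # (batch, seq, embed_dim)
      e.retain_grad()                             # keep the gradient w.r.t. the embedded input
      head(e).sum().backward()

      grad_onehot = e.grad @ embedding.weight.T   # (batch, seq, vocab_size), one_hot never built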

    I've updated the batch_jobs allocated time and also annotated how much time each job actually took. Looks like we could reduce the memory allocation too, but it shouldn't really matter since 1GPU=32GB RAM on Beluga anyway.

    opened by AndreasMadsen 4
  • upgrade and restrict modules

    ~~I sent a support request to make torchtext=0.9.1 available and fix sklearn for python 3.8; this needs to happen before this can land.~~

    This removes 'en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz' from setup.py. It is still a dependency, but it's such a pain to deal with that it's probably best to just install it locally with python -m spacy download en_core_web_sm.

    opened by AndreasMadsen 3
  • Cedar Setup

    Creates a script for each cluster and cpu/gpu combination. There is then a bash function that detects which one to use. There is very little difference between them, so maybe there is a better way, but this was the simplest I could think of for now.

    Note there is no python_cedar_cpu_job.sh because we don't have an allocation on cedar. We could add a script that just uses the default allocation.

    opened by AndreasMadsen 2
  • ROAR v2: Gradient, Recursive ROAR [WIP]

    Features:

    • [x] Implemented Gradient Importance Measure
    • [x] Implemented recursive ROAR
    • [x] Merged ROAR and non-roar experiment scripts
    • [x] Compute Canada integration

    Issues:

    • ~~Times too short for gradient measure (IMDB and Babi-3).~~
    • Mimic vocab size and embedding size don't match (diabetes: 18881 vs 18883, anemia: 16261 vs 16263). This directly prevents gradient calculations. The underlying bug probably affects the MIMIC results a bit.
    • Checkpoints are not found in the recursive case, even though the files are there (Mimic-d, IMDB, Babi). Kind of confused why only some job categories suffer.
    opened by AndreasMadsen 2
  • Issue with gzip and uncompressing

    This is a minor issue and can potentially be dealt with later; this is mostly for documentation.

    When uncompressing some gzipped CSVs (as done in export/riemann_samples.py), the gzip module sometimes throws a BadGzipFile exception. This looks to be caused by "trailing garbage" appended to the end of some gzip files. Additional information on this issue can be found here:

    • https://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data
    • https://bugs.python.org/issue24301#msg245369

    To reproduce this issue, execute the following on Beluga:

    import pandas as pd
    
    pd.read_csv("/scratch/anmadc/shared/babi-2-pre_s-0_m-i_rs-50.csv.gz", compression="infer")
    

    which should output:

    BadGzipFile: Not a gzipped file (b'\x17\xc2')
    

    For now, I've avoided this issue by manually uncompressing /scratch/anmadc/shared/babi-2-pre_s-0_m-i_rs-50.csv.gz using gunzip. Some of the MIMIC files may have this issue.
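
    An alternative that stays in Python (untested, and it assumes the file is a single valid gzip member followed by junk bytes) would be to decompress only the first member with zlib and ignore the rest:

    import io
    import zlib

    import pandas as pd

    path = "/scratch/anmadc/shared/babi-2-pre_s-0_m-i_rs-50.csv.gz"
    with open(path, "rb") as f:
        raw = f.read()

    # zlib stops after the first gzip member; trailing bytes end up in .unused_data
    # instead of raising BadGzipFile.
    data = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS).decompress(raw)
    df = pd.read_csv(io.BytesIO(data))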

    opened by ncmeade 1
  • TODO before submitting to conference

    must do

    • [ ] Remove references to compute canada, i.e. delete batch_jobs, *_job.sh, and Makefile from the zip.
    • [x] Rename comp550 to nlproar or something similar.
    • [ ] Anonymize setup.py in the zip.

    should do

    • [x] update doc-strings
    opened by AndreasMadsen 0