BERT Attention Analysis

Overview

This repository contains code for What Does BERT Look At? An Analysis of BERT's Attention. It includes code for getting attention maps from BERT and writing them to disk, analyzing BERT's attention in general (sections 3 and 6 of the paper), and comparing its attention to dependency syntax (sections 4.2 and 5). We will add the code for the coreference resolution analysis (section 4.3 of the paper) soon!

Requirements

For extracting attention maps from text:

  • TensorFlow
  • NumPy

Additional requirements for the attention analysis:

  • Jupyter
  • MatplotLib
  • seaborn
  • scikit-learn

Attention Analysis

Syntax_Analysis.ipynb and General_Analysis.ipynb contain code for analyzing BERT's attention, including reproducing the figures and tables in the paper.

You can download the data needed to run the notebooks (including BERT attention maps on Wikipedia and the Penn Treebank) from here. However, note that the Penn Treebank annotations are not freely available, so the Penn Treebank data only includes dummy labels. If you want to run the analysis on your own data, you can use the scripts described below to extract BERT attention maps.

Extracting BERT Attention Maps

We provide a script for running BERT over text and writing the resulting attention maps to disk. The input data should be a JSON file containing a list of dicts, each one corresponding to a single example to be passed in to BERT. Each dict must contain exactly one of the following fields:

  • "text": A string.
  • "words": A list of strings. Needed if you want word-level rather than token-level attention.
  • "tokens": A list of strings corresponding to BERT wordpiece tokenization.

If "tokens" is provided, the script expects the [CLS]/[SEP] tokens to already have been added; otherwise it adds them to the beginning/end of the text automatically. Note that if an example is longer than max_sequence_length tokens after BERT wordpiece tokenization, attention maps will not be extracted for it. Attention extraction adds two additional fields to each dict:

  • "attns": A numpy array of size [num_layers, heads_per_layer, sequence_length, sequence_length] containing attention weights.
  • "tokens": If "tokens" was not already provided for the example, the BERT-wordpiece-tokenized text (list of strings).

Other fields already in the feature dicts will be preserved. For example, if each dict has a "tags" key containing POS tags, the tags will stay in the data after attention extraction so they can be used when analyzing it.
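
As a concrete illustration, the following sketch builds a small input file in the format described above (the file name and example content are hypothetical):

    import json

    examples = [
        # Plain text: the script tokenizes it with the BERT wordpiece tokenizer.
        {"text": "The cat sat on the mat."},
        # Pre-split words: required for word-level attention (--word_level).
        # Extra fields such as "tags" are preserved through attention extraction.
        {"words": ["The", "dog", "barked", "."],
         "tags": ["DT", "NN", "VBD", "."]},
    ]

    with open("my_data.json", "w") as f:
        json.dump(examples, f)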

Attention extraction is run with

python extract_attention.py --preprocessed-data-file <path-to-preprocessed-data.json> --bert-dir <path-to-BERT-model-directory>

The following optional arguments can also be added:

  • --max_sequence_length: Maximum input sequence length after tokenization (default is 128).
  • --batch_size: Batch size when running BERT over examples (default is 16).
  • --debug: Use a tiny BERT model for fast debugging.
  • --cased: Do not lowercase the input text.
  • --word_level: Compute word-level instead of token-level attention (see Section 4.1 of the paper).
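
For example, to extract word-level attention from cased text with a longer maximum sequence length (the paths below are placeholders):

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/my_data.json --bert-dir $ATTN_DATA_DIR/cased_L-12_H-768_A-12 --max_sequence_length 256 --cased --word_level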

The feature dicts with the added attention maps (numpy arrays with shape [n_layers, n_heads_per_layer, n_tokens, n_tokens]) are written to a pickle file named after the input file with an _attn.pkl suffix (e.g., unlabeled.json produces unlabeled_attn.pkl).
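
Once extraction finishes, the pickle can be inspected with standard Python. A minimal sketch, assuming attention maps for unlabeled.json were written to unlabeled_attn.pkl:

    import pickle

    with open("unlabeled_attn.pkl", "rb") as f:
        examples = pickle.load(f)  # list of feature dicts

    first = examples[0]
    attn = first["attns"]  # [n_layers, n_heads_per_layer, n_tokens, n_tokens]
    print(len(first["tokens"]), attn.shape)
    # Each row of attention weights should sum to (approximately) 1.
    print(attn[0, 0].sum(axis=-1))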

Pre-processing Scripts

We include two pre-processing scripts for going from a raw data file to JSON that can be supplied to extract_attention.py.

preprocess_unlabeled.py does BERT-pre-training-style preprocessing for unlabeled text (i.e., taking two consecutive text spans, truncating them so they are at most max_sequence_length tokens, and adding [CLS]/[SEP] tokens). Each line of the input data file should be one sentence. Documents should be separated by empty lines. Example usage:

python preprocess_unlabeled.py --data-file $ATTN_DATA_DIR/unlabeled.txt --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12

will create the file $ATTN_DATA_DIR/unlabeled.json containing pre-processed data. After pre-processing, you can run extract_attention.py to get attention maps, e.g.,

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/unlabeled.json --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12
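
For reference, the raw unlabeled.txt file might contain something like the following (hypothetical sentences), with one sentence per line and a blank line between documents:

    The quick brown fox jumps over the lazy dog.
    It then takes a nap.

    BERT is a Transformer-based language model.
    Its attention heads can be inspected individually.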

preprocess_depparse.py pre-processes dependency parsing data. Dependency parsing data should consist of two files, train.txt and dev.txt, under a common directory. Each line in the files should contain a word followed by a space followed by <index_of_head>-<dependency_label> (e.g., 0-root). Examples should be separated by empty lines. Example usage:

python preprocess_depparse.py --data-dir $ATTN_DATA_DIR/depparse
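
For reference, a single sentence in train.txt or dev.txt might look like the following hypothetical example (assuming 1-based head indices, with 0 reserved for the root):

    The 2-det
    cat 3-nsubj
    slept 0-root
    soundly 3-advmod
    . 3-punct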

After pre-processing, you can run extract_attention.py to get attention maps, e.g.,

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/depparse/dev.json --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12 --word_level

Computing Distances Between Attention Heads

head_distances.py computes the average Jensen-Shannon divergence between the attention weights of all pairs of attention heads and writes the results to disk as a numpy array of shape [n_heads, n_heads]. These distances can be used to cluster BERT's attention heads (see Section 6 and Figure 6 of the paper; code for doing this clustering is in General_Analysis.ipynb). Example usage (requires that attention maps have already been extracted):

python head_distances.py --attn-data-file $ATTN_DATA_DIR/unlabeled_attn.pkl --outfile $ATTN_DATA_DIR/head_distances.pkl
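
For intuition, the distance between two heads is the Jensen-Shannon divergence between their attention distributions, averaged over token positions. A minimal NumPy sketch of the metric (an illustration, not the repository's implementation):

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        # Jensen-Shannon divergence between two discrete distributions.
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def head_distance(attn, layer_a, head_a, layer_b, head_b):
        # attn has shape [n_layers, n_heads_per_layer, n_tokens, n_tokens] for one example.
        # Average the divergence between the two heads' attention distributions
        # over all token positions.
        p, q = attn[layer_a, head_a], attn[layer_b, head_b]
        return np.mean([js_divergence(p[i], q[i]) for i in range(p.shape[0])])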

Citation

If you find the code or data helpful, please cite the original paper:

@inproceedings{clark2019what,
  title = {What Does BERT Look At? An Analysis of BERT's Attention},
  author = {Kevin Clark and Urvashi Khandelwal and Omer Levy and Christopher D. Manning},
  booktitle = {BlackBoxNLP@ACL},
  year = {2019}
}

Contact

Kevin Clark (@clarkkev).

Comments
  • How to deal with the OOV words

    My sample contains the word 'Silsby', but it does not exist in the vocab. How should I deal with this OOV situation?

    python extract_attention.py --preprocessed-data-file samples_10.json --bert-dir data/cased_L-12_H-768_A-12  --max_sequence_length 256 --word_level --cased
    
    Creating examples...
    Traceback (most recent call last):
      File "extract_attention.py", line 144, in <module>
        main()
      File "extract_attention.py", line 108, in main
        example = Example(features, tokenizer, args.max_sequence_length)
      File "extract_attention.py", line 29, in __init__
        self.input_ids = tokenizer.convert_tokens_to_ids(self.tokens)
      File "/Users/smap10/Project/attention-analysis-master/bert/tokenization.py", line 182, in convert_tokens_to_ids
        return convert_by_vocab(self.vocab, tokens)
      File "/Users/smap10/Project/attention-analysis-master/bert/tokenization.py", line 143, in convert_by_vocab
        output.append(vocab[item])
    KeyError: 'Silsby'
    
    opened by BrambleXu 6
  • Fix pickle unicode error

    • error message: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
    • ref: https://stackoverflow.com/questions/11305790/pickle-incompatibility-of-numpy-arrays-between-python-2-and-3
    opened by insop 2
  • for extract_attention.py, --word_level attention doesn't work

    I used preprocess_unlabeled.py to create my JSON file. Can you provide a preprocessing script that supports word-level attention?

    When I use this file and pass the argument --word_level to extract_attention, I get the following error:

    Converting to word-level attention...
    Traceback (most recent call last):
      File "extract_attention.py", line 144, in <module>
        main()
      File "extract_attention.py", line 134, in main
        feature_dicts_with_attn, tokenizer, args.cased)
      File "/home/manasi/Documents/BERT/attention-analysis-master/bpe_utils.py", line 74, in make_attn_word_level
        words_to_tokens = tokenize_and_align(tokenizer, features["words"], cased)
    KeyError: 'words'

    Please help me fix it.

    opened by ManasiPat 2
  • File missing

    First off, I want to say this gives great insight into the BERT model. But there seems to be a file missing from General_Analysis: create_wiki_data.py, which as mentioned can be used to generate one's own wiki data. It would be quite helpful if you could provide it.

    opened by imr555 2
  • An explanation for head_distances.py

    Thank you for releasing the code.

    An explanation of head_distances.py is missing from the README. Could you add one?

    A small question: in head_distances.py, shouldn't the line utils.write_pickle(js_distances, args.outfile) be outside the for loop for i, doc in enumerate(data):?

    opened by tomohideshibata 2
  • While extracting attention weights, are segment_ids set to all zeros?

    The code at the line linked below sets the segment ids to zero for both segments; is this a bug?

    https://github.com/clarkkev/attention-analysis/blob/7b4ed20b2c58a211970ffc19c2d957b2c35ea0ea/extract_attention.py#L30

    opened by HuXiangkun 1
  • a question about preprocess_depparse.py

    In the README, I can see: "Each line in the files should contain a word followed by a space followed by <index_of_head>-<dependency_label> (e.g., 0-root)."

    So how can I get the index_of_head, and what does it mean?

    Should I first know which of the heads (12*12) in BERT serves which syntactic function, and then judge it? Sorry for my poor English.

    opened by fudanchenjiahao 1
  • Solve error if path does not contain a directory

    Solve error if path does not contain a directory

    Traceback (most recent call last):
      File "extract_attention.py", line 144, in <module>
        main()
      File "extract_attention.py", line 139, in main
        utils.write_pickle(feature_dicts_with_attn, outpath)
      File "/Users/smap10/Project/attention-analysis-master/utils.py", line 31, in write_pickle
        pickle.dump(o, f, -1)
      File "/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 106, in write
        self._prewrite_check()
      File "/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 92, in _prewrite_check
        compat.as_bytes(self.__name), compat.as_bytes(self.__mode))
    tensorflow.python.framework.errors_impl.FailedPreconditionError: samples_10_no_text_attn.pkl; Is a directory
    
    opened by BrambleXu 0
  • Interpretation of figures

    Can you please clarify what each number (8, 10) stands for in "Head 8-10" in Figure 5 of your article, i.e., the layer or the head number? Thank you very much for your clarification.

    opened by JiahuiSophieHU 0
  • Fixed bugs and sync script params with README.md text

    1. Using a unique _ separator in preprocess_unlabeled.py (in sync with README.md);
    2. Documenting the --num_docs arg in README.md;
    3. Fixing a bug with JSON writing when the path contains only a filename.
    opened by nicolay-r 0
  • Typo in instructions

    The instructions say:

    We include two pre-processing scripts for going from a raw data file to JSON that can be supplied to attention_extractor.py

    I think what is meant is the extract_attention.py script in the repo?

    opened by bustrofedico 0
  • Bug in adding dummy word_repr for root

    It should be inserted at index 0 along the concatenation dimension, but currently it is added at the last index. Since the attention approximated for ROOT (via the added start/end tokens) is at index 0, the word representation for ROOT should also be at index 0.

    By fixing this bug I got around 3% higher UAS.

    word_reprs = tf.concat([word_reprs, tf.zeros((n_words, 1, 200))], 1) # dummy for ROOT

    Should be replaced with,

    word_reprs = tf.concat([tf.zeros((n_words, 1, 200)), word_reprs], 1) # dummy for ROOT

    opened by keyurfaldu 0
  • Coreference Resolution

    Any plans to release the code for the coreference analysis in the paper?

    Alternatively, would it be possible to explain the methodology? Mainly how the "head" word is chosen and what exact set of antecedents is used.

    opened by ameet-1997 0