BERT Attention Analysis

Kevin Clark

Last update: Dec 11, 2022

This repository contains code for the paper "What Does BERT Look At? An Analysis of BERT's Attention." It includes code for getting attention maps from BERT and writing them to disk, analyzing BERT's attention in general (sections 3 and 6 of the paper), and comparing its attention to dependency syntax (sections 4.2 and 5). We will add the code for the coreference resolution analysis (section 4.3 of the paper) soon!

Requirements

For extracting attention maps from text:

Additional requirements for the attention analysis:

Attention Analysis

Syntax_Analysis.ipynb and General_Analysis.ipynb contain code for analyzing BERT's attention, including reproducing the figures and tables in the paper.

You can download the data needed to run the notebooks (including BERT attention maps on Wikipedia and the Penn Treebank) from here. However, note that the Penn Treebank annotations are not freely available, so the Penn Treebank data only includes dummy labels. If you want to run the analysis on your own data, you can use the scripts described below to extract BERT attention maps.

Extracting BERT Attention Maps

We provide a script for running BERT over text and writing the resulting attention maps to disk. The input data should be a JSON file containing a list of dicts, each one corresponding to a single example to be passed in to BERT. Each dict must contain exactly one of the following fields:

"text": A string.
"words": A list of strings. Needed if you want word-level rather than token-level attention.
"tokens": A list of strings corresponding to BERT wordpiece tokenization.
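As a minimal sketch of the input format described above, the following builds one example per supported field, checks that each dict carries exactly one of them, and serializes the list to JSON. The sample sentences, the extra "tags" key, and the helper names validate_example and write_examples are hypothetical and only illustrate the format.

```python
import json

INPUT_FIELDS = ("text", "words", "tokens")


def validate_example(example):
    """Return the name of the single input field present in an example dict."""
    present = [field for field in INPUT_FIELDS if field in example]
    if len(present) != 1:
        raise ValueError(
            "expected exactly one of %s, found %s" % (INPUT_FIELDS, present))
    return present[0]


def write_examples(path, examples):
    """Validate every example, then write the list of dicts as JSON."""
    for example in examples:
        validate_example(example)
    with open(path, "w") as f:
        json.dump(examples, f)


# Hypothetical examples, one per supported input field.
examples = [
    {"text": "The cat sat on the mat."},
    {"words": ["The", "cat", "sat", "."], "tags": ["DT", "NN", "VBD", "."]},
    {"tokens": ["[CLS]", "the", "cat", "sat", ".", "[SEP]"]},
]

serialized = json.dumps(examples)
```

Calling write_examples with a path of your choice produces a file that can be passed as the preprocessed data file.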

If the present field is "tokens," the script expects [CLS]/[SEP] tokens to be already added; otherwise it adds these tokens to the beginning/end of the text automatically. Note that if an example is longer than max_sequence_length tokens after BERT wordpiece tokenization, attention maps will not be extracted for it. Attention extraction adds two additional fields to each dict:

"attns": A numpy array of size [num_layers, heads_per_layer, sequence_length, sequence_length] containing attention weights.
"tokens": If "tokens" was not already provided for the example, the BERT-wordpiece-tokenized text (list of strings).

Other fields already in the feature dicts will be preserved. For example if each dict has a tags key containing POS tags, they will stay in the data after attention extraction so they can be used when analyzing the data.
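The handling of special tokens, over-length examples, and extra fields described above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: whitespace splitting stands in for BERT wordpiece tokenization, and prepare_example is a hypothetical helper name.

```python
def prepare_example(example, max_sequence_length=128):
    """Return a copy of the example with "tokens" set, or None if it is too long."""
    if "tokens" in example:
        # [CLS]/[SEP] are expected to be present already in this case.
        tokens = list(example["tokens"])
    else:
        if "text" in example:
            text = example["text"]
        else:
            text = " ".join(example["words"])
        # ASSUMPTION: whitespace splitting is a stand-in for wordpiece tokenization.
        tokens = ["[CLS]"] + text.split() + ["[SEP]"]

    if len(tokens) > max_sequence_length:
        return None  # no attention maps are extracted for over-length examples

    prepared = dict(example)  # other fields such as "tags" are preserved
    prepared["tokens"] = tokens
    return prepared


prepared = prepare_example({"words": ["The", "cat"], "tags": ["DT", "NN"]})
```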

Attention extraction is run with

python extract_attention.py --preprocessed-data-file <path-to-your-data-file> --bert-dir <directory-containing-BERT-model>

The following optional arguments can also be added:

--max_sequence_length: Maximum input sequence length after tokenization (default is 128).
--batch_size: Batch size when running BERT over examples (default is 16).
--debug: Use a tiny BERT model for fast debugging.
--cased: Do not lowercase the input text.
--word_level: Compute word-level instead of token-level attention (see Section 4.1 of the paper).
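For the --word_level option, Section 4.1 of the paper converts token-level maps to word-level ones by summing the attention to a split-up word's tokens and averaging the attention from them. A minimal NumPy sketch of that conversion for a single head is shown below; the function name to_word_level and the words_to_tokens alignment format (one list of token indices per word) are assumptions made for illustration.

```python
import numpy as np


def to_word_level(attn, words_to_tokens):
    """Convert a [n_tokens, n_tokens] attention map to [n_words, n_words].

    attn[i, j] is the attention from token i to token j.
    words_to_tokens[w] lists the token indices that make up word w.
    """
    # Attention TO a word: sum the columns of its tokens.
    to_words = np.stack(
        [attn[:, idxs].sum(axis=1) for idxs in words_to_tokens], axis=1)
    # Attention FROM a word: average the rows of its tokens.
    return np.stack(
        [to_words[idxs].mean(axis=0) for idxs in words_to_tokens], axis=0)


# Hypothetical 3-token example in which the second word spans tokens 1 and 2.
token_attn = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
word_attn = to_word_level(token_attn, [[0], [1, 2]])
```

Summing over target tokens and averaging over source tokens keeps each row of the word-level map a probability distribution.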

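The feature dicts with added attention maps (numpy arrays with shape [n_layers, n_heads_per_layer, n_tokens, n_tokens]) are written to a pickle file named after the input data file with the suffix _attn.pkl (e.g., unlabeled.json yields unlabeled_attn.pkl).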
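Reading the extracted attention back could look like the sketch below. It assumes the output is an ordinary pickle of a list of feature dicts; to stay self-contained it first writes a tiny stand-in file with uniform attention, and load_attention is a hypothetical helper name.

```python
import os
import pickle
import tempfile

import numpy as np


def load_attention(path):
    """Load a list of feature dicts, each with "tokens" and "attns"."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in data: 2 layers, 3 heads, 4 tokens, uniform attention weights.
n_layers, n_heads, n_tokens = 2, 3, 4
dummy = [{
    "tokens": ["[CLS]", "hello", "world", "[SEP]"],
    "attns": np.full((n_layers, n_heads, n_tokens, n_tokens), 1.0 / n_tokens),
}]
path = os.path.join(tempfile.mkdtemp(), "unlabeled_attn.pkl")
with open(path, "wb") as f:
    pickle.dump(dummy, f, -1)

loaded = load_attention(path)
for feature_dict in loaded:
    attns = feature_dict["attns"]
    head_attn = attns[0, 1]  # attention map of layer 0, head 1
    print(feature_dict["tokens"], attns.shape, head_attn.shape)
```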

Pre-processing Scripts

We include two pre-processing scripts for going from a raw data file to JSON that can be supplied to extract_attention.py.

preprocess_unlabeled.py does BERT-pre-training-style preprocessing for unlabeled text (i.e., taking two consecutive text spans, truncating them so they are at most max_sequence_length tokens, and adding [CLS]/[SEP] tokens). Each line of the input data file should be one sentence. Documents should be separated by empty lines. Example usage:

python preprocess_unlabeled.py --data-file $ATTN_DATA_DIR/unlabeled.txt --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12

will create the file $ATTN_DATA_DIR/unlabeled.json containing pre-processed data. After pre-processing, you can run extract_attention.py to get attention maps, e.g.,

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/unlabeled.json --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12
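The raw input layout expected for unlabeled text (one sentence per line, documents separated by empty lines) can be illustrated with a small reader. This is only a sketch of the file format, not the logic of preprocess_unlabeled.py; read_documents and the sample text are hypothetical.

```python
def read_documents(lines):
    """Group sentences into documents; an empty line ends a document."""
    documents, current = [], []
    for line in lines:
        sentence = line.strip()
        if sentence:
            current.append(sentence)
        elif current:
            documents.append(current)
            current = []
    if current:  # the last document may not be followed by an empty line
        documents.append(current)
    return documents


sample = """First sentence of document one.
Second sentence of document one.

Only sentence of document two.
"""
docs = read_documents(sample.splitlines())
```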

preprocess_depparse.py pre-processes dependency parsing data. Dependency parsing data should consist of two files train.txt and dev.txt under a common directory. Each line in the files should contain a word followed by a space followed by <index_of_head>-<dependency_label> (e.g., 0-root). Examples should be separated by empty lines. Example usage:

python preprocess_depparse.py --data-dir $ATTN_DATA_DIR/depparse

After pre-processing, you can run extract_attention.py to get attention maps, e.g.,

python extract_attention.py --preprocessed-data-file $ATTN_DATA_DIR/depparse/dev.json --bert-dir $ATTN_DATA_DIR/uncased_L-12_H-768_A-12 --word_level
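A reader for the dependency parsing input format might look like the sketch below, where each line holds a word, a space, and a label made of the head index and the dependency relation joined by a hyphen (as in 0-root). It assumes the common convention that head indices are 1-based word positions with 0 reserved for the root; read_depparse and the sample sentence are hypothetical, and this is not the code of preprocess_depparse.py.

```python
def read_depparse(lines):
    """Parse 'word head-label' lines into dicts of words, heads, and relations."""
    examples = []
    current = {"words": [], "heads": [], "relns": []}
    for line in list(lines) + [""]:  # the trailing blank flushes the last example
        line = line.strip()
        if line:
            word, label = line.split()
            head, reln = label.split("-", 1)
            current["words"].append(word)
            current["heads"].append(int(head))  # ASSUMPTION: 0 marks the root
            current["relns"].append(reln)
        elif current["words"]:
            examples.append(current)
            current = {"words": [], "heads": [], "relns": []}
    return examples


sample = ["The 2-det", "cat 3-nsubj", "sat 0-root", "", "Dogs 2-nsubj", "bark 0-root"]
parsed = read_depparse(sample)
```

Under that convention, a head value of 3 for "cat" means its syntactic head is the third word of the sentence, "sat".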

Computing Distances Between Attention Heads

head_distances.py computes the average Jensen-Shannon divergence between the attention weights of all pairs of attention heads and writes the results to disk as a numpy array of shape [n_heads, n_heads]. These distances can be used to cluster BERT's attention heads (see Section 6 and Figure 6 of the paper; code for doing this clustering is in General_Analysis.ipynb). Example usage (requires that attention maps have already been extracted):

python head_distances.py --attn-data-file $ATTN_DATA_DIR/unlabeled_attn.pkl --outfile $ATTN_DATA_DIR/head_distances.pkl
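To make the distance concrete, here is a minimal sketch of the Jensen-Shannon divergence between two attention distributions, and of a pairwise head-distance matrix for a single example averaged over token positions. It is an illustration only: head_distances.py averages over a whole dataset and may differ in detail, and the function names and demo values are hypothetical.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_head_distances(attns):
    """attns: [n_heads, n_tokens, n_tokens] -> [n_heads, n_heads] mean JS divergence."""
    n_heads, n_tokens, _ = attns.shape
    distances = np.zeros((n_heads, n_heads))
    for i in range(n_heads):
        for j in range(i + 1, n_heads):
            d = np.mean([js_divergence(attns[i, t], attns[j, t])
                         for t in range(n_tokens)])
            distances[i, j] = distances[j, i] = d
    return distances


# Two hypothetical heads over two tokens.
demo = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
])
demo_distances = pairwise_head_distances(demo)
```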

Citation

If you find the code or data helpful, please cite the original paper:

@inproceedings{clark2019what,
  title = {What Does BERT Look At? An Analysis of BERT's Attention},
  author = {Kevin Clark and Urvashi Khandelwal and Omer Levy and Christopher D. Manning},
  booktitle = {BlackBoxNLP@ACL},
  year = {2019}
}

Contact

Kevin Clark (@clarkkev).

Comments

How to deal with the OOV words

My sample contains the word 'Silsby', but it does not exist in the vocab. How to deal with the OOV situation?

python extract_attention.py --preprocessed-data-file samples_10.json --bert-dir data/cased_L-12_H-768_A-12  --max_sequence_length 256 --word_level --cased

Creating examples...
Traceback (most recent call last):
  File "extract_attention.py", line 144, in <module>
    main()
  File "extract_attention.py", line 108, in main
    example = Example(features, tokenizer, args.max_sequence_length)
  File "extract_attention.py", line 29, in __init__
    self.input_ids = tokenizer.convert_tokens_to_ids(self.tokens)
  File "/Users/smap10/Project/attention-analysis-master/bert/tokenization.py", line 182, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/Users/smap10/Project/attention-analysis-master/bert/tokenization.py", line 143, in convert_by_vocab
    output.append(vocab[item])
KeyError: 'Silsby'

opened by BrambleXu 6
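One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis, is that the sample supplied raw words in the "tokens" field, which is looked up in the vocabulary directly. BERT's wordpiece tokenizer normally handles out-of-vocabulary words by greedy longest-match-first splitting into sub-word pieces, falling back to [UNK]. The sketch below illustrates that algorithm with a toy vocabulary; the pieces shown for 'Silsby' are hypothetical and need not match the real cased vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first wordpiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary for illustration only.
toy_vocab = {"Si", "##ls", "##by", "the"}
split = wordpiece_tokenize("Silsby", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
```

Supplying the example through the "text" or "words" field instead of "tokens" lets the script run this kind of tokenization itself.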

Fix pickle unicode error
error message: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)

ref: https://stackoverflow.com/questions/11305790/pickle-incompatibility-of-numpy-arrays-between-python-2-and-3
opened by insop 2
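The workaround discussed in the linked Stack Overflow thread is to pass an explicit encoding when a pickle written under Python 2 is read under Python 3. A minimal sketch follows; load_py2_pickle is a hypothetical helper, and the byte string is a hand-built protocol-2 pickle of a one-byte Python 2 str used only to reproduce the situation.

```python
import pickle


def load_py2_pickle(data):
    """Unpickle bytes, retrying with latin1 if the default ASCII decoding fails."""
    try:
        return pickle.loads(data)
    except UnicodeDecodeError:
        # Python 2 str objects (and numpy arrays pickled there) hold raw bytes;
        # latin1 maps every byte to a code point, so decoding cannot fail.
        return pickle.loads(data, encoding="latin1")


# Protocol-2 pickle of the Python 2 str '\x84' (hand-built for illustration).
py2_style = b"\x80\x02U\x01\x84q\x00."
recovered = load_py2_pickle(py2_style)

# Pickles written by Python 3 load unchanged through the first branch.
roundtrip = load_py2_pickle(pickle.dumps({"attns": [0.5, 0.5]}))
```

The same encoding argument is accepted by pickle.load when reading from a file object.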

--word_level attention doesn't work in extract_attention.py

I used preprocess_unlabeled.py to create my JSON file. Could you provide a pre-processing script that supports word-level attention?

When I use this file and pass the --word_level argument to extract_attention.py, I get the following error:

Converting to word-level attention...
Traceback (most recent call last):
  File "extract_attention.py", line 144, in <module>
    main()
  File "extract_attention.py", line 134, in main
    feature_dicts_with_attn, tokenizer, args.cased)
  File "/home/manasi/Documents/BERT/attention-analysis-master/bpe_utils.py", line 74, in make_attn_word_level
    words_to_tokens = tokenize_and_align(tokenizer, features["words"], cased)
KeyError: 'words'

Please help me fix it.

opened by ManasiPat 2

File missing

First off, I want to say this is a great insight into the BERT model. However, a file seems to be missing from General Analysis: create_wiki_data.py, which is said to generate one's own wiki data. It would be quite helpful if you could provide it.

opened by imr555 2

An explanation for head_distances.py

Thank you for releasing the code.

An explanation for head_distances.py is missing in the README. Could you add the explanation?

A small question: in head_distances.py, should the line utils.write_pickle(js_distances, args.outfile) be outside the for loop "for i, doc in enumerate(data):"?

opened by tomohideshibata 2

While extracting attention weights, are segment_ids set to all zeros?

The code in this line sets the segment ids to zeros for both segments. Is this a bug?

https://github.com/clarkkev/attention-analysis/blob/7b4ed20b2c58a211970ffc19c2d957b2c35ea0ea/extract_attention.py#L30

opened by HuXiangkun 1

A question about preprocess_depparse.py

In the README, I can see "Each line in the files should contain a word followed by a space followed by <index_of_head>-<dependency_label> (e.g., 0-root)."

So how can I get the index_of_head, and what does it mean?

Should I first know which of BERT's heads (12*12) corresponds to which syntactic function, and then decide based on that? Sorry for my poor English.

opened by fudanchenjiahao 1

solve error if path not contain directory

Solve error if path does not contain a directory

Traceback (most recent call last):
  File "extract_attention.py", line 144, in <module>
    main()
  File "extract_attention.py", line 139, in main
    utils.write_pickle(feature_dicts_with_attn, outpath)
  File "/Users/smap10/Project/attention-analysis-master/utils.py", line 31, in write_pickle
    pickle.dump(o, f, -1)
  File "/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 106, in write
    self._prewrite_check()
  File "/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 92, in _prewrite_check
    compat.as_bytes(self.__name), compat.as_bytes(self.__mode))
tensorflow.python.framework.errors_impl.FailedPreconditionError: samples_10_no_text_attn.pkl; Is a directory

opened by BrambleXu 0
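The "Is a directory" failure above is consistent with the parent directory being derived by splitting the output path on "/", which is an assumption about the cause rather than something confirmed here. For a bare file name that split returns the file name itself, so a directory with the file's name can be created before the file is opened. A sketch of a guard using only the standard library (the repository itself writes through TensorFlow's file API) is shown below; write_pickle_safely is a hypothetical name.

```python
import os
import pickle
import tempfile


def write_pickle_safely(obj, path):
    """Write a pickle, creating the parent directory only if the path has one."""
    parent = os.path.dirname(path)
    if parent:  # empty string when path is just a file name
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f, -1)


# Splitting a bare file name on "/" returns the file name itself.
bare_name = "samples_10_no_text_attn.pkl"
split_result = bare_name.rsplit("/", 1)[0]

# Demo inside a temporary directory with a nested output path.
demo_path = os.path.join(tempfile.mkdtemp(), "out", "samples_attn.pkl")
write_pickle_safely({"attns": [1.0]}, demo_path)
```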

Interpretation of figures

Could you please clarify what each number (8, 10) stands for in "Head 8-10" in Figure 5 of your article, i.e., which one is the layer and which one is the head number? Thank you very much for your clarification.

opened by JiahuiSophieHU 0

Fixed bugs and sync script params with README.md text

Using unique _ separator in preprocess_unlabeled.py (sync with README.md);

Providing --num_docs arg at README.md;

Fixed bug with json writing when path contains only filename.
opened by nicolay-r 0

Typo in instructions

The instructions say:

We include two pre-processing scripts for going from a raw data file to JSON that can be supplied to attention_extractor.py

I think what is meant is the extract_attention.py script in the repo?

opened by bustrofedico 0

Bug in adding dummy word_repr for root

The dummy should have been inserted at the 0th index of the concatenated dimension; currently it is added at the last index. Since the attention approximated for ROOT by adding start/end tokens is at the 0th index, the word representation for ROOT should also be at index 0.

By fixing this bug I got around 3% higher UAS.

word_reprs = tf.concat([word_reprs, tf.zeros((n_words, 1, 200))], 1) # dummy for ROOT

Should be replaced with,

word_reprs = tf.concat([tf.zeros((n_words, 1, 200)), word_reprs], 1) # dummy for ROOT

opened by keyurfaldu 0
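The difference between the two lines can be seen in a small NumPy analogue of the tf.concat calls. The array sizes below are assumptions chosen for illustration (only the trailing dimension of 200 and the axis come from the snippet above), and the sketch says nothing about the reported UAS change.

```python
import numpy as np

n_words, n_dims = 3, 200
# ASSUMPTION: the middle dimension equals n_words; only its position matters here.
word_reprs = np.ones((n_words, n_words, n_dims))
dummy_root = np.zeros((n_words, 1, n_dims))

# Original line: the ROOT dummy ends up at the last index of axis 1.
appended = np.concatenate([word_reprs, dummy_root], axis=1)

# Proposed line: the ROOT dummy ends up at index 0 of axis 1.
prepended = np.concatenate([dummy_root, word_reprs], axis=1)
```

Both results have the same shape, so the misplacement would not raise an error and would only show up in the evaluation numbers.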

Coreference Resolution

Any plans of releasing the code for coreference analysis in the paper?

Alternatively, is it possible to explain the methodology? Mainly how the "head" word is chosen, and what the exact set of antecedents is.

opened by ameet-1997 0
