When BERT Plays the Lottery, All Tickets Are Winning
Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
Environment
Install the requirements in a Python 3.7.7 virtual environment:
pip install -r requirements.txt
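For example, a minimal setup using the standard venv module (assuming a python3.7 interpreter is on your PATH; the .venv name is arbitrary):
python3.7 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt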
These experiments were run in a multi-GPU environment, where some experiments and benchmarks ran in parallel, so you may need to adapt the bash scripts to your environment.
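For instance, on a single-GPU machine, one option is to pin each script to a single device via the standard CUDA_VISIBLE_DEVICES variable (exactly which script changes are needed beyond this is an assumption; inspect the scripts first):
CUDA_VISIBLE_DEVICES=0 ./train.sh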
Dataset
- Download the GLUE dataset using data/download_glue.py and data/download_mnli_data.py (see the example invocation after this list). Follow the instructions in the data/download_glue.py docstring for MRPC.
- All data for the tasks should be organized in the data/glue/task_name/ structure.
- Extract the labelled data for attention pattern classification:
cd data
tar -xvf head_classification_data.tar.gz
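A possible download sequence, run from the repository root (whether the scripts take arguments is an assumption; check their docstrings first):
python data/download_glue.py
python data/download_mnli_data.py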
Training, Masking, and Evaluation
Switch the working directory to src (cd src), as many paths are relative to that directory.
- Fine-tune BERT on the GLUE tasks
./train.sh
- Obtain the masks
./find_masks.sh
- Train models with the masks applied in the good, random, and bad settings.
./train_with_masks.sh
- Evaluate the trained models
./evaluate.sh
Note: These experiments were run over a period of time and have since been stitched together into single scripts, so it might be better to run the training and evaluation commands in them one by one.
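Putting the steps above together, the full sequence looks like this (the comments are shorthand for the steps above, not script output):
cd src
./train.sh             # fine-tune BERT on the GLUE tasks
./find_masks.sh        # obtain the masks
./train_with_masks.sh  # retrain with good, random, and bad masks applied
./evaluate.sh          # evaluate the trained models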
- Train the CNN classifier on attention patterns, both raw and normed:
python classify_attention_patterns.py
python classify_normed_patterns.py
These scripts only train the classifiers.
Evaluation Analysis and Final Results
These are primarily done in Jupyter notebooks in the experiment_analysis directory. There are many experimental notebooks there; the important ones used to generate the results included in the paper are:
- Importance pruning Heatmaps. Ignore the final "train_subset" and "hans" settings.
- Magnitude pruning Heatmap
- Overlap of surviving components
- Generate the random baseline
- Attention Classification Patterns
- Evaluation Result Comparisons and table
- Statistics on mask correlation across seeds
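To browse and re-run these notebooks, launch Jupyter from that directory (assuming Jupyter is installed in your environment):
cd experiment_analysis
jupyter notebook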