When BERT Plays the Lottery, All Tickets Are Winning
Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
Environment
Install the requirements in a Python 3.7.7 virtual environment:
pip install -r requirements.txt
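For example, a minimal setup using the standard venv module (assuming a python3.7 interpreter is on your PATH; the .venv name is arbitrary):
python3.7 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt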
These experiments were run in a multi-GPU environment, where some experiments and benchmarks ran in parallel, so you may need to adapt the bash scripts to your environment.
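For instance, on a single-GPU machine, one option is to pin each script to a single device via the standard CUDA_VISIBLE_DEVICES variable (exactly which script changes are needed beyond this is an assumption; inspect the scripts first):
CUDA_VISIBLE_DEVICES=0 ./train.sh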
Dataset
- Download the GLUE dataset using data/download_glue.py and data/download_mnli_data.py (see the example invocation after this list). Follow the instructions in the data/download_glue.py docstring for MRPC.
- All data for the tasks should be organized in the data/glue/task_name/ structure.
- Extract the labelled data for attention pattern classification:
cd data
tar -xvf head_classification_data.tar.gz
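A possible download sequence, run from the repository root (whether the scripts take arguments is an assumption; check their docstrings first):
python data/download_glue.py
python data/download_mnli_data.py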
Training, Masking, and Evaluation
Switch the working directory to src (cd src), as many paths are relative to that directory.
- Fine-tune BERT on the GLUE tasks
./train.sh
- Obtain the masks
./find_masks.sh
- Train models with the masks applied in the good, random, and bad settings.
./train_with_masks.sh
- Evaluate the trained models
./evaluate.sh
Note: These experiments were run over a period of time and have since been stitched together into single scripts, so it might be better to run the training and evaluation commands in them one by one.
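Putting the steps above together, the full sequence looks like this (the comments are shorthand for the steps above, not script output):
cd src
./train.sh             # fine-tune BERT on the GLUE tasks
./find_masks.sh        # obtain the masks
./train_with_masks.sh  # retrain with good, random, and bad masks applied
./evaluate.sh          # evaluate the trained models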
- Train the CNN classifier on attention patterns, both raw and normed:
python classify_attention_patterns.py
python classify_normed_patterns.py
These scripts only train the classifiers.
Evaluation Analysis and Final Results
These are primarily done in Jupyter notebooks in the experiment_analysis directory. There are many experimental notebooks there; the important ones used to generate the results included in the paper are:
- Importance pruning Heatmaps. Ignore the final "train_subset" and "hans" settings.
- Magnitude pruning Heatmap
- Overlap of surviving components
- Generate the random baseline
- Attention Classification Patterns
- Evaluation Result Comparisons and table
- Statistics on mask correlation across seeds
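To browse and re-run these notebooks, launch Jupyter from that directory (assuming Jupyter is installed in your environment):
cd experiment_analysis
jupyter notebook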