Code for the paper "Are Sixteen Heads Really Better than One?"

Overview

Are Sixteen Heads Really Better than One?

This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than One?.

Prerequisites

First, you will need Python >= 3.6 with PyTorch >= 1.0. Then, clone our forks of fairseq (for the MT experiments) and pytorch-pretrained-BERT (for the BERT experiments):

# Fairseq
git clone https://github.com/pmichel31415/fairseq
# Pytorch pretrained BERT
git clone https://github.com/pmichel31415/pytorch-pretrained-BERT
cd pytorch-pretrained-BERT
git checkout paul
cd ..

If you are running into issues with pytorch-pretrained-BERT (for instance because you have another version installed globally), check out this workaround (thanks @insop).

You will also need sacrebleu to evaluate BLEU scores (pip install sacrebleu).
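
If you run the scripts directly from this repository, both forks need to be importable. Below is a minimal setup sketch, assuming you want everything in a fresh virtual environment (the environment name is arbitrary):

# Create and activate an isolated environment
python3 -m venv heads-env
source heads-env/bin/activate
# Install the two forks in editable mode so local changes are picked up
pip install -e ./pytorch-pretrained-BERT
pip install -e ./fairseq
# sacrebleu for BLEU evaluation (as noted above)
pip install sacrebleu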

Ablation experiments

BERT

Running

bash experiments/BERT/heads_ablation.sh MNLI

will fine-tune a pretrained BERT on MNLI (stored in ./models/MNLI) and perform the individual head ablation experiment from Section 3.1 of the paper. Alternatively, you can run the experiment with CoLA, MRPC, or SST-2 as the task in place of MNLI.
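
The script takes the task name as its only argument, so a small sketch that runs the ablation for every task mentioned above (assuming the corresponding GLUE data is available where the script expects it) is:

# Run the individual head ablation for each supported task
for TASK in MNLI CoLA MRPC SST-2; do
    bash experiments/BERT/heads_ablation.sh $TASK
done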

MT

You can obtain the pretrained WMT model from the fairseq repo (the original download link has moved; see the pretrained models listed there). Use the Moses tokenizer and subword-nmt in conjunction with the BPE codes provided with the pretrained model to prepare any input file you want. Then run:

bash experiments/MT/wmt_ablation.sh $BPE_SEGMENTED_SRC_FILE $DETOKENIZED_REF_FILE
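
For reference, preparing the inputs might look like the sketch below. The file names (raw.en, ref.fr) and the BPE code path are placeholders; the exact code file comes from the pretrained model archive, and this assumes a mosesdecoder checkout and an installed subword-nmt:

# Tokenize the raw English source with the Moses tokenizer
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < raw.en > tok.en
# Apply the BPE codes shipped with the pretrained model
subword-nmt apply-bpe -c bpecodes < tok.en > src.bpe.en
# Keep the reference detokenized, since BLEU is computed with sacrebleu
bash experiments/MT/wmt_ablation.sh src.bpe.en ref.fr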

Systematic Pruning Experiments

BERT

To iteratively prune heads 10% at a time, in order of increasing importance, run:

bash experiments/BERT/heads_pruning.sh MNLI --normalize_pruning_by_layer

This will reuse the fine-tuned BERT model if you have already run the ablation experiment (otherwise it will fine-tune one for you). The output is very verbose, but you can get the gist of the results by running grep "strategy\|results" -A1 on it.
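
For example, one hypothetical way to capture and then summarize a run (the log file name is arbitrary):

# Keep the verbose output in a log file while watching it scroll by
bash experiments/BERT/heads_pruning.sh MNLI --normalize_pruning_by_layer 2>&1 | tee pruning_mnli.log
# Show only the pruning strategies and the corresponding results
grep "strategy\|results" -A1 pruning_mnli.log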

WMT

Similarly, just run:

bash experiments/MT/prune_wmt.sh $BPE_SEGMENTED_SRC_FILE $DETOKENIZED_REF_FILE

You might want to change the paths in the experiment files to point to the binarized fairseq dataset on which you want to estimate the importance scores.
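
If you still need to binarize your own data, a minimal sketch using fairseq's preprocessing script follows. The data prefixes, destination directory, and dictionary paths are placeholders; the dictionaries should be the ones shipped with the pretrained model so the vocabulary matches, and depending on your fairseq version the entry point may be the fairseq-preprocess command instead:

# Binarize a BPE-segmented parallel corpus with the pretrained model's dictionaries
python fairseq/preprocess.py --source-lang en --target-lang fr \
    --trainpref data/train.bpe --validpref data/valid.bpe --testpref data/test.bpe \
    --srcdict wmt14.en-fr.joined-dict.transformer/dict.en.txt \
    --tgtdict wmt14.en-fr.joined-dict.transformer/dict.fr.txt \
    --destdir data-bin/wmt14_en_fr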

Comments
  • Not able to prune the BERT model

    Hi, I am running the command

    bash experiments/BERT/heads_ablation.sh MNLI
    

    I am getting the following error

    Traceback (most recent call last):
      File "pytorch-pretrained-BERT/examples/run_classifier.py", line 578, in <module>
        main()
      File "pytorch-pretrained-BERT/examples/run_classifier.py", line 275, in main
        model.bert.mask_heads(to_prune)
      File "/home/pdguest/ishita/py-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 535, in __getattr__
        type(self).__name__, name))
    AttributeError: 'BertModel' object has no attribute 'mask_heads'
    
    need more info 
    opened by ishita1995 7
  • Is BERT finetuned after pruning?

Hi, I'm currently working on attention head pruning for models. I think that in your reported experiments you fine-tuned BERT while training on the downstream MNLI task, right? But does it also work to freeze the BERT representation after pruning and only train the downstream MNLI task? I appreciate your answer.

    opened by Huan80805 2
  • Not able to obtain pretrained WMT model

    Hello,

    I am trying to run the MT ablation experiments. When I ran the command

    wget https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.joined-dict.transformer.tar.bz2

    I get the following error

--2020-03-31 17:58:31--  https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.joined-dict.transformer.tar.bz2
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.207.205
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.207.205|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-03-31 17:58:31 ERROR 404: Not Found.

    opened by marwash25 2
  • Why do we need different normalization for all the layers compared to the last layer in BERT during importance score calculation?

    Hi,

I am trying to understand why we need different normalization factors for the last layer of BERT compared to all the other layers.

    https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L246 vs https://github.com/pmichel31415/pytorch-pretrained-BERT/blob/18a86a7035cf8a48d16c101a66e439bf6ab342f1/examples/classifier_eval.py#L247

    opened by Hritikbansal 1
  • a question about run_classifier.py

1. (1) I do this and get a pruned model: model.bert.prune_heads(to_prune). (2) Next, I set n_retrain_steps_after_pruning to a value greater than 0: aaa, then:

bbb

to retrain my pruned model. Is that OK?

2. I don't understand the difference between the method above and retrain_pruned_heads (the following method):

cccc

    THANK YOU !

    opened by Ixuanzhang 1
  • BERT actually_prune option not working

    Hi,

thanks for your code! The pruning works great using masking. However, when I tried to actually prune the model to see if there's a speedup, it fails.

    bash experiments/BERT/heads_pruning.sh SST-2 --actually_prune
    
    1. The new_layer in prune_linear_layer has to be moved to the correct device.
    new_layer.to(layer.weight.device)
    
2. The forward function fails because the input and output shapes of the previous layer do not seem to match:
    13:09:27-INFO: Evaluating following pruning strategy
    13:09:27-INFO: 9:3 10:10 11:3,7,8,9,10
    13:09:27-INFO: ***** Running evaluation *****
    13:09:27-INFO:   Num examples = 872
    13:09:27-INFO:   Batch size = 32
    Evaluating:   0%|                                                                | 0/28 [00:00<?, ?it/s]Traceback (most recent call last):
      File "pytorch-pretrained-BERT/examples/run_classifier.py", line 578, in <module>
        main()
      File "pytorch-pretrained-BERT/examples/run_classifier.py", line 514, in main
        scorer=processor.scorer,
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/examples/classifier_eval.py", line 78, in evaluate
        input_ids, segment_ids, input_mask, label_ids)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 1072, in forward
        output_all_encoded_layers=False, return_att=return_att)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 769, in forward
        output_all_encoded_layers=output_all_encoded_layers)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 458, in forward
        hidden_states, attn = layer_module(hidden_states, attention_mask)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 441, in forward
        attention_output, attn = self.attention(hidden_states, attention_mask)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 335, in forward
        self_output, attn = self.self(input_tensor, attention_mask)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 274, in forward
        mixed_query_layer = self.query(hidden_states)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 92, in forward
        return F.linear(input, self.weight, self.bias)
      File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/functional.py", line 1408, in linear
        output = input.matmul(weight.t())
    RuntimeError: size mismatch, m1: [4096 x 768], m2: [576 x 768] at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/generic/THCTensorMathBlas.cu:268
    
    opened by pglock 1
  • No code on master?

In the README, it asks us to check out a branch that is not on master. I don't think that branch has been uploaded, as the GitHub repo has no actual code in it.

    opened by aninrusimha 1
  • Is the code still able to run?

    Hi,

I am trying to reproduce your BERT results. I followed the Prerequisites:

    # Pytorch pretrained BERT
    git clone https://github.com/pmichel31415/pytorch-pretrained-BERT
    cd pytorch-pretrained-BERT
    git checkout paul
    cd ..
    
    # Install the pytorch-pretrained_BERT:
    cd pytorch-pretrained-BERT
    pip install .
    cd ..
    
    # Run the code:
    bash experiments/BERT/heads_ablation.sh MNLI
    

    But got this error:

    02:06:57-INFO: Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
    02:06:57-INFO: Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
    Traceback (most recent call last):
      File "pytorch-pretrained-BERT/examples/run_classifier.py", line 582, in <module>
        main()
      File "pytorch-pretrained-BERT/examples/run_classifier.py", line 275, in main
        model.bert.mask_heads(to_prune)
      File "/home/guest/anaconda3/envs/huggingface_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
        type(self).__name__, name))
    AttributeError: 'DataParallel' object has no attribute 'bert'
    
    
    1(standard_in) 2: syntax error
            0.00000(standard_in) 2: syntax error
            0.00000(standard_in) 2: syntax error
            0.00000(standard_in) 2: syntax error
            0.00000(standard_in) 2: syntax error
            0.00000(standard_in) 2: syntax error
    

    Any idea or suggestion?

    opened by bing0037 2
  • about the params: --raw-text and --transformer-mask-heads

Hi! @pmichel31415
1. In are-16-heads-really-better-than-1/experiments/MT/prune_wmt.sh you have --raw-text $EXTRA_OPTIONS, and I don't know what it means. Can you explain it and how to use it? Is it the original reference text or something?
2. I don't know how to use --transformer-mask-heads. Can you show me an example?

    opened by LiangQiqi677 0
  • RuntimeError: can't retain_grad on Tensor that has requires_grad=False

Sorry to bother you. I ran into a bug while running heads_pruning.sh, and the error is:

12:21:27-INFO: ***** Running evaluation *****
12:21:27-INFO:   Num examples = 9815
12:21:27-INFO:   Batch size = 32
Evaluating:   0% 0/307 [00:00<?, ?it/s]Traceback (most recent call last):
  File "pytorch-pretrained-BERT/examples/run_classifier.py", line 585, in <module>
    main()
  File "pytorch-pretrained-BERT/examples/run_classifier.py", line 521, in main
    scorer=processor.scorer,
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/examples/classifier_eval.py", line 78, in evaluate
    input_ids, segment_ids, input_mask, label_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 1072, in forward
    output_all_encoded_layers=False, return_att=return_att)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 769, in forward
    output_all_encoded_layers=output_all_encoded_layers)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 458, in forward
    hidden_states, attn = layer_module(hidden_states, attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 441, in forward
    attention_output, attn = self.attention(hidden_states, attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 335, in forward
    self_output, attn = self.self(input_tensor, attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/XAI in NLP/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 307, in forward
    self.context_layer_val.retain_grad()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 326, in retain_grad
    raise RuntimeError("can't retain_grad on Tensor that has requires_grad=False")
RuntimeError: can't retain_grad on Tensor that has requires_grad=False
Evaluating:   0% 0/307 [00:00<?, ?it/s]

    I don't know how to fix it. Hope you can help me!

    opened by YJiangcm 1