Block Sparse movement pruning

Overview

Movement Pruning: Adaptive Sparsity by Fine-Tuning

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:

| Fine-pruning+Distillation (Teacher=BERT-base fine-tuned) | BERT base fine-tuned | Remaining Weights (%) | Magnitude Pruning | L0 Regularization | Movement Pruning | Soft Movement Pruning |
| --- | --- | --- | --- | --- | --- | --- |
| SQuAD - Dev (EM/F1) | 80.4/88.1 | 10% | 70.2/80.1 | 72.4/81.9 | 75.6/84.3 | 76.6/84.9 |
| | | 3% | 45.5/59.6 | 64.3/75.8 | 67.5/78.0 | 72.7/82.3 |
| MNLI - Dev (acc/MM acc) | 84.5/84.9 | 10% | 78.3/79.3 | 78.7/79.7 | 80.1/80.4 | 81.2/81.8 |
| | | 3% | 69.4/70.6 | 76.0/76.2 | 76.5/77.4 | 79.5/80.1 |
| QQP - Dev (acc/F1) | 91.4/88.4 | 10% | 79.8/65.0 | 88.1/82.8 | 89.7/86.2 | 90.2/86.8 |
| | | 3% | 72.4/57.8 | 87.0/81.9 | 86.1/81.5 | 89.1/85.5 |

This page contains information on how to fine-prune pre-trained models such as BERT to obtain extremely sparse models with movement pruning. In contrast to magnitude pruning, which selects weights that are far from 0, movement pruning retains weights that are moving away from 0 during fine-tuning.
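To make the contrast concrete, here is a minimal toy sketch. It assumes the accumulated form of the movement score, S_i = -sum_t grad_i(t) * w_i(t), and uses hand-picked weights and gradients; the function names and numbers are illustrative and are not taken from the repository.

```python
# Toy comparison of magnitude scores and movement scores (illustrative only).

def magnitude_scores(weights):
    """Magnitude pruning scores each weight by its absolute value."""
    return [abs(w) for w in weights]


def movement_scores(weight_history, grad_history, score_lr=1.0):
    """Accumulate S_i = -score_lr * sum_t grad_i(t) * w_i(t).

    A weight that the optimizer pushes away from zero has grad * w < 0,
    so its score grows over training.
    """
    scores = [0.0] * len(weight_history[0])
    for weights, grads in zip(weight_history, grad_history):
        for i, (w, g) in enumerate(zip(weights, grads)):
            scores[i] -= score_lr * g * w
    return scores


def best_index(scores):
    """Index of the single weight that would survive if only one is kept."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Two weights followed over three SGD steps (learning rate 0.1):
# weight 0 is large but shrinking toward 0, weight 1 is small but growing.
weight_history = [[0.90, 0.10], [0.80, 0.20], [0.70, 0.30]]
grad_history = [[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]]
final_weights = [0.60, 0.40]

print(best_index(magnitude_scores(final_weights)))               # magnitude keeps weight 0
print(best_index(movement_scores(weight_history, grad_history)))  # movement keeps weight 1
```

In this toy run the two criteria disagree: magnitude pruning keeps the larger weight even though it is shrinking, while the movement score favors the smaller weight that keeps growing.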

For more information, we invite you to check out our paper. You can also have a look at this fun Explain Like I'm Five introductory slide deck.

Extreme sparsity and efficient storage

One promise of extreme pruning is to obtain extremely small models that can be easily sent to (and stored on) edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decrease the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.

In this notebook, we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder from 340MB (the original dense BERT) to 11MB, without any additional training of the model (every operation is performed post fine-pruning). It is sufficiently small to store it on a 91' floppy disk 📎 !
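As a rough back-of-the-envelope illustration of why sparsity helps storage, the sketch below compares dense storage with a compressed sparse row (CSR) layout for a single matrix. The 768 x 768 shape, the byte widths, and the helper names are assumptions chosen for illustration; this is not necessarily the scheme used in the notebook.

```python
def dense_bytes(rows, cols, value_bytes=4):
    """Dense storage: one value per entry."""
    return rows * cols * value_bytes


def csr_bytes(rows, nnz, value_bytes=4, col_index_bytes=4, row_ptr_bytes=4):
    """CSR storage: one value and one column index per non-zero,
    plus rows + 1 row pointers."""
    return nnz * (value_bytes + col_index_bytes) + (rows + 1) * row_ptr_bytes


rows, cols = 768, 768   # hypothetical square projection matrix
density = 0.06          # 6% of the weights remain
nnz = int(rows * cols * density)

print(dense_bytes(rows, cols))   # float32 dense storage, in bytes
print(csr_bytes(rows, nnz))      # float32 values + int32 indices
# Narrower types shrink the footprint further (int8 values, int16 column indices).
print(csr_bytes(rows, nnz, value_bytes=1, col_index_bytes=2))
```

Note that CSR pays for an index alongside every stored value, so the saving is smaller than the raw sparsity ratio suggests unless the index and value types are narrowed as well.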

While movement pruning does not directly optimize for memory footprint (but rather for the number of non-null weights), we hypothesize that further memory compression can be achieved with specific quantization-aware training (see for instance Q8BERT, And the Bit Goes Down or Quant-Noise).

Fine-pruned models

As examples, we release two English PruneBERT checkpoints (models fine-pruned from a pre-trained BERT checkpoint), one on SQuAD and the other on MNLI.

  • prunebert-base-uncased-6-finepruned-w-distil-squad
    Pre-trained BERT-base-uncased fine-pruned with soft movement pruning on SQuAD v1.1. We use an additional distillation signal from BERT-base-uncased finetuned on SQuAD. The encoder counts 6% of total non-null weights and reaches 83.8 F1 score. The model can be accessed with: pruned_bert = BertForQuestionAnswering.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad")
  • prunebert-base-uncased-6-finepruned-w-distil-mnli
    Pre-trained BERT-base-uncased fine-pruned with soft movement pruning on MNLI. We use an additional distillation signal from BERT-base-uncased finetuned on MNLI. The encoder counts 6% of total non-null weights and reaches 80.7 (matched) accuracy. The model can be accessed with: pruned_bert = BertForSequenceClassification.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")

How to fine-prune?

Setup

The code relies on the 🤗 Transformers library. In addition to the dependencies listed in the examples folder, you should install a few additional dependencies listed in the requirements.txt file: pip install -r requirements.txt.

Note that we built our experiments on top of a stabilized version of the library (commit https://github.com/huggingface/transformers/commit/352d5472b0c1dec0f420d606d16747d851b4bda8): we do not guarantee that everything is still compatible with the latest version of the master branch.

Fine-pruning with movement pruning

Below, we detail how to reproduce the results reported in the paper. We use SQuAD as a running example. Commands (and scripts) can be easily adapted for other tasks.

The following command fine-prunes a pre-trained BERT-base on SQuAD using movement pruning towards 15% of remaining weights (85% sparsity). Note that we freeze all the embeddings modules (from their pre-trained value) and only prune the Fully Connected layers in the encoder (12 layers of Transformer Block).

SERIALIZATION_DIR=<OUTPUT_DIR>
SQUAD_DATA=squad_data

mkdir $SQUAD_DATA
cd $SQUAD_DATA
wget -q https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget -q https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
cd ..


python examples/movement-pruning/masked_run_squad.py \
    --output_dir $SERIALIZATION_DIR \
    --data_dir $SQUAD_DATA \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
    --model_name_or_path bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
    --initial_threshold 1 --final_threshold 0.15 \
    --initial_warmup 1 --final_warmup 2 \
    --pruning_method topK --mask_init constant --mask_scale 0.
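The threshold flags above describe a schedule that moves from --initial_threshold to --final_threshold over training. A minimal sketch of such a schedule follows, assuming a cubic decay and assuming that --initial_warmup and --final_warmup are counted in multiples of --warmup_steps; both assumptions, and the total step count in the usage lines, are illustrative. Check masked_run_squad.py for the exact rule.

```python
def threshold_schedule(step, total_steps, warmup_steps,
                       initial_threshold, final_threshold,
                       initial_warmup, final_warmup):
    """Cubic decay from initial_threshold to final_threshold.

    Assumption: initial_warmup and final_warmup are multiples of warmup_steps,
    and the threshold is held constant during both phases.
    """
    hold_start = initial_warmup * warmup_steps
    hold_end = final_warmup * warmup_steps
    if step <= hold_start:
        return initial_threshold
    if step > total_steps - hold_end:
        return final_threshold
    progress = (step - hold_start) / (total_steps - hold_start - hold_end)
    return final_threshold + (initial_threshold - final_threshold) * (1 - progress) ** 3


# Hypothetical run of 54,000 optimization steps with the flag values used above.
settings = dict(total_steps=54000, warmup_steps=5400,
                initial_threshold=1.0, final_threshold=0.15,
                initial_warmup=1, final_warmup=2)
for step in (0, 5400, 24300, 43200, 54000):
    print(step, threshold_schedule(step, **settings))
```

With topK pruning the threshold is the fraction of weights kept, so under these assumptions the model stays dense at first, is pruned quickly early on, and then spends the last steps at the target of 15% remaining weights.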

Fine-pruning with other methods

We can also explore other fine-pruning methods by changing the pruning_method parameter:

Soft movement pruning

python examples/movement-pruning/masked_run_squad.py \
    --output_dir $SERIALIZATION_DIR \
    --data_dir $SQUAD_DATA \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
    --model_name_or_path bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
    --initial_threshold 0 --final_threshold 0.1 \
    --initial_warmup 1 --final_warmup 2 \
    --pruning_method sigmoied_threshold --mask_init constant --mask_scale 0. \
    --regularization l1 --final_lambda 400.
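The practical difference between the topK and the soft (sigmoid threshold) variants is how the binary mask is derived from the learned scores. The sketch below is a simplified illustration with made-up scores; the function names are hypothetical and the repository's implementation may handle edge cases differently.

```python
import math


def top_k_mask(scores, keep_fraction):
    """topK: keep a fixed fraction of the entries with the highest scores."""
    n_keep = round(keep_fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


def sigmoid_threshold_mask(scores, threshold):
    """Soft variant: keep every entry whose sigmoid(score) exceeds the threshold."""
    return [1 if 1.0 / (1.0 + math.exp(-s)) > threshold else 0 for s in scores]


scores = [2.0, -3.0, 0.5, -1.0, -4.0, 1.0, -2.5, -0.5]
print(top_k_mask(scores, 0.25))             # always 2 of the 8 entries
print(sigmoid_threshold_mask(scores, 0.5))  # however many entries pass the test
```

Because the soft variant does not fix the number of kept weights in advance, the final sparsity depends on the regularization strength rather than on the threshold alone.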

L0 regularization

python examples/movement-pruning/masked_run_squad.py \
    --output_dir $SERIALIZATION_DIR \
    --data_dir $SQUAD_DATA \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
    --model_name_or_path bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
    --learning_rate 3e-5 --mask_scores_learning_rate 1e-1 \
    --initial_threshold 1. --final_threshold 1. \
    --initial_warmup 1 --final_warmup 1 \
    --pruning_method l0 --mask_init constant --mask_scale 2.197 \
    --regularization l0 --final_lambda 125.
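The value 2.197 passed to --mask_scale is numerically the logit of 0.9, that is ln(0.9 / 0.1). One plausible reading, which is an assumption and not something stated here, is that every gate starts with a probability of roughly 0.9 of being open. The arithmetic can be checked as follows:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


mask_scale = 2.197
print(sigmoid(mask_scale))    # close to 0.9
print(math.log(0.9 / 0.1))    # close to 2.197
```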

Iterative Magnitude Pruning

python examples/movement-pruning/masked_run_squad.py \
    --output_dir $SERIALIZATION_DIR \
    --data_dir $SQUAD_DATA \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
    --model_name_or_path bert-base-uncased \
    --per_gpu_train_batch_size 16 \
    --warmup_steps 5400 \
    --num_train_epochs 10 \
    --learning_rate 3e-5 \
    --initial_threshold 1 --final_threshold 0.15 \
    --initial_warmup 1 --final_warmup 2 \
    --pruning_method magnitude

After fine-pruning

Counting parameters

Regularization-based pruning methods (soft movement pruning and L0 regularization) rely on a penalty to induce sparsity. The multiplicative coefficient of that penalty (--final_lambda in the commands above) controls the sparsity level. To obtain the effective sparsity level in the encoder, we simply count the number of activated (non-null) weights:

python examples/movement-pruning/counts_parameters.py \
    --pruning_method sigmoied_threshold \
    --threshold 0.1 \
    --serialization_dir $SERIALIZATION_DIR
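Conceptually, the count reduces to applying the pruning rule to the learned scores of every pruned matrix and reporting the share that survives. A simplified sketch, assuming the sigmoid threshold rule and using hypothetical layer names and scores:

```python
import math


def remaining_weights_percent(mask_scores_by_layer, threshold):
    """Percentage of weights whose sigmoid(score) exceeds the threshold."""
    kept = 0
    total = 0
    for scores in mask_scores_by_layer.values():
        for row in scores:
            for s in row:
                total += 1
                if 1.0 / (1.0 + math.exp(-s)) > threshold:
                    kept += 1
    return 100.0 * kept / total


# Hypothetical learned scores for two tiny "layers".
mask_scores_by_layer = {
    "layer.0.dense": [[3.0, -5.0], [-4.0, -6.0]],
    "layer.1.dense": [[-7.0, -3.0], [1.0, -8.0]],
}
print(remaining_weights_percent(mask_scores_by_layer, threshold=0.1))
```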

Pruning once and for all

Once the model has been fine-pruned, the pruned weights can be set to 0. once and for all (reducing the amount of information to store). In our running example, we can convert a MaskedBertForQuestionAnswering (a BERT model augmented to enable on-the-fly pruning capabilities) to a standard BertForQuestionAnswering:

python examples/movement-pruning/bertarize.py \
    --pruning_method sigmoied_threshold \
    --threshold 0.1 \
    --model_name_or_path $SERIALIZATION_DIR
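The idea behind this conversion can be sketched on a single matrix: apply the mask to the weights one last time, keep the result as ordinary weights, and discard the scores. The helper below is a hypothetical illustration under the sigmoid threshold rule, not the script's actual code.

```python
import math


def apply_mask_once(weights, mask_scores, threshold):
    """Return a plain weight matrix with pruned entries set to 0.0.

    After this step the mask scores are no longer needed and can be dropped.
    """
    pruned = []
    for w_row, s_row in zip(weights, mask_scores):
        pruned.append([
            w if 1.0 / (1.0 + math.exp(-s)) > threshold else 0.0
            for w, s in zip(w_row, s_row)
        ])
    return pruned


weights = [[0.42, -0.13], [0.07, 0.88]]
mask_scores = [[3.0, -5.0], [-4.0, 1.0]]
print(apply_mask_once(weights, mask_scores, threshold=0.1))
```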

Hyper-parameters

For reproducibility purposes, we share the detailed results presented in the paper. These tables exhaustively describe the individual hyper-parameters used for each data point.

Inference speed

Early experiments show that even though models fine-pruned with (soft) movement pruning are extremely sparse, they do not run significantly faster at inference time with standard PyTorch inference. We are currently benchmarking and exploring inference setups specifically for sparse architectures. In particular, hardware manufacturers are announcing devices that will speed up inference for sparse networks considerably.

Citation

If you find this resource useful, please consider citing the following paper:

@article{sanh2020movement,
    title={Movement Pruning: Adaptive Sparsity by Fine-Tuning},
    author={Victor Sanh and Thomas Wolf and Alexander M. Rush},
    year={2020},
    eprint={2005.07683},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Comments
  • TypeError: forward() got an unexpected keyword argument 'threshold'

    It does not recognise "threshold" when I try to assign "threshold" as input to the transformer. Below is the output I am getting.

    W0529 17:55:39.823194 140241775109952 masked_run_glue.py:838] Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
    /home/charles/anaconda3/envs/bertprune/lib/python3.6/site-packages/transformers/data/processors/glue.py:284: FutureWarning: This processor will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py
      warnings.warn(DEPRECATION_WARNING.format("processor"), FutureWarning)
    Some weights of the model checkpoint at bert-base-uncased were not used when initializing MaskedBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
    - This IS expected if you are initializing MaskedBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing MaskedBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of MaskedBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.attention.self.query.mask_scores', 'bert.encoder.layer.0.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.0.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.0.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.0.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.attention.self.key.mask_scores', 'bert.encoder.layer.0.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.0.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.0.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.0.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.attention.self.value.mask_scores', 'bert.encoder.layer.0.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.0.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.0.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.0.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.attention.output.dense.mask_scores', 'bert.encoder.layer.0.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.0.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.0.attention.output.dense.shuffler.out_mapping', 
'bert.encoder.layer.0.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.intermediate.dense.mask_scores', 'bert.encoder.layer.0.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.0.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.0.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.0.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.0.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.0.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.output.dense.mask_scores', 'bert.encoder.layer.0.output.dense.ampere_permut_scores', 'bert.encoder.layer.0.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.0.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.0.output.dense.shuffler.in_mapping', 'bert.encoder.layer.0.output.dense.shuffler.out_mapping', 'bert.encoder.layer.0.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.self.query.mask_scores', 'bert.encoder.layer.1.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.1.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.1.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.1.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.self.key.mask_scores', 'bert.encoder.layer.1.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.1.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.1.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.1.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.self.value.mask_scores', 
'bert.encoder.layer.1.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.1.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.1.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.1.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.output.dense.mask_scores', 'bert.encoder.layer.1.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.1.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.1.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.1.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.intermediate.dense.mask_scores', 'bert.encoder.layer.1.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.1.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.1.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.1.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.1.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.1.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.output.dense.mask_scores', 'bert.encoder.layer.1.output.dense.ampere_permut_scores', 'bert.encoder.layer.1.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.1.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.1.output.dense.shuffler.in_mapping', 'bert.encoder.layer.1.output.dense.shuffler.out_mapping', 'bert.encoder.layer.1.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.self.query.mask_scores', 'bert.encoder.layer.2.attention.self.query.ampere_permut_scores', 
'bert.encoder.layer.2.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.2.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.2.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.self.key.mask_scores', 'bert.encoder.layer.2.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.2.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.2.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.2.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.self.value.mask_scores', 'bert.encoder.layer.2.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.2.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.2.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.2.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.output.dense.mask_scores', 'bert.encoder.layer.2.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.2.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.2.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.2.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.intermediate.dense.mask_scores', 'bert.encoder.layer.2.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.2.intermediate.dense.shuffler.in_permutation_scores', 
'bert.encoder.layer.2.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.2.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.2.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.2.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.output.dense.mask_scores', 'bert.encoder.layer.2.output.dense.ampere_permut_scores', 'bert.encoder.layer.2.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.2.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.2.output.dense.shuffler.in_mapping', 'bert.encoder.layer.2.output.dense.shuffler.out_mapping', 'bert.encoder.layer.2.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.self.query.mask_scores', 'bert.encoder.layer.3.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.3.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.3.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.3.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.self.key.mask_scores', 'bert.encoder.layer.3.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.3.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.3.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.3.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.self.value.mask_scores', 'bert.encoder.layer.3.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.3.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.self.value.shuffler.in_mapping', 
'bert.encoder.layer.3.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.3.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.output.dense.mask_scores', 'bert.encoder.layer.3.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.3.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.3.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.3.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.intermediate.dense.mask_scores', 'bert.encoder.layer.3.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.3.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.3.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.3.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.3.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.3.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.output.dense.mask_scores', 'bert.encoder.layer.3.output.dense.ampere_permut_scores', 'bert.encoder.layer.3.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.3.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.3.output.dense.shuffler.in_mapping', 'bert.encoder.layer.3.output.dense.shuffler.out_mapping', 'bert.encoder.layer.3.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.self.query.mask_scores', 'bert.encoder.layer.4.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.4.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.4.attention.self.query.shuffler.out_mapping', 
'bert.encoder.layer.4.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.self.key.mask_scores', 'bert.encoder.layer.4.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.4.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.4.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.4.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.self.value.mask_scores', 'bert.encoder.layer.4.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.4.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.4.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.4.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.output.dense.mask_scores', 'bert.encoder.layer.4.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.4.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.4.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.4.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.intermediate.dense.mask_scores', 'bert.encoder.layer.4.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.4.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.4.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.4.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.4.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.4.intermediate.dense.shuffler.out_mapping_reverse', 
'bert.encoder.layer.4.output.dense.mask_scores', 'bert.encoder.layer.4.output.dense.ampere_permut_scores', 'bert.encoder.layer.4.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.4.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.4.output.dense.shuffler.in_mapping', 'bert.encoder.layer.4.output.dense.shuffler.out_mapping', 'bert.encoder.layer.4.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.self.query.mask_scores', 'bert.encoder.layer.5.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.5.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.5.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.5.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.self.key.mask_scores', 'bert.encoder.layer.5.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.5.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.5.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.5.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.self.value.mask_scores', 'bert.encoder.layer.5.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.5.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.5.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.5.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.output.dense.mask_scores', 'bert.encoder.layer.5.attention.output.dense.ampere_permut_scores', 
'bert.encoder.layer.5.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.5.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.5.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.intermediate.dense.mask_scores', 'bert.encoder.layer.5.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.5.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.5.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.5.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.5.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.5.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.output.dense.mask_scores', 'bert.encoder.layer.5.output.dense.ampere_permut_scores', 'bert.encoder.layer.5.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.5.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.5.output.dense.shuffler.in_mapping', 'bert.encoder.layer.5.output.dense.shuffler.out_mapping', 'bert.encoder.layer.5.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.self.query.mask_scores', 'bert.encoder.layer.6.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.6.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.6.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.6.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.6.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.self.key.mask_scores', 'bert.encoder.layer.6.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.6.attention.self.key.shuffler.in_permutation_scores', 
'bert.encoder.layer.6.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.6.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.6.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.self.value.mask_scores', 'bert.encoder.layer.6.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.6.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.6.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.6.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.6.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.output.dense.mask_scores', 'bert.encoder.layer.6.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.6.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.6.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.6.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.6.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.intermediate.dense.mask_scores', 'bert.encoder.layer.6.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.6.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.6.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.6.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.6.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.6.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.output.dense.mask_scores', 'bert.encoder.layer.6.output.dense.ampere_permut_scores', 'bert.encoder.layer.6.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.6.output.dense.shuffler.out_permutation_scores', 
'bert.encoder.layer.6.output.dense.shuffler.in_mapping', 'bert.encoder.layer.6.output.dense.shuffler.out_mapping', 'bert.encoder.layer.6.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.self.query.mask_scores', 'bert.encoder.layer.7.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.7.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.7.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.7.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.self.key.mask_scores', 'bert.encoder.layer.7.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.7.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.7.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.7.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.self.value.mask_scores', 'bert.encoder.layer.7.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.7.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.7.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.7.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.output.dense.mask_scores', 'bert.encoder.layer.7.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.7.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.output.dense.shuffler.in_mapping', 
'bert.encoder.layer.7.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.7.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.intermediate.dense.mask_scores', 'bert.encoder.layer.7.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.7.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.7.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.7.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.7.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.7.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.output.dense.mask_scores', 'bert.encoder.layer.7.output.dense.ampere_permut_scores', 'bert.encoder.layer.7.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.7.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.7.output.dense.shuffler.in_mapping', 'bert.encoder.layer.7.output.dense.shuffler.out_mapping', 'bert.encoder.layer.7.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.attention.self.query.mask_scores', 'bert.encoder.layer.8.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.8.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.8.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.8.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.attention.self.key.mask_scores', 'bert.encoder.layer.8.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.8.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.8.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.8.attention.self.key.shuffler.out_mapping_reverse', 
'bert.encoder.layer.8.attention.self.value.mask_scores', 'bert.encoder.layer.8.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.8.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.8.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.8.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.attention.output.dense.mask_scores', 'bert.encoder.layer.8.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.8.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.8.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.8.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.intermediate.dense.mask_scores', 'bert.encoder.layer.8.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.8.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.8.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.8.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.8.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.8.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.output.dense.mask_scores', 'bert.encoder.layer.8.output.dense.ampere_permut_scores', 'bert.encoder.layer.8.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.8.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.8.output.dense.shuffler.in_mapping', 'bert.encoder.layer.8.output.dense.shuffler.out_mapping', 'bert.encoder.layer.8.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.self.query.mask_scores', 'bert.encoder.layer.9.attention.self.query.ampere_permut_scores', 
'bert.encoder.layer.9.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.9.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.9.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.self.key.mask_scores', 'bert.encoder.layer.9.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.9.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.9.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.9.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.self.value.mask_scores', 'bert.encoder.layer.9.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.9.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.9.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.9.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.output.dense.mask_scores', 'bert.encoder.layer.9.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.9.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.9.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.9.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.intermediate.dense.mask_scores', 'bert.encoder.layer.9.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.9.intermediate.dense.shuffler.in_permutation_scores', 
'bert.encoder.layer.9.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.9.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.9.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.9.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.output.dense.mask_scores', 'bert.encoder.layer.9.output.dense.ampere_permut_scores', 'bert.encoder.layer.9.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.9.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.9.output.dense.shuffler.in_mapping', 'bert.encoder.layer.9.output.dense.shuffler.out_mapping', 'bert.encoder.layer.9.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.self.query.mask_scores', 'bert.encoder.layer.10.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.10.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.10.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.10.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.10.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.self.key.mask_scores', 'bert.encoder.layer.10.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.10.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.10.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.10.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.10.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.self.value.mask_scores', 'bert.encoder.layer.10.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.10.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.self.value.shuffler.out_permutation_scores', 
'bert.encoder.layer.10.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.10.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.10.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.output.dense.mask_scores', 'bert.encoder.layer.10.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.10.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.10.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.10.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.10.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.intermediate.dense.mask_scores', 'bert.encoder.layer.10.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.10.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.10.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.10.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.10.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.10.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.output.dense.mask_scores', 'bert.encoder.layer.10.output.dense.ampere_permut_scores', 'bert.encoder.layer.10.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.10.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.10.output.dense.shuffler.in_mapping', 'bert.encoder.layer.10.output.dense.shuffler.out_mapping', 'bert.encoder.layer.10.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.self.query.mask_scores', 'bert.encoder.layer.11.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.11.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.self.query.shuffler.in_mapping', 
'bert.encoder.layer.11.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.11.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.self.key.mask_scores', 'bert.encoder.layer.11.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.11.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.11.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.11.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.self.value.mask_scores', 'bert.encoder.layer.11.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.11.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.11.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.11.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.output.dense.mask_scores', 'bert.encoder.layer.11.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.11.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.11.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.11.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.intermediate.dense.mask_scores', 'bert.encoder.layer.11.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.11.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.11.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.11.intermediate.dense.shuffler.in_mapping', 
'bert.encoder.layer.11.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.11.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.output.dense.mask_scores', 'bert.encoder.layer.11.output.dense.ampere_permut_scores', 'bert.encoder.layer.11.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.11.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.11.output.dense.shuffler.in_mapping', 'bert.encoder.layer.11.output.dense.shuffler.out_mapping', 'bert.encoder.layer.11.output.dense.shuffler.out_mapping_reverse', 'classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    I0529 17:55:44.691655 140241775109952 masked_run_glue.py:904] Training/evaluation parameters Namespace(adam_epsilon=1e-08, alpha_ce=0.5, alpha_distil=0.5, cache_dir='', config_name='', data_dir='../data/glue_data/CoLA', device=device(type='cuda'), do_eval=True, do_lower_case=True, do_train=True, eval_all_checkpoints=False, evaluate_during_training=True, final_lambda=0.0, final_threshold=0.15, final_warmup=2, fp16=False, fp16_opt_level='O1', global_topk=False, global_topk_frequency_compute=25, gradient_accumulation_steps=1, initial_threshold=1.0, initial_warmup=1, learning_rate=3e-05, local_rank=-1, logging_steps=50, mask_init='constant', mask_scale=0.0, mask_scores_learning_rate=0.01, max_grad_norm=1.0, max_seq_length=128, max_steps=-1, model_name_or_path='bert-base-uncased', model_type='masked_bert', n_gpu=1, no_cuda=False, num_train_epochs=5.0, output_dir='../outputs1/softmvp/bert-uncased-warmup-glue-cola', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=8, pruning_method='topK', regularization=None, save_steps=1000, seed=42, task_name='cola', teacher_name_or_path=None, teacher_type=None, temperature=2.0, tokenizer_name='', warmup_steps=5400, weight_decay=0.0)
    I0529 17:55:44.692138 140241775109952 masked_run_glue.py:529] Loading features from cached file ../data/glue_data/CoLA/cached_train_bert-base-uncased_128_cola
    I0529 17:55:44.834930 140241775109952 masked_run_glue.py:183] ***** Running training *****
    I0529 17:55:44.835000 140241775109952 masked_run_glue.py:184]   Num examples = 8551
    I0529 17:55:44.835042 140241775109952 masked_run_glue.py:185]   Num Epochs = 5
    I0529 17:55:44.835366 140241775109952 masked_run_glue.py:186]   Instantaneous batch size per GPU = 8
    I0529 17:55:44.835401 140241775109952 masked_run_glue.py:191]   Total train batch size (w. parallel, distributed & accumulation) = 8
    I0529 17:55:44.835433 140241775109952 masked_run_glue.py:193]   Gradient Accumulation steps = 1
    I0529 17:55:44.835463 140241775109952 masked_run_glue.py:194]   Total optimization steps = 5345
    Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
                           | 0/1069 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "masked_run_glue.py", line 956, in <module>
        main()
      File "masked_run_glue.py", line 909, in main
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
      File "masked_run_glue.py", line 275, in train
        outputs = model(**inputs)
      File "/home/charles/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
    TypeError: forward() got an unexpected keyword argument 'threshold'
    

    As far as I can tell, line 272 should set inputs["current_config"] rather than inputs["threshold"]. However, inputs["current_config"] has three keys: 'threshold', 'ampere_temperature', and 'shuffling_temperature', and I am not sure what values 'ampere_temperature' and 'shuffling_temperature' should take. masked_run_squad.py already provides them, but masked_run_glue.py does not, and the schedule_threshold() function in masked_run_glue.py differs from the one in masked_run_squad.py.
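
The mismatch described above can be checked without running a full training step. The following is a minimal sketch, not the repository's code: `unexpected_kwargs`, `build_current_config`, `forward_old` and `forward_new` are hypothetical names introduced only for illustration, and the two `forward_*` functions are stand-ins for the model's real `forward()`. Only the three key names come from the issue; the temperature values are left as placeholders because the issue does not say what they should be for GLUE.

```python
import inspect


def unexpected_kwargs(forward_fn, inputs):
    """Return the keys of `inputs` that `forward_fn` does not accept."""
    params = inspect.signature(forward_fn).parameters
    # A **kwargs parameter accepts any keyword, so nothing is unexpected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(k for k in inputs if k not in params)


def build_current_config(threshold, ampere_temperature, shuffling_temperature):
    """Bundle the three per-step values named in the issue into one dict.

    The caller supplies every value; nothing is defaulted here because the
    correct temperatures for GLUE are exactly what the issue asks about.
    """
    return {
        "threshold": threshold,
        "ampere_temperature": ampere_temperature,
        "shuffling_temperature": shuffling_temperature,
    }


# Hypothetical stand-ins for the two forward() signatures under discussion.
def forward_old(input_ids=None, attention_mask=None, labels=None, threshold=None):
    return "old"


def forward_new(input_ids=None, attention_mask=None, labels=None, current_config=None):
    return "new"


# Inputs as the GLUE script appears to build them, with a bare "threshold" key.
inputs = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]], "threshold": 1.0}
print(unexpected_kwargs(forward_new, inputs))  # ['threshold']

# Move the threshold into "current_config". None is a placeholder here,
# not a recommended value for either temperature.
fixed = {k: v for k, v in inputs.items() if k != "threshold"}
fixed["current_config"] = build_current_config(inputs["threshold"], None, None)
print(unexpected_kwargs(forward_new, fixed))  # []
```

Calling `unexpected_kwargs(model.forward, inputs)` just before `model(**inputs)` would name the offending key up front, which may be easier to read than the `TypeError` raised from inside `torch.nn.Module.__call__`.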

    opened by CharlesLeeeee 1
Owner
Hugging Face
Solving NLP, one commit at a time!