Block Sparse movement pruning

Fine-pruning+Distillation (Teacher=BERT-base fine-tuned)	BERT base fine-tuned	Remaining Weights (%)	Magnitude Pruning	L0 Regularization	Movement Pruning	Soft Movement Pruning
SQuAD - Dev EM/F1	80.4/88.1	10%	70.2/80.1	72.4/81.9	75.6/84.3	76.6/84.9
SQuAD - Dev EM/F1	80.4/88.1	3%	45.5/59.6	64.3/75.8	67.5/78.0	72.7/82.3
MNLI - Dev acc/MM acc	84.5/84.9	10%	78.3/79.3	78.7/79.7	80.1/80.4	81.2/81.8
MNLI - Dev acc/MM acc	84.5/84.9	3%	69.4/70.6	76.0/76.2	76.5/77.4	79.5/80.1
QQP - Dev acc/F1	91.4/88.4	10%	79.8/65.0	88.1/82.8	89.7/86.2	90.2/86.8
QQP - Dev acc/F1	91.4/88.4	3%	72.4/57.8	87.0/81.9	86.1/81.5	89.1/85.5
"threshold" is not recognised when I try to pass it as an input to the transformer. Below is the output I am getting:
W0529 17:55:39.823194 140241775109952 masked_run_glue.py:838] Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
/home/charles/anaconda3/envs/bertprune/lib/python3.6/site-packages/transformers/data/processors/glue.py:284: FutureWarning: This processor will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py
  warnings.warn(DEPRECATION_WARNING.format("processor"), FutureWarning)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing MaskedBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing MaskedBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MaskedBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of MaskedBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.attention.self.query.mask_scores', 'bert.encoder.layer.0.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.0.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.0.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.0.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.attention.self.key.mask_scores', 'bert.encoder.layer.0.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.0.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.0.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.0.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.attention.self.value.mask_scores', 'bert.encoder.layer.0.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.0.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.0.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.0.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.attention.output.dense.mask_scores', 'bert.encoder.layer.0.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.0.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.0.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.0.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.0.attention.output.dense.shuffler.out_mapping', 
'bert.encoder.layer.0.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.intermediate.dense.mask_scores', 'bert.encoder.layer.0.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.0.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.0.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.0.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.0.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.0.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.0.output.dense.mask_scores', 'bert.encoder.layer.0.output.dense.ampere_permut_scores', 'bert.encoder.layer.0.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.0.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.0.output.dense.shuffler.in_mapping', 'bert.encoder.layer.0.output.dense.shuffler.out_mapping', 'bert.encoder.layer.0.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.self.query.mask_scores', 'bert.encoder.layer.1.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.1.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.1.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.1.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.self.key.mask_scores', 'bert.encoder.layer.1.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.1.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.1.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.1.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.self.value.mask_scores', 
'bert.encoder.layer.1.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.1.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.1.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.1.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.attention.output.dense.mask_scores', 'bert.encoder.layer.1.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.1.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.1.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.1.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.1.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.1.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.intermediate.dense.mask_scores', 'bert.encoder.layer.1.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.1.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.1.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.1.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.1.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.1.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.1.output.dense.mask_scores', 'bert.encoder.layer.1.output.dense.ampere_permut_scores', 'bert.encoder.layer.1.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.1.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.1.output.dense.shuffler.in_mapping', 'bert.encoder.layer.1.output.dense.shuffler.out_mapping', 'bert.encoder.layer.1.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.self.query.mask_scores', 'bert.encoder.layer.2.attention.self.query.ampere_permut_scores', 
'bert.encoder.layer.2.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.2.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.2.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.self.key.mask_scores', 'bert.encoder.layer.2.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.2.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.2.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.2.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.self.value.mask_scores', 'bert.encoder.layer.2.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.2.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.2.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.2.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.attention.output.dense.mask_scores', 'bert.encoder.layer.2.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.2.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.2.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.2.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.2.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.2.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.intermediate.dense.mask_scores', 'bert.encoder.layer.2.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.2.intermediate.dense.shuffler.in_permutation_scores', 
'bert.encoder.layer.2.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.2.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.2.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.2.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.2.output.dense.mask_scores', 'bert.encoder.layer.2.output.dense.ampere_permut_scores', 'bert.encoder.layer.2.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.2.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.2.output.dense.shuffler.in_mapping', 'bert.encoder.layer.2.output.dense.shuffler.out_mapping', 'bert.encoder.layer.2.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.self.query.mask_scores', 'bert.encoder.layer.3.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.3.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.3.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.3.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.self.key.mask_scores', 'bert.encoder.layer.3.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.3.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.3.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.3.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.self.value.mask_scores', 'bert.encoder.layer.3.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.3.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.self.value.shuffler.in_mapping', 
'bert.encoder.layer.3.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.3.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.attention.output.dense.mask_scores', 'bert.encoder.layer.3.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.3.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.3.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.3.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.3.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.3.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.intermediate.dense.mask_scores', 'bert.encoder.layer.3.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.3.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.3.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.3.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.3.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.3.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.3.output.dense.mask_scores', 'bert.encoder.layer.3.output.dense.ampere_permut_scores', 'bert.encoder.layer.3.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.3.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.3.output.dense.shuffler.in_mapping', 'bert.encoder.layer.3.output.dense.shuffler.out_mapping', 'bert.encoder.layer.3.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.self.query.mask_scores', 'bert.encoder.layer.4.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.4.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.4.attention.self.query.shuffler.out_mapping', 
'bert.encoder.layer.4.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.self.key.mask_scores', 'bert.encoder.layer.4.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.4.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.4.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.4.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.self.value.mask_scores', 'bert.encoder.layer.4.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.4.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.4.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.4.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.attention.output.dense.mask_scores', 'bert.encoder.layer.4.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.4.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.4.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.4.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.4.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.4.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.4.intermediate.dense.mask_scores', 'bert.encoder.layer.4.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.4.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.4.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.4.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.4.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.4.intermediate.dense.shuffler.out_mapping_reverse', 
'bert.encoder.layer.4.output.dense.mask_scores', 'bert.encoder.layer.4.output.dense.ampere_permut_scores', 'bert.encoder.layer.4.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.4.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.4.output.dense.shuffler.in_mapping', 'bert.encoder.layer.4.output.dense.shuffler.out_mapping', 'bert.encoder.layer.4.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.self.query.mask_scores', 'bert.encoder.layer.5.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.5.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.5.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.5.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.self.key.mask_scores', 'bert.encoder.layer.5.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.5.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.5.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.5.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.self.value.mask_scores', 'bert.encoder.layer.5.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.5.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.5.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.5.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.attention.output.dense.mask_scores', 'bert.encoder.layer.5.attention.output.dense.ampere_permut_scores', 
'bert.encoder.layer.5.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.5.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.5.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.5.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.5.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.intermediate.dense.mask_scores', 'bert.encoder.layer.5.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.5.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.5.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.5.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.5.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.5.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.5.output.dense.mask_scores', 'bert.encoder.layer.5.output.dense.ampere_permut_scores', 'bert.encoder.layer.5.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.5.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.5.output.dense.shuffler.in_mapping', 'bert.encoder.layer.5.output.dense.shuffler.out_mapping', 'bert.encoder.layer.5.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.self.query.mask_scores', 'bert.encoder.layer.6.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.6.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.6.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.6.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.6.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.self.key.mask_scores', 'bert.encoder.layer.6.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.6.attention.self.key.shuffler.in_permutation_scores', 
'bert.encoder.layer.6.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.6.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.6.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.self.value.mask_scores', 'bert.encoder.layer.6.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.6.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.6.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.6.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.6.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.attention.output.dense.mask_scores', 'bert.encoder.layer.6.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.6.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.6.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.6.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.6.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.6.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.intermediate.dense.mask_scores', 'bert.encoder.layer.6.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.6.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.6.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.6.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.6.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.6.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.6.output.dense.mask_scores', 'bert.encoder.layer.6.output.dense.ampere_permut_scores', 'bert.encoder.layer.6.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.6.output.dense.shuffler.out_permutation_scores', 
'bert.encoder.layer.6.output.dense.shuffler.in_mapping', 'bert.encoder.layer.6.output.dense.shuffler.out_mapping', 'bert.encoder.layer.6.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.self.query.mask_scores', 'bert.encoder.layer.7.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.7.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.7.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.7.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.self.key.mask_scores', 'bert.encoder.layer.7.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.7.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.7.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.7.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.self.value.mask_scores', 'bert.encoder.layer.7.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.7.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.7.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.7.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.attention.output.dense.mask_scores', 'bert.encoder.layer.7.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.7.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.7.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.7.attention.output.dense.shuffler.in_mapping', 
'bert.encoder.layer.7.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.7.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.intermediate.dense.mask_scores', 'bert.encoder.layer.7.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.7.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.7.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.7.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.7.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.7.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.7.output.dense.mask_scores', 'bert.encoder.layer.7.output.dense.ampere_permut_scores', 'bert.encoder.layer.7.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.7.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.7.output.dense.shuffler.in_mapping', 'bert.encoder.layer.7.output.dense.shuffler.out_mapping', 'bert.encoder.layer.7.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.attention.self.query.mask_scores', 'bert.encoder.layer.8.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.8.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.8.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.8.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.attention.self.key.mask_scores', 'bert.encoder.layer.8.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.8.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.8.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.8.attention.self.key.shuffler.out_mapping_reverse', 
'bert.encoder.layer.8.attention.self.value.mask_scores', 'bert.encoder.layer.8.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.8.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.8.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.8.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.attention.output.dense.mask_scores', 'bert.encoder.layer.8.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.8.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.8.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.8.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.8.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.8.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.intermediate.dense.mask_scores', 'bert.encoder.layer.8.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.8.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.8.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.8.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.8.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.8.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.8.output.dense.mask_scores', 'bert.encoder.layer.8.output.dense.ampere_permut_scores', 'bert.encoder.layer.8.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.8.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.8.output.dense.shuffler.in_mapping', 'bert.encoder.layer.8.output.dense.shuffler.out_mapping', 'bert.encoder.layer.8.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.self.query.mask_scores', 'bert.encoder.layer.9.attention.self.query.ampere_permut_scores', 
'bert.encoder.layer.9.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.9.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.9.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.self.key.mask_scores', 'bert.encoder.layer.9.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.9.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.9.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.9.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.self.value.mask_scores', 'bert.encoder.layer.9.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.9.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.9.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.9.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.attention.output.dense.mask_scores', 'bert.encoder.layer.9.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.9.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.9.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.9.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.9.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.9.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.intermediate.dense.mask_scores', 'bert.encoder.layer.9.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.9.intermediate.dense.shuffler.in_permutation_scores', 
'bert.encoder.layer.9.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.9.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.9.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.9.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.9.output.dense.mask_scores', 'bert.encoder.layer.9.output.dense.ampere_permut_scores', 'bert.encoder.layer.9.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.9.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.9.output.dense.shuffler.in_mapping', 'bert.encoder.layer.9.output.dense.shuffler.out_mapping', 'bert.encoder.layer.9.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.self.query.mask_scores', 'bert.encoder.layer.10.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.10.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.10.attention.self.query.shuffler.in_mapping', 'bert.encoder.layer.10.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.10.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.self.key.mask_scores', 'bert.encoder.layer.10.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.10.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.10.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.10.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.10.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.self.value.mask_scores', 'bert.encoder.layer.10.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.10.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.self.value.shuffler.out_permutation_scores', 
'bert.encoder.layer.10.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.10.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.10.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.attention.output.dense.mask_scores', 'bert.encoder.layer.10.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.10.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.10.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.10.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.10.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.10.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.intermediate.dense.mask_scores', 'bert.encoder.layer.10.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.10.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.10.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.10.intermediate.dense.shuffler.in_mapping', 'bert.encoder.layer.10.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.10.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.10.output.dense.mask_scores', 'bert.encoder.layer.10.output.dense.ampere_permut_scores', 'bert.encoder.layer.10.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.10.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.10.output.dense.shuffler.in_mapping', 'bert.encoder.layer.10.output.dense.shuffler.out_mapping', 'bert.encoder.layer.10.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.self.query.mask_scores', 'bert.encoder.layer.11.attention.self.query.ampere_permut_scores', 'bert.encoder.layer.11.attention.self.query.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.self.query.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.self.query.shuffler.in_mapping', 
'bert.encoder.layer.11.attention.self.query.shuffler.out_mapping', 'bert.encoder.layer.11.attention.self.query.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.self.key.mask_scores', 'bert.encoder.layer.11.attention.self.key.ampere_permut_scores', 'bert.encoder.layer.11.attention.self.key.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.self.key.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.self.key.shuffler.in_mapping', 'bert.encoder.layer.11.attention.self.key.shuffler.out_mapping', 'bert.encoder.layer.11.attention.self.key.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.self.value.mask_scores', 'bert.encoder.layer.11.attention.self.value.ampere_permut_scores', 'bert.encoder.layer.11.attention.self.value.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.self.value.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.self.value.shuffler.in_mapping', 'bert.encoder.layer.11.attention.self.value.shuffler.out_mapping', 'bert.encoder.layer.11.attention.self.value.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.attention.output.dense.mask_scores', 'bert.encoder.layer.11.attention.output.dense.ampere_permut_scores', 'bert.encoder.layer.11.attention.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.11.attention.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.11.attention.output.dense.shuffler.in_mapping', 'bert.encoder.layer.11.attention.output.dense.shuffler.out_mapping', 'bert.encoder.layer.11.attention.output.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.intermediate.dense.mask_scores', 'bert.encoder.layer.11.intermediate.dense.ampere_permut_scores', 'bert.encoder.layer.11.intermediate.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.11.intermediate.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.11.intermediate.dense.shuffler.in_mapping', 
'bert.encoder.layer.11.intermediate.dense.shuffler.out_mapping', 'bert.encoder.layer.11.intermediate.dense.shuffler.out_mapping_reverse', 'bert.encoder.layer.11.output.dense.mask_scores', 'bert.encoder.layer.11.output.dense.ampere_permut_scores', 'bert.encoder.layer.11.output.dense.shuffler.in_permutation_scores', 'bert.encoder.layer.11.output.dense.shuffler.out_permutation_scores', 'bert.encoder.layer.11.output.dense.shuffler.in_mapping', 'bert.encoder.layer.11.output.dense.shuffler.out_mapping', 'bert.encoder.layer.11.output.dense.shuffler.out_mapping_reverse', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
I0529 17:55:44.691655 140241775109952 masked_run_glue.py:904] Training/evaluation parameters Namespace(adam_epsilon=1e-08, alpha_ce=0.5, alpha_distil=0.5, cache_dir='', config_name='', data_dir='../data/glue_data/CoLA', device=device(type='cuda'), do_eval=True, do_lower_case=True, do_train=True, eval_all_checkpoints=False, evaluate_during_training=True, final_lambda=0.0, final_threshold=0.15, final_warmup=2, fp16=False, fp16_opt_level='O1', global_topk=False, global_topk_frequency_compute=25, gradient_accumulation_steps=1, initial_threshold=1.0, initial_warmup=1, learning_rate=3e-05, local_rank=-1, logging_steps=50, mask_init='constant', mask_scale=0.0, mask_scores_learning_rate=0.01, max_grad_norm=1.0, max_seq_length=128, max_steps=-1, model_name_or_path='bert-base-uncased', model_type='masked_bert', n_gpu=1, no_cuda=False, num_train_epochs=5.0, output_dir='../outputs1/softmvp/bert-uncased-warmup-glue-cola', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=8, pruning_method='topK', regularization=None, save_steps=1000, seed=42, task_name='cola', teacher_name_or_path=None, teacher_type=None, temperature=2.0, tokenizer_name='', warmup_steps=5400, weight_decay=0.0)
I0529 17:55:44.692138 140241775109952 masked_run_glue.py:529] Loading features from cached file ../data/glue_data/CoLA/cached_train_bert-base-uncased_128_cola
I0529 17:55:44.834930 140241775109952 masked_run_glue.py:183] ***** Running training *****
I0529 17:55:44.835000 140241775109952 masked_run_glue.py:184]   Num examples = 8551
I0529 17:55:44.835042 140241775109952 masked_run_glue.py:185]   Num Epochs = 5
I0529 17:55:44.835366 140241775109952 masked_run_glue.py:186]   Instantaneous batch size per GPU = 8
I0529 17:55:44.835401 140241775109952 masked_run_glue.py:191]   Total train batch size (w. parallel, distributed & accumulation) = 8
I0529 17:55:44.835433 140241775109952 masked_run_glue.py:193]   Gradient Accumulation steps = 1
I0529 17:55:44.835463 140241775109952 masked_run_glue.py:194]   Total optimization steps = 5345
Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
          | 0/1069 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "masked_run_glue.py", line 956, in <module>
    main()
  File "masked_run_glue.py", line 909, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
  File "masked_run_glue.py", line 275, in train
    outputs = model(**inputs)
  File "/home/charles/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'threshold'
I believe line 272 should set inputs["current_config"] rather than inputs["threshold"]. However, inputs["current_config"] expects three keys: 'threshold', 'ampere_temperature', and 'shuffling_temperature', and I am not sure what values to use for 'ampere_temperature' and 'shuffling_temperature'. masked_run_squad.py already sets them, but masked_run_glue.py does not, and its schedule_threshold() function differs from the one in masked_run_squad.py.
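To make the question concrete, here is a minimal sketch of what I think the GLUE training loop would need to build at each step. It assumes the model's forward() accepts a "current_config" dict with the three keys named above. The helper names cubic_threshold and build_current_config are hypothetical (not from the repository), the cubic schedule is my own reconstruction of a movement-pruning style threshold schedule, and the two temperature values are placeholders, since the correct values are exactly what I do not know.

```python
def cubic_threshold(step, total_steps, warmup_steps,
                    initial_threshold, final_threshold,
                    initial_warmup=1, final_warmup=2):
    """Hold the initial threshold, decay it cubically, then hold the final one.

    Hypothetical helper: a sketch of a cubic sparsity schedule, not the
    repository's schedule_threshold().
    """
    start = initial_warmup * warmup_steps
    end = total_steps - final_warmup * warmup_steps
    if step <= start:
        return initial_threshold
    if step > end:
        return final_threshold
    # Fraction of the decay phase already completed, in (0, 1].
    progress = (step - start) / (end - start)
    return final_threshold + (initial_threshold - final_threshold) * (1 - progress) ** 3


def build_current_config(threshold, ampere_temperature, shuffling_temperature):
    """Pack the three keys that inputs["current_config"] is said to need."""
    return {
        "threshold": threshold,
        "ampere_temperature": ampere_temperature,
        "shuffling_temperature": shuffling_temperature,
    }


# Example with made-up step counts; the 0.0 temperatures are placeholders only,
# not recommended values.
threshold = cubic_threshold(step=500, total_steps=1000, warmup_steps=100,
                            initial_threshold=1.0, final_threshold=0.15)
current_config = build_current_config(threshold,
                                      ampere_temperature=0.0,
                                      shuffling_temperature=0.0)
print(current_config)
```

In the training loop this dict would replace the single inputs["threshold"] entry, i.e. inputs["current_config"] = current_config, assuming that is the keyword forward() actually expects.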