Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
This is the PyTorch implementation of Sparse Progressive Distillation (SPD). For more details on the motivation, techniques, and experimental results, please refer to our paper here.
Running
- Environment Preparation (using Python 3)
pip install -r requirements.txt
- Dataset Preparation
The original GLUE dataset can be downloaded here.
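The training scripts below expect the raw GLUE files laid out under $GLUE_DIR/$TASK_NAME. As an optional convenience, the following minimal sketch fetches and inspects a single task with the Hugging Face datasets package; this package is an assumption and is not required by this repo.

# Optional: fetch a GLUE task with the Hugging Face `datasets` package.
# Assumption: `datasets` is installed; the run scripts below still expect
# the raw GLUE files under $GLUE_DIR/$TASK_NAME.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # DatasetDict with train/validation/test splits
print(sst2["train"][0])               # e.g. {'sentence': ..., 'label': 1, 'idx': 0}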
BERT_base fine-tuning on GLUE
We use the fine-tuned BERT_base as the teacher. For each task of the GLUE benchmark, we obtain the fine-tuned model using the original Hugging Face Transformers code with the following script.
python run_glue.py \
--model_name_or_path $INT_DIR \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 3e-5 \
--num_train_epochs 4.0 \
--output_dir $OUT_DIR \
--evaluate_during_training \
--overwrite_output_dir \
--logging_steps 400 \
--logging_dir $OUT_DIR \
--save_steps 10000
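Before using the teacher for distillation, it can help to sanity-check the checkpoint written to $OUT_DIR. The snippet below is a minimal sketch using the standard Hugging Face Transformers API; the placeholder path is hypothetical and should be replaced with your own $OUT_DIR.

# Minimal sanity check of the fine-tuned teacher.
# Assumption: standard Hugging Face Transformers API; the path is a placeholder for $OUT_DIR.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

teacher_dir = "path/to/OUT_DIR"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(teacher_dir)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_dir).eval()

inputs = tokenizer("a charming and often affecting journey.", return_tensors="pt")
with torch.no_grad():
    logits = teacher(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class id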
Sparse Progressive Distillation
We use run_glue.py to run sparse progressive distillation. --num_prune_epochs sets the number of epochs used for pruning, and --num_train_epochs sets the total number of epochs (pruning, progressive distillation, and fine-tuning).
python run_glue.py \
--model_name_or_path PATH_TO_FINETUNED_MODEL \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/$TASK_NAME/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 6.4e-4 \
--save_steps 50 \
--num_prune_epochs 30 \
--num_train_epochs 60 \
--sparsity 0.9 \
--output_dir $OUT_DIR \
--evaluate_during_training \
--replacing_rate 0.8 \
--overwrite_output_dir \
--steps_for_replacing 0 \
--scheduler_type linear
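The actual schedule is implemented inside run_glue.py. Purely as an illustration of how --num_prune_epochs, --num_train_epochs, and --sparsity relate, the sketch below assumes a linear ramp of the target sparsity over the pruning epochs followed by a fixed target for the remaining epochs; the helper name and the ramp shape are assumptions, not the repo's implementation.

# Illustrative sketch only: per-epoch target sparsity under an assumed linear ramp.
# The real schedule lives in run_glue.py; this helper is hypothetical.
def target_sparsity(epoch: int,
                    num_prune_epochs: int = 30,
                    num_train_epochs: int = 60,
                    final_sparsity: float = 0.9) -> float:
    if epoch < num_prune_epochs:
        # pruning phase: sparsity grows linearly toward the target
        return final_sparsity * (epoch + 1) / num_prune_epochs
    # progressive distillation / fine-tuning phase: sparsity is held at the target
    return final_sparsity

print([round(target_sparsity(e), 2) for e in (0, 14, 29, 30, 59)])
# -> [0.03, 0.45, 0.9, 0.9, 0.9]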
To Dos
- Provide our teacher model for each task.
- Provide the best-performing model checkpoint for each task.