The Codebase for Causal Distillation for Language Models.

Zen

Last update: Dec 31, 2022

Overview

Causal Distillation for Language Models

Zhengxuan Wu*,Atticus Geiger*, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D. Goodman

This is an implementation of our preprint Causal Distillation for Language Models. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT).
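To make the third objective more concrete, here is a toy sketch of a single interchange intervention: the network runs on a base input, but some of its hidden units are overwritten with the values they take on a source input. The two-layer network, its weights, and the function names are all hypothetical and are not taken from this repository, which works on BERT-style hidden states. In IIT, the student is trained so that its output under such an intervention matches the teacher's output under the aligned intervention.

```python
# Toy interchange intervention on a tiny two-layer network (illustrative only).
# All names and weights here are hypothetical, not the repository's code.

def hidden(x, w1):
    """First layer: one ReLU unit per row of w1."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]

def output(h, w2):
    """Second layer: a single linear readout of the hidden state."""
    return sum(wi * hi for wi, hi in zip(w2, h))

def interchange(base, source, w1, w2, positions):
    """Run on `base`, but overwrite the hidden units in `positions`
    with the values they take when the network runs on `source`."""
    h_base = hidden(base, w1)
    h_source = hidden(source, w1)
    for p in positions:
        h_base[p] = h_source[p]
    return output(h_base, w2)

W1 = [[1.0, 0.0], [0.0, 1.0]]   # identity first layer
W2 = [1.0, 10.0]                # readout weights

base, source = [1.0, 2.0], [3.0, 4.0]
plain = output(hidden(base, W1), W2)               # no intervention
swapped = interchange(base, source, W1, W2, [1])   # hidden unit 1 comes from `source`
```

With these weights, the plain run reads out 1 + 10·2 and the intervened run reads out 1 + 10·4, so the swapped unit alone accounts for the difference.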

We fork our main codebase from the Huggingface Distillation Interface.

Release Notes

✅ 12/02/2021 Our paper on Interchange Intervention Training (IIT) is released! Read this more formal definition of the method.
✅ 12/06/2021 Released the causal distillation codebase with the preprint.
✅ 12/06/2021 Released evaluation results on distilled tiny-BERT (3 layers) with the Wiki-Text 103M dataset.
⬜️ Released evaluation results on causal-distilled tiny-BERT (3 layers) with the Wiki-Text 103M + BookCorpus dataset.
⬜️ Released evaluation results on causal-distilled BERT (6 layers) with the Wiki-Text 103M + BookCorpus dataset.
⬜️ Released more ablation studies.
⬜️ Released causal-distilled tiny-BERT (3 layers) model files.
⬜️ Released causal-distilled BERT (6 layers) model files.

If you experience any issues or have suggestions, please contact me either through the issues page or at [email protected].

Benchmark Results

Here are the results on the dev sets of GLUE:

Model	Average-score	CoLA	MNLI	MRPC	QNLI	QQP	RTE	SST-2	STS-B	WNLI
DistilBERT (3 layers)	67.8¹	22.8	71.6	78.2	82.1	84.3	55.4	86.5	56.7	24.2
CausalBERT (3 layers)	69.7¹	25.0	72.9	78.6	83.1	84.9	55.4	86.9	66.5	21.5

¹ Average-score computed without WNLI.

Citation

If you use this repository, please cite the following two papers: the paper on interchange intervention training and the paper on our distillation method.

  @article{geiger-etal-2021-iit,
        title={Inducing Causal Structure for Interpretable Neural Networks}, 
        author={Geiger, Atticus and Wu, Zhengxuan and Lu, Hanson and Rozner, Josh and Kreiss, Elisa and Icard, Thomas and Goodman, Noah D. and Potts, Christopher},
        year={2021},
        eprint={2112.00826},
        archivePrefix={arXiv},
        primaryClass={cs.LG}
  }

  @article{wu-etal-2021-distill,
        title={Causal Distillation for Language Models}, 
        author={Wu, Zhengxuan and Geiger, Atticus and Rozner, Josh and Kreiss, Elisa and Lu, Hanson and Icard, Thomas and Potts, Christopher and Goodman, Noah D.},
        year={2021},
        eprint={2112.02505},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
  }

Requirements

Python 3.6 and 3.7 are supported.
PyTorch version: 1.9.0
Transformers version: 4.11.3
Datasets version: 1.8.0
We ran our experiments on Titan V GPUs. We assume 12GB of GPU memory (more memory can expedite training).
Since our codebase is built on the Huggingface Distillation Interface, please review their documentation for additional requirements.

Dataset

Following the Huggingface Distillation Interface, the datasets must be pre-processed (binarized) before distillation. You can refer to their repo for details. We adapt their pre-processing scripts and add a few improvements. For example, we can now binarize datasets directly from the Huggingface Dataset Hub.

# preprocessing from disk
python scripts/binarized_data.py \
--file_path ../../bert-mid-tuning/data-files/wikitext-15M \
--split train \
--field_name text \
--max_parsing_example 1000 \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file ./data/binarized_text

# preprocessing from huggingface.
python scripts/binarized_data.py \
--dataset_name bookcorpus \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file bookcorpus-dataset/binarized_text \
--cache_dir ./distill_cache/

python scripts/binarized_data.py \
--dataset_name wikitext \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file wikitext-dataset/binarized_text \
--cache_dir ./distill_cache/

python scripts/binarized_data.py \
--dataset_name wikitext+bookcorpus \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file wikitext+bookcorpus-dataset/binarized_text \
--cache_dir ./distill_cache/

# helper scripts to combine two binarized data files
python scripts/data_combinator.py \
--file_path_left ./bookcorpus-dataset/binarized_text.train.bert-base-uncased.pickle \
--file_path_right ./wikitext-dataset/binarized_text.train.bert-base-uncased.pickle \
--split train \
--tokenizer_name bert-base-uncased \
--dump_file wikitext+bookcorpus-dataset/binarized_text

# multiprocessing preprocessor.
python scripts/binarized_data.py \
--dataset_name bookcorpus \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file bookcorpus-dataset/binarized_text \
--cache_dir ./distill_cache/ \
--fast_process \
--preprocessing_num_workers 48

After you get the datasets ready, you need to generate token counts as well.

python scripts/token_counts.py \
--data_file data/binarized_text.train.bert-base-uncased.pickle \
--token_counts_dump data/binarized_text.train.token_counts.bert-base-uncased.pickle \
--vocab_size 30522
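
As a rough picture of what this step produces, the sketch below counts how often each token id occurs and stores the result in a list of length vocab_size. It assumes the binarized data is a collection of token-id sequences; the function name and the toy data are hypothetical, and the actual script may differ in its details.

```python
from collections import Counter

def count_tokens(sequences, vocab_size):
    """Return a list where entry i is how often token id i occurs."""
    counter = Counter()
    for seq in sequences:
        counter.update(seq)
    counts = [0] * vocab_size          # ids that never occur keep a count of 0
    for token_id, n in counter.items():
        counts[token_id] = n
    return counts

# Tiny stand-in for binarized data: each sequence is a list of token ids.
toy_sequences = [[101, 7, 7, 102], [101, 9, 102]]
toy_counts = count_tokens(toy_sequences, vocab_size=200)
```

The --vocab_size value of 30522 in the command above matches the vocabulary size of bert-base-uncased, so the resulting list has one entry per vocabulary item.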

Distillation

Before training, we recommend initializing your student model with weights extracted from the teacher model.

python scripts/extract_distilbert.py \
--model_type bert \
--model_name bert-base-uncased \
--dump_checkpoint ./distillation_checkpoints/bert-base-uncased_num_layer_3.pth \
--num_layers 3
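
The idea behind this kind of initialization can be sketched as follows: keep a subset of the teacher's layers, renumber them consecutively, and copy every other weight unchanged. This is an illustration under assumptions, not the script itself. The parameter names, the placeholder values, and in particular the choice of which teacher layers to keep are hypothetical, and the real script may also rename parameters to match the student architecture.

```python
import re

def extract_layers(teacher_state, keep_layers):
    """Copy non-layer weights as-is and keep only `keep_layers`,
    renumbering them 0..len(keep_layers)-1 for the student."""
    new_index = {old: new for new, old in enumerate(keep_layers)}
    student_state = {}
    for name, value in teacher_state.items():
        match = re.search(r"layer\.(\d+)\.", name)
        if match is None:
            student_state[name] = value          # embeddings, heads, ...
            continue
        old = int(match.group(1))
        if old in new_index:
            renamed = name[:match.start(1)] + str(new_index[old]) + name[match.end(1):]
            student_state[renamed] = value
    return student_state

# Hypothetical 12-layer teacher with one placeholder weight per layer.
teacher = {"embeddings.word_embeddings.weight": "E"}
teacher.update({f"encoder.layer.{i}.attention.weight": f"L{i}" for i in range(12)})

# Which teacher layers to keep is an assumption here, not the script's choice.
student = extract_layers(teacher, keep_layers=[0, 5, 11])
```

In this toy run the student ends up with three layers numbered 0 to 2, holding the weights of teacher layers 0, 5, and 11.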

Here are two examples: the first distills with our causal distillation objective, and the second distills without it:

CUDA_VISIBLE_DEVICES=9,4 python causal_train.py \
--force \
--n_gpu 2 \
--is_wandb \
--log_interval 10 \
--student_type distilbert \
--student_config ./training_configs/distilbert-base-uncased-small.json \
--student_pretrained_weights ./distillation_checkpoints/bert-base-uncased_num_layer_3.pth \
--teacher_type bert \
--teacher_name bert-base-uncased \
--neuron_mapping ./training_configs/single_middle.nm \
--mlm --alpha_ce 0.25 --alpha_mlm 0.25 --alpha_cos 0.25 --alpha_clm 0.0 --alpha_causal 0.25 \
--freeze_pos_embs \
--dump_path ./results/ \
--data_file ./wikitext-15M/binarized_text.train.bert-base-uncased.pickle \
--token_counts ./wikitext-15M/binarized_text.train.token_counts.bert-base-uncased.pickle \
--seed 42 \
--gradient_accumulation_steps 50 \
--n_epoch 3 \
--batch_size 5

CUDA_VISIBLE_DEVICES=0,1,2,3 python causal_train.py \
--force \
--n_gpu 4 \
--is_wandb \
--log_interval 10 \
--student_type distilbert \
--student_config ./training_configs/distilbert-base-uncased-small.json \
--student_pretrained_weights ./distillation_checkpoints/bert-base-uncased_num_layer_3.pth \
--teacher_type bert \
--teacher_name bert-base-uncased \
--neuron_mapping ./training_configs/single_middle.nm \
--mlm --alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --alpha_clm 0.0 --alpha_causal 0.00 \
--freeze_pos_embs \
--dump_path ./results/ \
--data_file ./wikitext-15M/binarized_text.train.bert-base-uncased.pickle \
--token_counts ./wikitext-15M/binarized_text.train.token_counts.bert-base-uncased.pickle \
--seed 42 \
--gradient_accumulation_steps 124 \
--n_epoch 6 \
--batch_size 4
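
The two commands above differ in GPU count, batch size, and gradient accumulation steps, so it can help to compare how many examples feed each optimizer update. The helper below is a hypothetical sketch. It assumes --batch_size is a per-GPU value; if the batch is instead split across GPUs, pass per_device=False to drop the GPU factor.

```python
def effective_batch_size(batch_size, grad_accum_steps, n_gpu=1, per_device=True):
    """Number of examples contributing to one optimizer update."""
    devices = n_gpu if per_device else 1
    return batch_size * grad_accum_steps * devices

# Values copied from the two example commands.
first_run = effective_batch_size(batch_size=5, grad_accum_steps=50, n_gpu=2)
second_run = effective_batch_size(batch_size=4, grad_accum_steps=124, n_gpu=4)
```

Under the per-GPU assumption, the first command works out to 500 examples per update and the second to 1984.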

Note that you can turn our causal distillation objective on or off through the --alpha_causal argument: a positive weight such as 0.25 enables it, and 0.0 disables it.
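
A minimal sketch of how such weights can combine the objectives, assuming the alpha flags act as linear weights on the individual loss terms, as in the Hugging Face distillation code this repository is forked from. The function name and the loss values are made up for illustration.

```python
def combine_losses(losses, alphas):
    """Weighted sum of the loss terms; a term with weight 0 is skipped."""
    total = 0.0
    for name, weight in alphas.items():
        if weight > 0.0:
            total += weight * losses[name]
    return total

# Placeholder loss values for one batch (made up for illustration).
losses = {"ce": 2.0, "mlm": 4.0, "cos": 1.0, "causal": 3.0}

# Weights copied from the two example commands.
with_causal = combine_losses(
    losses, {"ce": 0.25, "mlm": 0.25, "cos": 0.25, "causal": 0.25})
without_causal = combine_losses(
    losses, {"ce": 0.33, "mlm": 0.33, "cos": 0.33, "causal": 0.0})
```

Because a zero-weight term is skipped entirely, setting the causal weight to 0.0 reduces the total to the three standard distillation terms.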

Evaluation

After you get your distilled models, you need to fine-tune and evaluate them on downstream tasks. We provide all the scripts you need.

MLM Evaluation

CUDA_VISIBLE_DEVICES=5 python run_mlm.py \
--model_name_or_path ./results/s_distilbert_t_bert_data_wikitext-15M_seed_42_mlm_True_ce_0.25_mlm_0.25_cos_0.25_causal_0.25_nm_single_multilayer/ \
--dataset_dir ../../bert-mid-tuning/data-files/wikitext-15M/ \
--tokenizer_name bert-base-uncased \
--do_eval \
--output_dir /tmp/test-mlm \
--cache_dir ./distill_cache/

GLUE Evaluation

CUDA_VISIBLE_DEVICES=5,7,8,9 python run_glue.py \
--model_name_or_path ./results/s_distilbert_t_bert_data_wikitext-dataset_seed_42_mlm_True_ce_0.33_mlm_0.33_cos_0.33_causal_0.0_nm_single_middle/ \
--tokenizer_name bert-base-uncased \
--task_name sst2 \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir ./results/ \
--save_total_limit 1 \
--cache_dir ./distill_cache/
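
The command above fine-tunes on a single task (sst2). To cover all nine GLUE columns in the benchmark table, one option is to generate one command per task, as in the sketch below. The flags are copied from the command above. The model directory is a placeholder, the per-task output directory is an assumption rather than the layout used for the reported results, and the task names are the ones accepted by the Hugging Face run_glue.py example script.

```python
import shlex

GLUE_TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

def glue_command(model_dir, task, output_root="./results"):
    """Build the run_glue.py argument list for one task."""
    return [
        "python", "run_glue.py",
        "--model_name_or_path", model_dir,
        "--tokenizer_name", "bert-base-uncased",
        "--task_name", task,
        "--do_train", "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3",
        "--output_dir", f"{output_root}/{task}",   # one directory per task (assumption)
        "--save_total_limit", "1",
        "--cache_dir", "./distill_cache/",
    ]

# "./results/my_student/" is a placeholder for your distilled model directory.
commands = [glue_command("./results/my_student/", t) for t in GLUE_TASKS]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

The script only prints the commands; run them yourself, prefixed with whichever CUDA_VISIBLE_DEVICES setting fits your machine.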

CoNLL Evaluation

CUDA_VISIBLE_DEVICES=2,3,7,8 python run_ner.py \
--model_name_or_path ./results/s_distilbert_t_bert_data_wikitext-dataset_seed_42_mlm_True_ce_0.33_mlm_0.33_cos_0.33_causal_0.0_nm_single_middle_crossway_False/ \
--tokenizer_name bert-base-uncased \
--dataset_name conll2003 \
--do_train \
--do_eval \
--output_dir ./ner_results/ \
--save_total_limit 1 \
--cache_dir ./distill_cache/

SQuAD Evaluation

CUDA_VISIBLE_DEVICES=2,3,7,8 python run_qa.py \
--model_name_or_path ./results/s_distilbert_t_bert_data_wikitext-dataset_seed_42_mlm_True_ce_0.33_mlm_0.33_cos_0.33_causal_0.0_nm_single_middle_crossway_False/ \
--tokenizer_name bert-base-uncased \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_total_limit 1 \
--output_dir ./qa_results/

Comments

Pre-Training questions
Hi @frankaging ,

thanks for releasing the code and paper for your causal distillation approach :hugs:

I have some basic questions regarding the distillation process:

I would like to train new distilled models for some of my previously trained models (such as DistilBERTurk or German DistilBERT), so I would like to know which implementation I should use. E.g. there's a current dev branch, which has some more recent changes.

In the current readme, the following command is used for causal distillation:

CUDA_VISIBLE_DEVICES=9,4 python causal_train.py \
--force \
--n_gpu 2 \
--is_wandb \
--log_interval 10 \
--student_type distilbert \
--student_config ./training_configs/distilbert-base-uncased-small.json \
--student_pretrained_weights ./distillation_checkpoints/bert-base-uncased_num_layer_3.pth \
--teacher_type bert \
--teacher_name bert-base-uncased \
--neuron_mapping ./training_configs/single_middle.nm \
--mlm --alpha_ce 0.25 --alpha_mlm 0.25 --alpha_cos 0.25 --alpha_clm 0.0 --alpha_causal 0.25 \
--freeze_pos_embs \
--dump_path ./results/ \
--data_file ./wikitext-15M/binarized_text.train.bert-base-uncased.pickle \
--token_counts ./wikitext-15M/binarized_text.train.token_counts.bert-base-uncased.pickle \
--seed 42 \
--gradient_accumulation_steps 50 \
--n_epoch 3 \
--batch_size 5

This raised a few questions: it seems that only 2 GPUs are used, whereas the paper mentions 4 TITAN GPUs. The total batch size per device is 50 x 5 = 250, so the total batch size used for training is 2 x 250 = 500. Could you please specify the hyperparameters you used for the model in the paper? :thinking:

Did you perform some experiments with using fp16?

Many thanks in advance!
opened by stefan-it 2

forward() got an unexpected keyword argument 'interchanged_variables'

Hi @frankaging, when I run causal_train.py I get the error: forward() got an unexpected keyword argument 'interchanged_variables'

Log:

01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - Using MLM loss for LM step.
01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - --- Initializing model optimizer
01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - ------ Number of trainable parameters (student): 91252605
01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - ------ Number of parameters (student): 91450749
01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - Distiller initialization done.
01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - Starting training
01/14/2022 07:41:12 - INFO - utils - PID: 2342567 - --- Starting epoch 0/3
-Iter: 0%| | 0/335884 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "causal_train.py", line 434, in <module>
    distiller.train()
  File "/home/chinhh/workspace/Causal-Distill/distillation/causal_distiller.py", line 483, in train
    is_crossway=self.params.include_crossway,
  File "/home/chinhh/workspace/Causal-Distill/distillation/causal_distiller.py", line 574, in step
    skip_update_iter=False,
  File "/home/chinhh/workspace/Causal-Distill/distillation/causal_distiller.py", line 639, in _step
    sampled_interchange_position=sampled_interchange_position,
  File "/home/chinhh/miniconda3/envs/distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'interchanged_variables'

Please help me. Thanks for releasing the code and paper for your causal distillation approach.

opened by Huynh-Chinh 5
