Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

Overview

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers more flexibility when using our training scripts, while also making it easier to adapt our code contributions into other projects.

Why DinkyTrain?

The Dinky runs between Princeton Junction and Princeton and is the shortest scheduled commuter rail line in the United States. We also aim to make pre-training short and accessible to everyone.

Our Contributions

DeepSpeed transformer kernel integration
A training recipe for efficient MLM pre-training
An easy-to-follow guide to using fairseq for MLM pre-training

Other fairseq features:

Multi-GPU training on one machine or across multiple machines (data and model parallel)
Gradient accumulation enables training with large mini-batches even on a single GPU
Mixed precision training (trains faster with less GPU memory on NVIDIA tensor cores)
Extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers
Flexible configuration based on Hydra allowing a combination of code, command-line and file based configuration
Full parameter and optimizer state sharding
Offloading parameters to CPU

See the fairseq repo and its documentation for more details on how to use and extend fairseq.
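
To make the gradient accumulation item in the list above concrete, here is a minimal pure-Python sketch. It is not fairseq's implementation; `grad_mse` and `accumulated_grad` are hypothetical helpers on a one-parameter linear model, used only to show that summing suitably weighted micro-batch gradients reproduces the gradient of the full mini-batch.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate gradients over micro-batches instead of one large batch.

    Each micro-batch gradient is weighted by its share of the full batch,
    so the accumulated value matches the full-batch gradient.
    """
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        chunk = batch[start:start + micro_batch_size]
        total += grad_mse(w, chunk) * len(chunk) / len(batch)
    return total


# Hypothetical toy data: (input, target) pairs.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
full = grad_mse(0.5, batch)
accumulated = accumulated_grad(0.5, batch, micro_batch_size=2)
```

Only one micro-batch has to fit in GPU memory at a time, which is why a large effective batch size remains possible on a single device.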

DinkyTrain for Efficient MLM Pre-training

Overview

You can reproduce the pre-training experiments of our recent paper Should You Mask 15% in Masked Language Modeling?, where we find that higher masking rates can lead to more efficient pre-training.
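
For intuition about what a masking rate means, the following is a simplified sketch, not the masking code used in this repository. `mask_tokens` and `MASK_TOKEN` are hypothetical names, and the sketch always substitutes a mask token, ignoring other corruption strategies.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder symbol, assumed for illustration


def mask_tokens(tokens, mask_rate, seed=0):
    """Replace a fixed fraction of tokens with MASK_TOKEN.

    Returns the corrupted sequence and the prediction targets
    (the original token at masked positions, None elsewhere).
    """
    rng = random.Random(seed)
    num_to_mask = round(len(tokens) * mask_rate)
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    corrupted = [MASK_TOKEN if i in positions else tok
                 for i, tok in enumerate(tokens)]
    targets = [tok if i in positions else None
               for i, tok in enumerate(tokens)]
    return corrupted, targets


tokens = [f"tok{i}" for i in range(20)]
corrupted_15, targets_15 = mask_tokens(tokens, mask_rate=0.15)
corrupted_40, targets_40 = mask_tokens(tokens, mask_rate=0.40)
```

With a higher rate, each sequence yields more prediction targets but leaves less context for the model to condition on.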

Installation

PyTorch version >= 1.5.0
Python version >= 3.6
To install fairseq and develop locally:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

For faster training (FP16) install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

For faster training (DeepSpeed CUDA kernel), install the DeepSpeed library and compile the DeepSpeed kernel:

DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 pip install deepspeed

For large datasets install PyArrow: pip install pyarrow
If you use Docker, make sure to increase the shared memory size with either --ipc=host or --shm-size as a command-line option to nvidia-docker run.

Troubleshooting:

If you are using an older version of Python, you might encounter import problems with importlib.metadata. Try pip install importlib-metadata.
To install apex and deepspeed, you will need nvcc (the CUDA compiler).
When installing apex, if you encounter the error Cuda extensions are being compiled with a version of Cuda that does not match ..., go to setup.py and comment out the line that raised the error (at your own risk).
Both the apex and deepspeed installations require a recent gcc version that supports C++14. If you encounter related errors, update your gcc.

Data Pre-processing

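Tokenization: First, download the GPT2 BPE vocabulary and the fairseq dictionary used in the binarization step below:

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt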

Then, tokenize your raw data:

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json gpt2_bpe/encoder.json \
    --vocab-bpe gpt2_bpe/vocab.bpe \
    --inputs ${SPLIT}.raw \
    --outputs ${SPLIT}.bpe \
    --keep-empty \
    --workers 8

Finally, index and binarize your data:

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref ${TRAIN_SPLIT}.bpe \
    --validpref ${VALID_SPLIT}.bpe \
    --testpref ${TEST_SPLIT}.bpe \
    --destdir output-bin \
    --workers 8

Alternatively, use our pre-processed data: we preprocessed Wikipedia+BookCorpus and shared it as a Hugging Face dataset. It is ~22GB and contains two epochs of data, with each epoch sliced into 8 shards. You can download it using git:

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/wikibook_fairseq_format

Pre-training

Use our script for efficient pre-training

GPU={number of GPUs} DATA_DIR={data path} [DEEPSPEED=1] bash run_efficient_mlm_recipe.sh

Flags explained

GPU: number of GPUs.
DATA_DIR: directory to the processed pre-training data. If you are using our preprocessed dataset, DATA_DIR should be:

DATA_DIR=$(seq 0 15 | sed -e 's/^/wikibook_fairseq_format\/bin-shard/' | sed -e 's/$/-8/' | paste -sd ':')

DEEPSPEED (optional): if set to 1, the DeepSpeed CUDA kernel will be used.

Please refer to the script for more hyperparameter choices.
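
For readers less familiar with seq, sed and paste, the DATA_DIR one-liner above expands to a colon-separated list of 16 shard directories (two epochs of 8 shards each). A minimal Python sketch of the same expansion follows; `build_data_dir` is a hypothetical helper and not part of the repository.

```python
def build_data_dir(root="wikibook_fairseq_format", num_shards=16, suffix="-8"):
    """Mirror the shell pipeline: seq 0 15 | sed (prefix) | sed (suffix) | paste -sd ':'."""
    shards = [f"{root}/bin-shard{i}{suffix}" for i in range(num_shards)]
    return ":".join(shards)


data_dir = build_data_dir()
shard_paths = data_dir.split(":")
```

The resulting string can be exported as DATA_DIR before calling the pre-training script, provided the dataset was cloned into the current directory.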

Fine-tuning on GLUE and SQuAD

All our checkpoints can be converted to HuggingFace transformers models (see the next section) and fine-tuned with the transformers package. Fairseq also supports fine-tuning on GLUE.

First, download the preprocessed GLUE data (you can also process it yourself following the Data Pre-processing section above):

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/glue_fairseq_format

Then use the following script for fine-tuning:

DATA_DIR={path to the data directory} \
TASK={glue task name (mnli qnli qqp rte sst2 mrpc cola stsb)} \
LR={learning rate} \
BSZ={batch size} \
EPOCHS={number of epochs} \
SEED={random seed} \
CKPT_DIR={checkpoint's directory} \
CKPT_NAME={checkpoint's name} \
[DEEPSPEED=1] bash finetune_glue.sh

For fine-tuning on SQuAD, please convert the models to HuggingFace checkpoints following the next section and use HuggingFace's examples.

Convert to HuggingFace

We also provide conversion code so that you can easily turn Fairseq checkpoints into HuggingFace checkpoints. Usage:

cd scripts
[PRELAYERNORM=1] [FROM_DS=1] python convert_fs_ckpt_to_hf_ckpt.py --fr {fairseq checkpoint} --to {huggingface checkpoint path} --hf_model_config {roberta-base/roberta-large}

Flags explained:

PRELAYERNORM=1: The checkpoint uses pre layer-norm (default is post layer-norm).
FROM_DS=1: The Fairseq checkpoint uses DeepSpeed's CUDA kernel.
--fr: The path to the Fairseq checkpoint.
--to: The path you want to save the HuggingFace checkpoint to.
--hf_model_config: roberta-base or roberta-large.

IMPORTANT: all our models use pre layer norm, which is not supported by HuggingFace yet. To use it, import the model class from huggingface/modeling_roberta_prelayernorm.py. For example:

from huggingface.modeling_roberta_prelayernorm import RobertaForSequenceClassification

For more configuration, please refer to convert_fs_ckpt_to_hf_ckpt.py.
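
To illustrate the difference between pre and post layer-norm mentioned above, here is a minimal pure-Python sketch of a single residual block in each style. It is a conceptual illustration only, not the code in modeling_roberta_prelayernorm.py; `layer_norm`, `post_ln_block` and `pre_ln_block` are hypothetical names, and the learned scale and bias of a real layer norm are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post layer-norm: normalize after adding the residual."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre layer-norm: normalize the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def zero_sublayer(h):
    """Stand-in for attention or feed-forward that outputs zeros."""
    return [0.0] * len(h)


x = [1.0, 2.0, 4.0]
pre_out = pre_ln_block(x, zero_sublayer)
post_out = post_ln_block(x, zero_sublayer)
```

Because the two styles place the normalization at different points, weights trained with one cannot simply be loaded into a model class that implements the other.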

Model List

Here are the HuggingFace checkpoints of our models in the paper Should You Mask 15% in Masked Language Modeling?. Results are development set performance.

Model	MNLI	QNLI	QQP	SST-2
princeton-nlp/efficient_mlm_m0.15	84.2	90.9	87.8	93.3
princeton-nlp/efficient_mlm_m0.20	84.1	91.3	87.9	92.7
princeton-nlp/efficient_mlm_m0.30	84.2	91.6	88.0	93.0
princeton-nlp/efficient_mlm_m0.40	84.5	91.6	88.1	92.8
princeton-nlp/efficient_mlm_m0.50	84.1	91.1	88.1	92.7
princeton-nlp/efficient_mlm_m0.60	83.2	90.7	87.8	92.6
princeton-nlp/efficient_mlm_m0.70	82.3	89.4	87.5	91.9
princeton-nlp/efficient_mlm_m0.80	80.8	87.9	87.1	90.5
princeton-nlp/efficient_mlm_m0.15-801010	83.7	90.4	87.8	93.2
princeton-nlp/efficient_mlm_m0.40-801010	84.3	91.2	87.9	93.0

We also offer the original (deepspeed) fairseq checkpoints here.

Bugs or Questions?

If you have any questions, encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!

Citation

@article{wettig2022should,
   title={Should You Mask 15% in Masked Language Modeling?},
   author={Wettig, Alexander and Gao, Tianyu and Zhong, Zexuan and Chen, Danqi},
   journal={arXiv preprint arXiv:2202.08005},
   year={2022}
}

Acknowledgment

Our package is based on fairseq:

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Our efficient training recipe is based on the following paper:

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train BERT with an academic budget. In Empirical Methods in Natural Language Processing (EMNLP), pages 10644–10652.

Comments

Why do you use last checkpoint for validation rather than best checkpoint?

Hi,

in your script https://github.com/princeton-nlp/DinkyTrain/blob/main/finetune_glue.sh, it seems you use the last checkpoint for validation. May I ask why you don't use the best checkpoint?

In addition, if you use the last checkpoint for validation, does that mean you also use it for testing?

opened by BaohaoLiao 4
Prediction values of the STS-B test set are not in 0~5

Hi,

As we know, the STS-B task is a regression task where the targets are in [0, 5]. The .csv file submitted to the GLUE leaderboard is also required to be in [0, 5]. Otherwise, errors appear in the GLUE submission system.

During preprocessing of the GLUE data, the fairseq script normalizes the target values to [0, 1]. MSE loss is applied to compute the difference between the logits and the normalized targets. For prediction, we need to multiply by 5.

However, during prediction, how can we make sure the predicted values are restricted to [0, 1], given that no activation function such as sigmoid is applied to the logits?

opened by BaohaoLiao 4
Roberta recipe in your paper is different from the original recipe

Hi,

in Table 8 of https://arxiv.org/pdf/2202.08005.pdf, the recipe is different from the original Roberta recipe. Roberta_large uses a batch size of 8196, a peak learning rate of 4e-4 and trains for 100K steps. Your parameter setting seems to come from Table 3 in https://arxiv.org/pdf/1907.11692.pdf, but that recipe is used for Roberta_base rather than Roberta_large.

opened by BaohaoLiao 3
Could you please explain how to set the hyperparameters for GLUE?

Hello, I noticed that you gave a hyperparameter search space for the GLUE datasets. I am confused about how you searched the hyperparameters. Did you train each task on each hyperparameter setting with different seeds? There are about fifty combinations of parameters. Did you fine-tune on GLUE with each of them? Thank you.

opened by leoozy 2
fairseq-train: error: argument --arch/-a: invalid choice: 'deepspeed_roberta_large'

Hello, thank you for your code. I am trying to reproduce your results with "GPU=8 DATA_DIR=/dev/gbert/dataset DEEPSPEED=1 bash run_efficient_mlm_recipe.sh", but I got an error:

fairseq-train: error: argument --arch/-a: invalid choice: 'deepspeed_roberta_large'

I noticed that you added the prefix "deepspeed_${ARCH}" in run_efficient_mlm_recipe.sh. And this arch is not registered in fairseq I think. Could you please tell me why you add a prefix and how to solve the problem? Thank you!

opened by leoozy 2
Why do you use both layer_norm for embedding and pre-norm at the same time?
Hi,

this is great work that makes BERT pre-training more transparent. Here I have a question about your architecture.

In https://github.com/princeton-nlp/DinkyTrain/blob/main/run_efficient_mlm_recipe.sh, you set the following two flags:

--arch roberta_large --encoder-normalize-before

I understand you want to use pre-norm BERT. However, the default setting for roberta_large is --layernorm-embedding=True, which means you will apply two layernorm layers consecutively after the word embedding layer. I think you also need to set --layernorm-embedding=False.
opened by BaohaoLiao 2
Fixes princeton-nlp/DinkyTrain#5

As pointed out in #5, there is a problem in https://github.com/princeton-nlp/DinkyTrain/blob/main/scripts/convert_fs_ckpt_to_hf_ckpt.py that loads the incorrect architecture for models trained with DeepSpeed, even with FROM_DS=1.

This is caused by a misconfiguration where new_ckpt["cfg"]['model'].arch and new_ckpt["cfg"]['model']._name retain values deepspeed_roberta_x causing https://github.com/princeton-nlp/DinkyTrain/blob/1f26b99815547cd09762cd34dc980571d10454a5/scripts/convert_fs_ckpt_to_hf_ckpt.py#L62 to load DeepSpeedRobertaModel instead of RobertaModel.

This PR improves the https://github.com/princeton-nlp/DinkyTrain/blob/main/scripts/convert_dsfs_ckpt_to_fs_ckpt.py script to convert DeepSpeed models to fairseq by removing any deepspeed_ prefix from new_ckpt["cfg"]['model'].arch and new_ckpt["cfg"]['model']._name.

opened by carlosejimenez 0
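
The prefix removal described in the pull request above can be illustrated with a short sketch. This is not the code in convert_dsfs_ckpt_to_fs_ckpt.py; `strip_deepspeed_prefix` is a hypothetical helper, and a plain dict stands in for the checkpoint's model config object.

```python
def strip_deepspeed_prefix(model_cfg, prefix="deepspeed_"):
    """Return a copy of the config with the prefix removed from arch and _name."""
    cleaned = dict(model_cfg)
    for key in ("arch", "_name"):
        value = cleaned.get(key)
        if isinstance(value, str) and value.startswith(prefix):
            cleaned[key] = value[len(prefix):]
    return cleaned


# Hypothetical config values, for illustration only.
ds_cfg = {"arch": "deepspeed_roberta_large", "_name": "deepspeed_roberta_large"}
fs_cfg = strip_deepspeed_prefix(ds_cfg)
```

With the prefix removed, a loader that selects the model class from the architecture name would resolve to the plain RoBERTa model rather than the DeepSpeed variant.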

Owner

Princeton Natural Language Processing
