MPNet: Masked and Permuted Pre-training for Language Understanding

Overview

MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It solves the problems of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning on a variety of language understanding tasks (GLUE, SQuAD, RACE, etc.).

Installation

We implement MPNet and this pre-training toolkit based on the codebase of fairseq. Installation is as follows:

pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn

Pre-training MPNet

Our model is pre-trained with the BERT dictionary, so you first need to pip install transformers to use the BERT tokenizer. We provide a script encode.py and a dictionary file dict.txt to tokenize your corpus. You can modify encode.py if you want to use other tokenizers (like RoBERTa).

1) Preprocess data

We use WikiText-103 as a demo. The running script is as follows:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
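
Conceptually, encode.py applies the BERT tokenizer from transformers to each line of the raw corpus. The short sketch below illustrates that per-line step under some assumptions (bert-base-uncased vocabulary, one line of space-separated tokens per input line); the actual script also handles --workers and --keep-empty:

# Hypothetical, simplified sketch of the tokenization performed by encode.py
# (assumes the bert-base-uncased tokenizer; the real script adds multiprocessing
# and the --keep-empty handling shown above).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("wikitext-103-raw/wiki.valid.raw", encoding="utf-8") as fin, \
     open("wikitext-103-raw/wiki.valid.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")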

Then, we need to binarize the data with the following command:

fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

2) Pre-train MPNet

The command below trains an MPNet model:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'
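
As a rough guide, the effective batch size implied by these settings is MAX_SENTENCES x UPDATE_FREQ sequences per update (per GPU); the short sketch below just spells out that arithmetic:

# Effective batch size implied by the settings above (assuming a single GPU;
# with N GPUs the sequence count is multiplied by N).
max_sentences = 16       # MAX_SENTENCES: sequences per GPU per forward pass
update_freq = 16         # UPDATE_FREQ: gradient accumulation steps
tokens_per_sample = 512  # TOKENS_PER_SAMPLE: max tokens per sequence

sequences_per_update = max_sentences * update_freq            # 256 sequences
tokens_per_update = sequences_per_update * tokens_per_sample  # up to 131,072 tokens
print(sequences_per_update, tokens_per_update)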

Note: You can replace the arch with mpnet_rel_base and add the flags --mask-whole-words --bpe bert to use relative position embeddings and whole word masking.

Note: You can set --input-mode to mlm or plm to train a masked language model or a permuted language model instead.

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

import torch
from fairseq.models.masked_permutation_net import MPNet
mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)
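
As a quick sanity check after loading, the short sketch below relies only on the torch.nn.Module interface asserted above (parameter counting and switching to eval mode); task-specific helpers depend on the fairseq hub interface:

# Minimal sanity check; uses only the torch.nn.Module interface of mpnet.model.
mpnet.model.eval()  # disable dropout for deterministic inference
num_params = sum(p.numel() for p in mpnet.model.parameters())
print(f"MPNet parameters: {num_params / 1e6:.1f}M")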

Fine-tuning MPNet on downstream tasks

Acknowledgements

Our code is based on fairseq-0.8.0. Thanks for their contribution to the open-source community.

Reference

If you find this toolkit useful in your work, please cite the corresponding paper listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}

Related Works

Comments
  • Would you release the code to convert MPNet to the transformers format?

    Would you release the code to convert MPNet to the transformers "microsoft/mpnet-base" format? I want to convert my pre-trained model to the transformers format.

    opened by RyanHuangNLP 3
  • Training on SQuAD2 gives worse results on the evaluation set (the paper reports better results)

    I tried training MPNet on SQuAD2 data; below is the result I was getting on the evaluation set.

    I used this script

    (exact, 50.07159100480081)                                                                                  
    (f1, 50.07159100480081)
    (total, 11873)
    (HasAns_exact, 0.0)
    (HasAns_f1, 0.0)
    (HasAns_total, 5928)
    (NoAns_exact, 100.0)
    (NoAns_f1, 100.0)
    (NoAns_total, 5945)
    (best_exact, 50.07159100480081)
    (best_exact_thresh, 0.0)
    (best_f1, 50.07159100480081)
    (best_f1_thresh, 0.0)
    
    opened by bhadreshpsavani 1
  • Input mode questions

    Is input mode mlm like RoBERTa, and plm like XLNet?

    And by the way, would you provide a script to convert the model format to the Hugging Face transformers one?

    opened by RyanHuangNLP 1
  • Inconsistencies between data collator output and masked permute in original paper

    Hi all on the MPNet research team,

    I am in the process of converting the fairseq training code for MPNet into a training loop that is compatible with Huggingface. Although many of the convenience classes already exist in Huggingface (like MPNetForMaskedLM), one thing that has become clear to us is that we will need to port over the collator function in MaskedDataset (under tasks/masked_permutation_lm).

    In exploring how this collator works, I understand the logic as:

    1. Permute input IDs (based on whole word spans or tokens via arg) and positions
    2. Create masked/corrupted tokens based on the final n indices of the permuted sequence, where n is the prediction size (i.e. seq_len x 0.15 at default values)
    3. Concat these together using concat(seq, mask, mask) and concat(positions, predict_positions, predict_positions)

    Using this logic, we might expect the collator function to perform the below operation on some dummy input IDs:

    src_tokens = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
    
    # Once the collator permutes everything and we append the mask portions, we expect something like
    new_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
    new_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]
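
    A minimal, hypothetical sketch of the permute-and-concat logic described in steps 1-3 above (simplified: random token-level permutation, masks only, no corrupted tokens; this is not the actual fairseq MaskedDataset collator):

    import random

    def sketch_collate(src_tokens, positions, pred_ratio=0.15, mask_id=-1):
        # 1) Permute tokens and their positions together.
        order = list(range(len(src_tokens)))
        random.shuffle(order)
        perm_tokens = [src_tokens[i] for i in order]
        perm_positions = [positions[i] for i in order]

        # 2) The final n permuted indices become the prediction targets.
        n = max(1, int(len(src_tokens) * pred_ratio))
        pred_positions = perm_positions[-n:]

        # 3) Concat: keep the full permuted sequence, then append two groups of
        #    masks that reuse the predicted positions.
        new_ids = perm_tokens + [mask_id] * n + [mask_id] * n
        new_positions = perm_positions + pred_positions + pred_positions
        return new_ids, new_positions

    ids, pos = sketch_collate(list(range(10, 31)), list(range(21)))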
    

    However, after rereading the MPNet paper, especially sections 2.2 and 2.3 with attention to Figure 2, it would SEEM that the output of the collator is incongruous with what is described in these sections.

    Figure 2 points out that the content and query masks are built using a permuted sequence that looks like:

    src_tokens = [x_1, x_2, x_3, x_4, x_5, x_6]
    
    # Once permuted we get:
    new_ids = [x_1, x_3, x_5, <mask>, <mask>, <mask>,  x_4, x_6, x_2]
    new_positions = [1, 3, 5, 4, 6, 2, 4, 6, 2]
    

    In this example within the paper, we are masking the pred_len tokens and then appending the content to the end for the content stream. However, the collator output KEEPS the token content in the main sequence, and then adds TWO batches of mask tokens to the end, which to me seems necessarily different than what's described in the paper. Referring back to our dummy example above, I can outline the discrepancies I'm seeing:

    collator_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
    collator_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]
    
    paper_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22, <mask>,  <corrupted>, <mask>, 16, 24, 25]
    paper_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15]
    

    My question, then, is this: am I correct in understanding that the collator implementation is different than what's described in the paper? If so, why?

    opened by alex-barbet 0
  • How to use DeepSpeed?

    https://github.com/microsoft/DeepSpeed MPNet suffers from slow training time; DeepSpeed could significantly reduce the time needed, and transformers (Hugging Face) apparently supports it. Is there any guide/sample code on how to enable it for MPNet?

    opened by LifeIsStrange 0
  • The future is to combine MPNet with other language model innovations

    For example, it could really make sense to adapt MPNet to preserve PLM but use the approach of ELECTRA for MLM. SpanBERT has some potential too (e.g. on coreference resolution). I believe this could really push the state of the art of accuracy on key tasks.

    What do you think? @StillKeepTry @tan-xu

    Moreover, there is important low-hanging fruit that has been consistently ignored by transformer researchers:

    The activation function used should probably be https://github.com/digantamisra98/Mish as it is the one that gives the most accuracy gains in general. It can give 1% accuracy gains, which is huge.

    Secondly, the optimizer you're using, Adam, is flawed and you should use its rectified version: https://github.com/LiyuanLucasLiu/RAdam Moreover, it can optionally be combined with a complementary optimizer: https://github.com/michaelrzhang/lookahead

    Moreover, there are newer training techniques that yield significant accuracy gains, such as https://github.com/Yonghongwei/Gradient-Centralization and gradient normalization.

    There is a library that integrates all those advances and more here: https://github.com/lessw2020/Ranger21

    Accuracy gains in NLP/NLU have reached a plateau. The reason is that researchers work far too much in isolation. They bring N new innovations per year, but the number of researchers who attempt to use those innovations/optimizations together can be counted on the fingers of one hand.

    XLNet has been consistently ignored by researchers; you are the ones who saw the opportunity to combine the best of both worlds of BERT and XLNet. Why stop there? As I said, both transformer/language-model-wise and activation-function/optimizer-wise, there are a LOT of significant accuracy optimizations to integrate into the successor of MPNet. Aggregating those optimizations could yield a revolutionary language model with 5-10% accuracy gains on average over the existing SOTA. It would mark history. No one else will attempt to combine a wide range of those innovations; you are the only hope. If you do not do it, I'm afraid no one else will and NLU will stagnate for the decade to come.

    opened by LifeIsStrange 0
  • How to continue pretraining from the released checkpoint?

    Hello, thank you for releasing the code for pre-training MPNet! I am trying to continue training the language model on a custom dataset from the released checkpoint using the --restore-file argument. However, I am not able to successfully load the checkpoint. It fails with the following error:

    File "MPNet/pretraining/fairseq/checkpoint_utils.py", line 307, in _upgrade_state_dict
        registry.set_defaults(state['args'], tasks.TASK_REGISTRY[state['args'].task])
    KeyError: 'mixed_position_lm'

    In case it helps, here are the details of the training command:

    WARMUP_UPDATES=50000    # Warmup the learning rate over this many updates
    PEAK_LR=0.0005          # Peak learning rate, adjust as needed
    TOKENS_PER_SAMPLE=512   # Max sequence length
    MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
    MAX_SENTENCES=35        # Number of sequences per batch (batch size)
    UPDATE_FREQ=16          # Increase the batch size 16x
    
    DATA_DIR=data-bin
    
    fairseq-train --fp16 $DATA_DIR \
      --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
      --arch mpnet_base --sample-break-mode none --tokens-per-sample $TOKENS_PER_SAMPLE \
      --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
      --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES \ 
      --total-num-update $TOTAL_UPDATES   --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
      --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --skip-invalid-size-inputs-valid-test \
      --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'\
      --restore-file mpnet.base/mpnet.pt --save-interval-updates 10 --ddp-backend no_c10d
    

    I would appreciate any insights on how to resolve this error. Thank you!

    opened by ast123 0
  • Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.

    When running the training script for SQuAD, I was getting the error below.

    Traceback (most recent call last):
      File "/media/data2/anaconda/envs/mpnet/bin/fairseq-train", line 33, in <module>
        sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 370, in cli_main
        main(args)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 47, in main
        task = tasks.setup_task(args)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/__init__.py", line 17, in setup_task
        return TASK_REGISTRY[args.task].setup_task(args, **kwargs)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 104, in setup_task
        return cls(args, dictionary)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 84, in __init__
        self.tokenizer = SQuADTokenizer(args.bpe_vocab_file, dictionary)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 42, in __init__
        self.max_len_single_sentence = self.max_len - 2
      File "/media/data2/anaconda/envs/mpnet/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1547, in max_len_single_sentence
        raise ValueError(
    ValueError: Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.
    

    By commenting out lines 42 and 43 in the file

     self.max_len_single_sentence = self.max_len - 2
     self.max_len_sentences_pair = self.max_len - 3
    

    That resolves the error, but is it fine to do so?

    When I ran the script, I was getting a lower F1 score and exact match than mentioned in the paper. I also created an issue for that.

    opened by bhadreshpsavani 1