MPNet: Masked and Permuted Pre-training for Language Understanding

Overview

MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It solves the problems of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning on a variety of language understanding tasks (GLUE, SQuAD, RACE, etc.).

Installation

We implement MPNet and this pre-training toolkit based on the codebase of fairseq. Installation is as follows:

pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn

Pre-training MPNet

Our model is pre-trained with the BERT dictionary, so you first need to pip install transformers to use the BERT tokenizer. We provide a script encode.py and a dictionary file dict.txt to tokenize your corpus. You can modify encode.py if you want to use other tokenizers (like RoBERTa).

1) Preprocess data

We use WikiText-103 as a demo. The running script is as follows:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
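
Conceptually, encode.py applies the BERT tokenizer from transformers to each line of the raw corpus. The short sketch below illustrates that per-line step under some assumptions (bert-base-uncased vocabulary, one line of space-separated tokens per input line); the actual script also handles --workers and --keep-empty:

# Hypothetical, simplified sketch of the tokenization performed by encode.py
# (assumes the bert-base-uncased tokenizer; the real script adds multiprocessing
# and the --keep-empty handling shown above).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("wikitext-103-raw/wiki.valid.raw", encoding="utf-8") as fin, \
     open("wikitext-103-raw/wiki.valid.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")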

Then, we need to binarize the data with the following command:

fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

2) Pre-train MPNet

The command below trains an MPNet model:

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'
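
As a rough guide, the effective batch size implied by these settings is MAX_SENTENCES x UPDATE_FREQ sequences per update (per GPU); the short sketch below just spells out that arithmetic:

# Effective batch size implied by the settings above (assuming a single GPU;
# with N GPUs the sequence count is multiplied by N).
max_sentences = 16       # MAX_SENTENCES: sequences per GPU per forward pass
update_freq = 16         # UPDATE_FREQ: gradient accumulation steps
tokens_per_sample = 512  # TOKENS_PER_SAMPLE: max tokens per sequence

sequences_per_update = max_sentences * update_freq            # 256 sequences
tokens_per_update = sequences_per_update * tokens_per_sample  # up to 131,072 tokens
print(sequences_per_update, tokens_per_update)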

Note: You can replace the arch with mpnet_rel_base and add the flags --mask-whole-words --bpe bert to use relative position embeddings and whole word masking.

Note: You can set --input-mode to mlm or plm to train a masked language model or a permuted language model instead.

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

import torch
from fairseq.models.masked_permutation_net import MPNet
mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)
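
As a quick sanity check after loading, the short sketch below relies only on the torch.nn.Module interface asserted above (parameter counting and switching to eval mode); task-specific helpers depend on the fairseq hub interface:

# Minimal sanity check; uses only the torch.nn.Module interface of mpnet.model.
mpnet.model.eval()  # disable dropout for deterministic inference
num_params = sum(p.numel() for p in mpnet.model.parameters())
print(f"MPNet parameters: {num_params / 1e6:.1f}M")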

Fine-tuning MPNet on downstream tasks

Acknowledgements

Our code is based on fairseq-0.8.0. Thanks for their contribution to the open-source community.

Reference

If you find this toolkit useful in your work, please cite the corresponding paper listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}

Related Works

Comments
  • Would you release the code to convert MPNet to the transformers format?

    Would you release the code to convert MPNet to the transformers "microsoft/mpnet-base" format? I want to convert my pre-trained model to the transformers format.

    opened by RyanHuangNLP 3
  • Training on SQuAD2 gives worse results on the evaluation set (the paper reports better results)

    I tried training MPNet on SQuAD2 data; below is the result I was getting on the evaluation set.

    I used this script

    (exact, 50.07159100480081)                                                                                  
    (f1, 50.07159100480081)
    (total, 11873)
    (HasAns_exact, 0.0)
    (HasAns_f1, 0.0)
    (HasAns_total, 5928)
    (NoAns_exact, 100.0)
    (NoAns_f1, 100.0)
    (NoAns_total, 5945)
    (best_exact, 50.07159100480081)
    (best_exact_thresh, 0.0)
    (best_f1, 50.07159100480081)
    (best_f1_thresh, 0.0)
    
    opened by bhadreshpsavani 1
  • Input mode questions

    Is input mode mlm like RoBERTa, and plm like XLNet?

    And by the way, would you provide a script to convert the model format to the Hugging Face transformers one?

    opened by RyanHuangNLP 1
  • Inconsistencies between data collator output and masked permute in original paper

    Hi all on the MPNet research team,

    I am in the process of converting the fairseq training code for MPNet into a training loop that is compatible with Huggingface. Although many of the convenience classes already exist in Huggingface (like MPNetForMaskedLM), one thing that has become clear to us is that we will need to port over the collator function in MaskedDataset (under tasks/masked_permutation_lm).

    In exploring how this collator works, I understand the logic as:

    1. Permute input IDs (based on whole word spans or tokens via arg) and positions
    2. Create masked/corrupted tokens based on the final n indices of the permuted sequence, where n is the prediction size (i.e. seq_len x 0.15 at default values)
    3. Concat these together using concat(seq, mask, mask) and concat(positions, predict_positions, predict_positions)

    Using this logic, we might expect the collator function to perform the below operation on some dummy input IDs:

    src_tokens = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
    
    # Once the collator permutes everything and we append the mask portions, we expect something like
    new_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
    new_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]
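
    A minimal, hypothetical sketch of the permute-and-concat logic described in steps 1-3 above (simplified: random token-level permutation, masks only, no corrupted tokens; this is not the actual fairseq MaskedDataset collator):

    import random

    def sketch_collate(src_tokens, positions, pred_ratio=0.15, mask_id=-1):
        # 1) Permute tokens and their positions together.
        order = list(range(len(src_tokens)))
        random.shuffle(order)
        perm_tokens = [src_tokens[i] for i in order]
        perm_positions = [positions[i] for i in order]

        # 2) The final n permuted indices become the prediction targets.
        n = max(1, int(len(src_tokens) * pred_ratio))
        pred_positions = perm_positions[-n:]

        # 3) Concat: keep the full permuted sequence, then append two groups of
        #    masks that reuse the predicted positions.
        new_ids = perm_tokens + [mask_id] * n + [mask_id] * n
        new_positions = perm_positions + pred_positions + pred_positions
        return new_ids, new_positions

    ids, pos = sketch_collate(list(range(10, 31)), list(range(21)))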
    

    However, after rereading the MPNet paper, especially sections 2.2 and 2.3 with attention to Figure 2, it would SEEM that the output of the collator is incongruous with what is described in these sections.

    Figure 2 points out that the content and query masks are built using a permuted sequence that looks like:

    src_tokens = [x_1, x_2, x_3, x_4, x_5, x_6]
    
    # Once permuted we get:
    new_ids = [x_1, x_3, x_5, <mask>, <mask>, <mask>,  x_4, x_6, x_2]
    new_positions = [1, 3, 5, 4, 6, 2, 4, 6, 2]
    

    In this example within the paper, we are masking the pred_len tokens and then appending the content to the end for the content stream. However, the collator output KEEPS the token content in the main sequence, and then adds TWO batches of mask tokens to the end, which to me seems necessarily different than what's described in the paper. Referring back to our dummy example above, I can outline the discrepancies I'm seeing:

    collator_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22,  16,  24,  25, <mask>,  <corrupted>, <mask>, <mask>,  <corrupted>, <mask>]
    collator_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15,  6, 14, 15]
    
    paper_ids = [ 20,  23,  30,  14,  15,  27,  28,  11,  12,  17,  18,  26,  29,  13, 10,  19,  21,  22, <mask>,  <corrupted>, <mask>, 16, 24, 25]
    paper_positions = [10, 13, 20,  4,  5, 17, 18,  1,  2,  7,  8, 16, 19,  3,  0,  9, 11, 12, 6, 14, 15,  6, 14, 15]
    

    My question, then, is this: am I correct in understanding that the collator implementation is different than what's described in the paper? If so, why?

    opened by alex-barbet 0
  • How to use DeepSpeed?

    https://github.com/microsoft/DeepSpeed MPNet suffers from slow training time; DeepSpeed could significantly reduce the time needed, and transformers (Hugging Face) apparently supports it. Is there any guide/sample code on how to enable it for MPNet?

    opened by LifeIsStrange 0
  • The future is to combine MPNet with other language model innovations

    For example, it could really make sense to adapt MPNet to preserve PLM but use the approach of ELECTRA for MLM. SpanBERT has some potential too (e.g. on coreference resolution). I believe this could really push the state of the art of accuracy on key tasks.

    What do you think? @StillKeepTry @tan-xu

    Moreover, there is important low-hanging fruit that has been consistently ignored by transformer researchers:

    The activation function used should probably be https://github.com/digantamisra98/Mish as it is the one that gives the most accuracy gains in general. It can give 1% accuracy gains, which is huge.

    Secondly, the optimizer you're using, Adam, is flawed and you should use its rectified version: https://github.com/LiyuanLucasLiu/RAdam Moreover, it can optionally be combined with a complementary optimizer: https://github.com/michaelrzhang/lookahead

    Moreover, there are newer training techniques that yield significant accuracy gains, such as https://github.com/Yonghongwei/Gradient-Centralization and gradient normalization.

    There is a library that integrates all those advances and more here: https://github.com/lessw2020/Ranger21

    Accuracy gains in NLP/NLU have reached a plateau. The reason is that researchers work far too much in isolation. They bring N new innovations per year, but the number of researchers who attempt to use those innovations/optimizations together can be counted on the fingers of one hand.

    XLNet has been consistently ignored by researchers; you are the ones who saw the opportunity to combine the best of both worlds of BERT and XLNet. Why stop there? As I said, both transformer/language-model-wise and activation-function/optimizer-wise, there are a LOT of significant accuracy optimizations to integrate into the successor of MPNet. Aggregating those optimizations could yield a revolutionary language model with 5-10% accuracy gains on average over the existing SOTA. It would mark history. No one else will attempt to combine a wide range of those innovations; you are the only hope. If you do not do it, I'm afraid no one else will and NLU will stagnate for the decade to come.

    opened by LifeIsStrange 0
  • How to continue pretraining from the released checkpoint?

    Hello, thank you for releasing the code for pre-training MPNet! I am trying to continue training the language model on a custom dataset from the released checkpoint using the --restore-file argument. However, I am not able to successfully load the checkpoint. It fails with the following error:

    File "MPNet/pretraining/fairseq/checkpoint_utils.py", line 307, in _upgrade_state_dict
        registry.set_defaults(state['args'], tasks.TASK_REGISTRY[state['args'].task])
    KeyError: 'mixed_position_lm'

    In case it helps, here are the details of the training command:

    WARMUP_UPDATES=50000    # Warmup the learning rate over this many updates
    PEAK_LR=0.0005          # Peak learning rate, adjust as needed
    TOKENS_PER_SAMPLE=512   # Max sequence length
    MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
    MAX_SENTENCES=35        # Number of sequences per batch (batch size)
    UPDATE_FREQ=16          # Increase the batch size 16x
    
    DATA_DIR=data-bin
    
    fairseq-train --fp16 $DATA_DIR \
      --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
      --arch mpnet_base --sample-break-mode none --tokens-per-sample $TOKENS_PER_SAMPLE \
      --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
      --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES \ 
      --total-num-update $TOTAL_UPDATES   --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
      --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --skip-invalid-size-inputs-valid-test \
      --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'\
      --restore-file mpnet.base/mpnet.pt --save-interval-updates 10 --ddp-backend no_c10d
    

    I would appreciate any insights on how to resolve this error. Thank you!

    opened by ast123 0
  • Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.

    When running the training script for SQuAD, I was getting the error below.

    Traceback (most recent call last):
      File "/media/data2/anaconda/envs/mpnet/bin/fairseq-train", line 33, in <module>
        sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 370, in cli_main
        main(args)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq_cli/train.py", line 47, in main
        task = tasks.setup_task(args)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/__init__.py", line 17, in setup_task
        return TASK_REGISTRY[args.task].setup_task(args, **kwargs)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 104, in setup_task
        return cls(args, dictionary)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 84, in __init__
        self.tokenizer = SQuADTokenizer(args.bpe_vocab_file, dictionary)
      File "/media/data1/bhadresh/MPNet/MPNet/pretraining/fairseq/tasks/squad2.py", line 42, in __init__
        self.max_len_single_sentence = self.max_len - 2
      File "/media/data2/anaconda/envs/mpnet/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1547, in max_len_single_sentence
        raise ValueError(
    ValueError: Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.
    

    By commenting out lines 42 and 43 in the file

     self.max_len_single_sentence = self.max_len - 2
     self.max_len_sentences_pair = self.max_len - 3
    

    That resolves the error, but is it fine to do so?

    When I ran the script, I was getting a lower F1 score and exact match than mentioned in the paper. I also created an issue for that.

    opened by bhadreshpsavani 1