MASS: Masked Sequence to Sequence Pre-training for Language Generation

Overview

MASS: Masked Sequence to Sequence Pre-training for Language Generation, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu, is a novel pre-training method for sequence-to-sequence language generation tasks. It randomly masks a consecutive sentence fragment in the encoder input and trains the decoder to predict that fragment.
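
To make the objective concrete, here is a minimal, framework-free sketch of the masking scheme (illustrative only; the function name and the 0.5 ratio, which mirrors the --word_mass option used later, are our own assumptions):

import random

def mass_mask(tokens, mask_ratio=0.5, mask_token="[MASK]"):
    """Mask one contiguous fragment; return (encoder_input, decoder_target)."""
    n = len(tokens)
    span = max(1, int(n * mask_ratio))
    start = random.randint(0, n - span)
    fragment = tokens[start:start + span]
    # The encoder sees the sentence with the fragment replaced by mask tokens.
    encoder_input = tokens[:start] + [mask_token] * span + tokens[start + span:]
    # The decoder is trained to reconstruct only the masked fragment,
    # conditioning on the unmasked context through the encoder.
    return encoder_input, fragment

enc_in, dec_tgt = mass_mask("we randomly mask a sentence fragment".split())
print(enc_in)   # e.g. ['we', 'randomly', '[MASK]', '[MASK]', '[MASK]', 'fragment']
print(dec_tgt)  # e.g. ['mask', 'a', 'sentence']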

MASS can be applied to cross-lingual tasks such as neural machine translation (NMT), and to monolingual tasks such as text summarization. The current codebase supports unsupervised NMT (implemented based on XLM), as well as supervised NMT, text summarization, and conversational response generation (implemented based on Fairseq). We will release our implementation for other sequence to sequence generation tasks in the future.

What is New!

We release MPNet, a new pre-training method for language understanding. GitHub: https://github.com/microsoft/MPNet

Unsupervised NMT

Unsupervised neural machine translation uses only monolingual data to train the models. During MASS pre-training, the source and target languages are pre-trained in one model, with the corresponding language embeddings to differentiate the languages (a toy sketch follows the model table below). During MASS fine-tuning, back-translation is used to train the unsupervised models. Code is under MASS-unsupNMT. We provide pre-trained and fine-tuned models:

Languages | Pre-trained Model | Fine-tuned Model | BPE codes | Vocabulary
EN-FR     | MODEL             | MODEL            | BPE codes | Vocabulary
EN-DE     | MODEL             | MODEL            | BPE codes | Vocabulary
EN-RO     | MODEL             | MODEL            | BPE codes | Vocabulary

We are also preparing larger models on more language pairs, and will release them in the future.
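
To make the language-embedding remark above concrete, here is a toy NumPy sketch (not the repository's code; dimensions and IDs are arbitrary) of how one shared model can tell the two language streams apart:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_positions, n_langs, dim = 100, 16, 2, 8
token_emb = rng.normal(size=(vocab_size, dim))
pos_emb = rng.normal(size=(n_positions, dim))
lang_emb = rng.normal(size=(n_langs, dim))     # one row per language, e.g. 0 = en, 1 = fr

token_ids = np.array([5, 17, 42])              # a toy sentence
lang_id = 0                                    # the whole stream carries its language ID
x = token_emb[token_ids] + pos_emb[:len(token_ids)] + lang_emb[lang_id]
print(x.shape)                                 # (3, 8): per-token input representations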

Dependencies

Currently we implement MASS for unsupervised NMT based on the codebase of XLM. The dependencies are as follows:

  • Python 3
  • NumPy
  • PyTorch (versions 0.4 and 1.0)
  • fastBPE (for BPE codes)
  • Moses (for tokenization)
  • Apex (for fp16 training)

Data Ready

We use the same BPE codes and vocabulary as XLM. Here we take English-French as an example.

cd MASS

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr

Pre-training:

python train.py                                      \
--exp_name unsupMT_enfr                              \
--data_path ./data/processed/en-fr/                  \
--lgs 'en-fr'                                        \
--mass_steps 'en,fr'                                 \
--encoder_only false                                 \
--emb_dim 1024                                       \
--n_layers 6                                         \
--n_heads 8                                          \
--dropout 0.1                                        \
--attention_dropout 0.1                              \
--gelu_activation true                               \
--tokens_per_batch 3000                              \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000                                  \
--max_epoch 100                                      \
--eval_bleu true                                     \
--word_mass 0.5                                      \
--min_len 5

During the pre-training process, even without any back-translation, you can observe that the model already achieves some initial BLEU scores:

epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu ->  7.81
test_fr-en_mt_bleu  -> 11.72
test_en-fr_mt_bleu  ->  8.80

Distributed Training

To use multiple GPUs, e.g. 3 GPUs on the same node:

export NGPU=3; CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py [...args]

To use multiple GPUs across many nodes, use Slurm to request a multi-node job and launch the above command. The code automatically detects the SLURM_* environment variables to distribute the training.
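
For intuition, the detection works roughly like the sketch below (illustrative only, not the actual XLM/MASS launcher code); Slurm sets these variables for every task of the job:

import os

def slurm_distributed_config():
    """Read standard SLURM_* variables to decide each process's role."""
    if "SLURM_JOB_ID" not in os.environ:
        return None  # not running under Slurm: fall back to torch.distributed.launch
    return {
        "global_rank": int(os.environ["SLURM_PROCID"]),  # rank of this task
        "world_size": int(os.environ["SLURM_NTASKS"]),   # total number of tasks
        "node_id": int(os.environ["SLURM_NODEID"]),      # which node this task is on
    }

print(slurm_distributed_config())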

Fine-tuning

After pre-training, we use back-translation to fine-tune the pre-trained model on unsupervised machine translation (a toy illustration of the back-translation signal follows the command):

MODEL=mass_enfr_1024.pth

python train.py \
  --exp_name unsupMT_enfr                              \
  --data_path ./data/processed/en-fr/                  \
  --lgs 'en-fr'                                        \
  --bt_steps 'en-fr-en,fr-en-fr'                       \
  --encoder_only false                                 \
  --emb_dim 1024                                       \
  --n_layers 6                                         \
  --n_heads 8                                          \
  --dropout 0.1                                        \
  --attention_dropout 0.1                              \
  --gelu_activation true                               \
  --tokens_per_batch 2000                              \
  --batch_size 32	                                     \
  --bptt 256                                           \
  --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
  --epoch_size 200000                                  \
  --max_epoch 30                                       \
  --eval_bleu true                                     \
  --reload_model "$MODEL,$MODEL"
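
For intuition, the back-translation signal works roughly as in the toy example below (the dictionary "models" are trivial stand-ins for illustration, not the repository's API):

def translate(sentence, table):
    """Trivial word-for-word stand-in for a translation model."""
    return [table.get(w, w) for w in sentence]

en2fr = {"the": "le", "cat": "chat"}
fr2en = {"le": "the", "chat": "cat"}

mono_en = ["the", "cat"]
synthetic_fr = translate(mono_en, en2fr)         # step 1: en -> fr, no gradient
reconstruction = translate(synthetic_fr, fr2en)  # step 2: fr -> en

# The fr->en direction is trained so that `reconstruction` matches `mono_en`;
# the symmetric fr -> en -> fr round trip trains the other direction
# (hence --bt_steps 'en-fr-en,fr-en-fr').
assert reconstruction == mono_en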

We also provide a demo of using the MASS pre-trained model on the WMT16 En-Ro bilingual dataset, together with pre-trained and fine-tuned models:

Model    | Ro-En BLEU (with BT)
Baseline | 34.0
XLM      | 38.5
MASS     | 39.1

Download the dataset with the commands below:

wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro

./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro

After downloading the MASS pre-trained model from the link above, use the following command to fine-tune it:

MODEL=mass_enro_1024.pth

python train.py \
	--exp_name unsupMT_enro                              \
	--data_path ./data/processed/en-ro                   \
	--lgs 'en-ro'                                        \
	--bt_steps 'en-ro-en,ro-en-ro'                       \
	--encoder_only false                                 \
	--mt_steps 'en-ro,ro-en'                             \
	--emb_dim 1024                                       \
	--n_layers 6                                         \
	--n_heads 8                                          \
	--dropout 0.1                                        \
	--attention_dropout 0.1                              \
	--gelu_activation true                               \
	--tokens_per_batch 2000                              \
	--batch_size 32                                      \
	--bptt 256                                           \
	--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
	--epoch_size 200000                                  \
	--max_epoch 50                                       \
	--eval_bleu true                                     \
	--reload_model "$MODEL,$MODEL"

Supervised NMT

We also implement MASS on fairseq, in order to support pre-training and fine-tuning for large-scale supervised tasks such as neural machine translation and text summarization. Unsupervised pre-training usually works better on zero-resource or low-resource downstream tasks. However, large-scale supervised NMT has plenty of bilingual data, which poses challenges for conventional unsupervised pre-training. Therefore, we design a new pre-training loss to support large-scale supervised NMT. The code is under MASS-supNMT.

We extend MASS to the supervised setting, where the supervised sentence pair (X, Y) is leveraged for pre-training. The sentence X is masked and fed into the encoder, and the decoder predicts the whole sentence Y. Some discrete tokens in the decoder input are also masked, to encourage the decoder to extract more information from the encoder side.

During pre-training, we combine the original MASS pre-training loss and the new supervised pre-training loss. During fine-tuning, we directly use supervised sentence pairs to fine-tune the pre-trained model. Besides NMT, this pre-training paradigm can also be applied to other supervised sequence to sequence tasks.
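
A hedged sketch of the combined objective described above (the loss functions and the alpha weighting are placeholders for illustration, not the MASS-supNMT implementation):

def combined_pretraining_loss(mono_batches, para_batches,
                              mass_loss, supervised_mass_loss, alpha=1.0):
    # Original MASS loss on monolingual batches (the en-en / zh-zh steps).
    l_mass = sum(mass_loss(batch) for batch in mono_batches)
    # Supervised masked loss on bilingual pairs (the en-zh / zh-en steps):
    # the masked source X goes through the encoder and the decoder predicts
    # the whole target Y, with some decoder-input tokens also masked.
    l_sup = sum(supervised_mass_loss(src, tgt) for src, tgt in para_batches)
    return l_mass + alpha * l_sup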

We release the pre-trained model and example code for pre-training and fine-tuning on WMT Chinese<->English (Zh<->En) translation:

Languages | Pre-trained Model | BPE codes | English-Dict | Chinese-Dict
Zh-En     | MODEL             | CODE      | VOCAB        | VOCAB

Prerequisites

After downloading the repository, install fairseq via pip:

pip install fairseq==0.7.1

Data Ready

We first prepare the monolingual and bilingual sentences for Chinese and English respectively. The data directory looks like:

- data/
  ├─ mono/
  |  ├─ train.en
  |  ├─ train.zh
  |  ├─ valid.en
  |  ├─ valid.zh
  |  ├─ dict.en.txt
  |  └─ dict.zh.txt
  └─ para/
     ├─ train.en
     ├─ train.zh
     ├─ valid.en
     ├─ valid.zh
     ├─ dict.en.txt
     └─ dict.zh.txt

The files under mono are monolingual data, while those under para are bilingual data. The dict.en.txt (and dict.zh.txt) files in the two directories must be identical, although the dictionaries for different languages can differ (a quick check is sketched after the commands below). Running the following commands generates the binarized data:

# Ensure the output directory exists
data_dir=data/
mono_data_dir=$data_dir/mono/
para_data_dir=$data_dir/para/
save_dir=$data_dir/processed/

# set this to the relative path of the MASS user directory on your server
user_dir=mass

mkdir -p $data_dir $save_dir $mono_data_dir $para_data_dir


# Generate Monolingual Data
for lg in en zh
do

  fairseq-preprocess \
  --task cross_lingual_lm \
  --srcdict $mono_data_dir/dict.$lg.txt \
  --only-source \
  --trainpref $mono_data_dir/train --validpref $mono_data_dir/valid \
  --destdir $save_dir \
  --workers 20 \
  --source-lang $lg

  # Since we only have a source language, the output file has a None for the
  # target language. Remove this

  for stage in train valid
  do
    mv $save_dir/$stage.$lg-None.$lg.bin $save_dir/$stage.$lg.bin
    mv $save_dir/$stage.$lg-None.$lg.idx $save_dir/$stage.$lg.idx
  done
done

# Generate Bilingual Data
fairseq-preprocess \
  --user-dir $user_dir \
  --task xmasked_seq2seq \
  --source-lang en --target-lang zh \
  --trainpref $para_data_dir/train --validpref $para_data_dir/valid \
  --destdir $save_dir \
  --srcdict $para_data_dir/dict.en.txt \
  --tgtdict $para_data_dir/dict.zh.txt
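
As a quick sanity check of the dictionary requirement above (paths are illustrative):

import filecmp

for lg in ("en", "zh"):
    same = filecmp.cmp(f"data/mono/dict.{lg}.txt", f"data/para/dict.{lg}.txt", shallow=False)
    print(f"dict.{lg}.txt identical under mono/ and para/: {same}")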

Pre-training

We provide a simple demo script to demonstrate how to run MASS pre-training:

save_dir=checkpoints/mass/pre-training/
user_dir=mass
data_dir=data/processed/

mkdir -p $save_dir

fairseq-train $data_dir \
    --user-dir $user_dir \
    --save-dir $save_dir \
    --task xmasked_seq2seq \
    --source-langs en,zh \
    --target-langs en,zh \
    --langs en,zh \
    --arch xtransformer \
    --mass_steps en-en,zh-zh \
    --memt_steps en-zh,zh-en \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --lr 0.00005 --min-lr 1e-09 \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 \
    --dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
    --max-update 100000 \
    --share-decoder-input-output-embed \
    --valid-lang-pairs en-zh

We also provide a pre-training script which is used for our released model.

Fine-tuning

After the pre-training stage, we fine-tune the model on bilingual sentence pairs:

data_dir=data/processed
save_dir=checkpoints/mass/fine_tune/
user_dir=mass
model=checkpoints/mass/pre-training/checkpoint_last.pt # The path of the pre-trained model

mkdir -p $save_dir

fairseq-train $data_dir \
    --user-dir $user_dir \
    --task xmasked_seq2seq \
    --source-langs zh --target-langs en \
    --langs en,zh \
    --arch xtransformer \
    --mt_steps zh-en \
    --save-dir $save_dir \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --lr-shrink 0.5 --lr 0.00005 --min-lr 1e-09 \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 \
    --max-update 100000 --max-epoch 50 \
    --dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
    --share-decoder-input-output-embed \
    --valid-lang-pairs zh-en \
    --reload_checkpoint $model

We also provide a fine-tuning script which is used for our pre-trained model.

Inference

After the fine-tuning stage, you can generate translation results by using the below script:

model=checkpoints/mass/fine_tune/checkpoint_best.pt
data_dir=data/processed
user_dir=mass

fairseq-generate $data_dir \
    --user-dir $user_dir \
    -s zh -t en \
    --langs en,zh \
    --source-langs zh --target-langs en \
    --mt_steps zh-en \
    --gen-subset valid \
    --task xmasked_seq2seq \
    --path $model \
    --beam 5 --remove-bpe 
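
fairseq-generate prints S-/T-/H- prefixed lines to stdout; if you redirect them to a file (e.g. fairseq-generate ... > gen.out), a small helper like the one below (our own sketch, not part of the repository) collects the hypotheses in the original sentence order:

def read_hypotheses(path):
    """Collect H- lines from fairseq-generate output, keyed by sentence index."""
    hyps = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("H-"):
                idx, _score, text = line.rstrip("\n").split("\t")
                hyps[int(idx[2:])] = text
    return [hyps[i] for i in sorted(hyps)]

# hypotheses = read_hypotheses("gen.out")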

Text Summarization

MASS for text summarization is also implemented on fairseq. The code is under MASS-summarization.

Dependency

pip install torch==1.0.0 
pip install fairseq==0.8.0

MODEL

MASS uses the standard Transformer architecture. We denote L, H, and A as the number of layers, the hidden size, and the number of attention heads, respectively.

Model               | Encoder      | Decoder      | Download
MASS-base-uncased   | 6L-768H-12A  | 6L-768H-12A  | MODEL
MASS-middle-uncased | 6L-1024H-16A | 6L-1024H-16A | MODEL

Results on Abstractive Summarization (12/03/2019)

Dataset        | ROUGE-1 | ROUGE-2 | ROUGE-L
CNN/Daily Mail | 43.05   | 20.02   | 40.08
Gigaword       | 38.93   | 20.20   | 36.20
XSum           | 39.75   | 17.24   | 31.95

Evaluated by files2rouge.

Pipeline for Pre-Training

Download data

Our model is trained on Wikipedia + BookCorpus. Here we use wikitext-103 to demonstrate how to process the data.

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

Tokenize corpus

We use the WordPiece vocabulary (from BERT) to tokenize the original text data directly. We provide a script to process the data; you need to pip install pytorch_transformers first to generate the tokenized data (a sketch of the underlying tokenization follows the loop below).

mkdir -p mono
for SPLIT in train valid test; do 
    python encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs mono/${SPLIT}.txt \
        --workers 60; \
done 
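
For reference, the tokenization that encode.py presumably applies looks roughly like the sketch below (our assumption, based on the pytorch_transformers BertTokenizer; see the script itself for the exact behavior):

from pytorch_transformers import BertTokenizer

# The BERT uncased WordPiece vocabulary mentioned above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

line = "Masked sequence to sequence pre-training for language generation."
pieces = tokenizer.tokenize(line)
print(" ".join(pieces))  # wordpiece tokens, with sub-words prefixed by ##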

Binarized data

wget -c https://modelrelease.blob.core.windows.net/mass/mass-base-uncased.tar.gz
tar -zxvf mass-base-uncased.tar.gz
# Move dict.txt from tar file to the data directory 

fairseq-preprocess \
    --user-dir mass --only-source \
    --trainpref mono/train.txt --validpref mono/valid.txt --testpref mono/test.txt \
    --destdir processed --srcdict dict.txt --workers 60

Pre-training

TOKENS_PER_SAMPLE=512
WARMUP_UPDATES=10000
PEAK_LR=0.0005
TOTAL_UPDATES=125000
MAX_SENTENCES=8
UPDATE_FREQ=16

fairseq-train processed \
    --user-dir mass --task masked_s2s --arch transformer_mass_base \
    --sample-break-mode none \
    --tokens-per-sample $TOKENS_PER_SAMPLE \
    --criterion masked_lm \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --ddp-backend=no_c10d
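
As a back-of-the-envelope check of the batching above (assuming fairseq's usual behavior of multiplying the per-GPU sentence count by --update-freq and the number of GPUs):

MAX_SENTENCES = 8    # sentences per GPU per forward pass
UPDATE_FREQ = 16     # gradient accumulation steps
N_GPUS = 8           # illustrative; adjust to your setup

effective_batch = MAX_SENTENCES * UPDATE_FREQ * N_GPUS
print(effective_batch)  # 1024 sentences per optimizer update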

Pipeline for Fine-tuning (CNN / Daily Mail)

Data

Download, tokenize and truncate the data from this link, and use the above tokenization to generate wordpiece-level data. Rename the suffixes article and title to src and tgt. Assume the tokenized data is under cnndm/para.

fairseq-preprocess \
    --user-dir mass --task masked_s2s \
    --source-lang src --target-lang tgt \
    --trainpref cnndm/para/train --validpref cnndm/para/valid --testpref cnndm/para/test \
    --destdir cnndm/processed --srcdict dict.txt --tgtdict dict.txt \
    --workers 20

dict.txt is included in mass-base-uncased.tar.gz. A copy of binarized data can be obtained from here.

Running

fairseq-train cnndm/processed/ \
    --user-dir mass --task translation_mass --arch transformer_mass_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 8 --max-tokens 4096 \
    --ddp-backend=no_c10d --max-epoch 25 \
    --max-source-positions 512 --max-target-positions 512 \
    --skip-invalid-size-inputs-valid-test \
    --load-from-pretrained-model mass-base-uncased.pt

lr=0.0005 is not necessarily the optimal choice for every task; it was tuned on the dev set (among 1e-4, 2e-4, and 5e-4).

Inference

MODEL=checkpoints/checkpoint_best.pt
fairseq-generate $DATADIR --path $MODEL \
    --user-dir mass --task translation_mass \
    --batch-size 64 --beam 5 --min-len 50 --no-repeat-ngram-size 3 \
    --lenpen 1.0

--min-len is sensitive to the task, and --lenpen needs to be tuned on the dev set.
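
One hedged way to tune --lenpen on the dev set is a small sweep that shells out to the command above (flags reused from this README; paths and the candidate values are illustrative):

import subprocess

MODEL = "checkpoints/checkpoint_best.pt"
DATADIR = "cnndm/processed"  # illustrative path

for lenpen in (0.6, 0.8, 1.0, 1.2):
    out = f"dev.lenpen{lenpen}.out"
    with open(out, "w") as f:
        subprocess.run(
            ["fairseq-generate", DATADIR, "--path", MODEL,
             "--user-dir", "mass", "--task", "translation_mass",
             "--gen-subset", "valid",
             "--batch-size", "64", "--beam", "5",
             "--min-len", "50", "--no-repeat-ngram-size", "3",
             "--lenpen", str(lenpen)],
            stdout=f, check=True)
    # Score each dev.lenpen*.out with files2rouge and keep the best value.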

Reference

If you find MASS useful in your work, you can cite the paper as below:

@inproceedings{song2019mass,
    title={MASS: Masked Sequence to Sequence Pre-training for Language Generation},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    booktitle={International Conference on Machine Learning},
    pages={5926--5936},
    year={2019}
}

Comments
  • How can I reproduce the en-fr results of unsupervised NMT?

    I used the MASS model you provided to fine-tune the en-fr unsupervised system, but the results are more than 1 BLEU lower than the scores reported in the paper.
    My training script is as follows:

    data=/search/odin/mmyin/XLM/data/processed/en-fr
    mass_model=/search/odin/mmyin/MASS/Unsupervised/MASS-EN-FR/mass_enfr_1024.pth
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export NGPU=8

    python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
        --exp_name unsupMT_enfr \
        --exp_id en-fr_1024 \
        --data_path $data \
        --reload_model $mass_model,$mass_model \
        --lgs 'en-fr' \
        --bt_steps 'en-fr-en,fr-en-fr' \
        --encoder_only false \
        --emb_dim 1024 \
        --n_layers 6 \
        --n_heads 8 \
        --dropout 0.1 \
        --attention_dropout 0.1 \
        --gelu_activation true \
        --tokens_per_batch 1000 \
        --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
        --epoch_size 200000 \
        --max_epoch 50 \
        --eval_bleu true \
        --word_mass 0.5 \
        --min_len 5 \
        --stopping_criterion 'valid_en-fr_mt_bleu,10' \
        --validation_metrics 'valid_en-fr_mt_bleu'

    I used beam search for decoding, and beam-size was set to 10. Here are the reproduced scores.

    UNMT                 | en-fr | fr-en
    MASS                 | 37.5  | 34.6 | 28.1 | 35.0
    MASS (our reproduce) | 36.16 | 33.67

    opened by KelleyYin 15
  • How is evaluation done?

    Hey @StillKeepTry: I am curious to know how you got the reference file for evaluation. Did you use the reference file generated by get-data-<xyz>.sh?

    opened by deepaknlp 11
  • Replicating en-fr UNMT with a smaller emb_dim

    I have to use a smaller emb_dim of 512 because my GPU memory is 12.8 GB. As a result, the hidden_dim of the Transformer would be 4 * 512 = 2048. However, the results seem to be much worse than yours.

    My command of pre-training is: python train.py --exp_name unsupMT_enfr --data_path './data/processed/en-fr/' --lgs 'en-fr' --bt_steps 'en-fr-en,fr-en-fr' --mass_steps 'en,fr' --lambda_bt '0:0,10:0' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 3000 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 200000 --max_epoch 100 --eval_bleu true --word_mass '0.5' --min_len 5 --exp_id "ht6lz6ziu1"

    My command of fine-tuning is: python train.py --exp_name unsupMT_enfr --data_path './data/processed/en-fr/' --lgs 'en-fr' --bt_steps 'en-fr-en,fr-en-fr' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 2000 --batch_size 32 --bptt 256 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 200000 --max_epoch 30 --eval_bleu true --reload_model 'mass_enfr_512.pth,mass_enfr_512.pth' --exp_id "w83s8z4nkx"

    Only the emb_dim is changed, but I get the following result:

    pre-train.log:

    2 days, 12:15:39 - log:{"epoch": 99, "valid_fr-en_mt_ppl": 249.58064788484583, "valid_fr-en_mt_acc": 23.941589780961678, "valid_fr-en_mt_bleu": 2.28, "valid_en-fr_mt_ppl": 209.35770915056546, "valid_en-fr_mt_acc": 23.806986740933585, "valid_en-fr_mt_bleu": 2.6, "test_fr-en_mt_ppl": 216.85211556087498, "test_fr-en_mt_acc": 24.96583143507973, "test_fr-en_mt_bleu": 2.53, "test_en-fr_mt_ppl": 180.5567727326117, "test_en-fr_mt_acc": 24.483515007722094, "test_en-fr_mt_bleu": 2.71}

    fine-tune.log:

    2 days, 0:07:54 - log:{"epoch": 29, "valid_fr-en_mt_ppl": 206.69015785463145, "valid_fr-en_mt_acc": 40.49940187275702, "valid_fr-en_mt_bleu": 10.23, "valid_en-fr_mt_ppl": 156.52663532211892, "valid_en-fr_mt_acc": 40.216746382514984, "valid_en-fr_mt_bleu": 10.53, "test_fr-en_mt_ppl": 145.12240020417357, "test_fr-en_mt_acc": 43.62819539357125, "test_fr-en_mt_bleu": 11.71, "test_en-fr_mt_ppl": 109.9388737280739, "test_en-fr_mt_acc": 43.34333102043557, "test_en-fr_mt_bleu": 12.17}

    Do I need to change other hyperparams? Is an emb_dim of 1024 necessary?

    opened by magician-david 10
  • About Fine-tuning for Text Summarization

    Hi,

    Thank you for the great work. Recently I tried fine-tuning based on the pre-trained model (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth).

    I followed the instructions (https://github.com/microsoft/MASS#fine-tuning-2) in the readme file and ran this command on a single GPU machine. After the command finished, I tested the output by running python translate_ensemble.py --exp_name giga_test --src_lang ar --tgt_lang ti --beam 5 --batch_size 1 --model_path ./dumped/mass_summarization/bvk6g6f9xl/checkpoint.pth --output_path ./dumped/mass_summarization/bvk6g6f9xl/output.txt.beam5 < ./data/processed/giga/test.ar-ti.ar. Then I processed the output to remove the BPE mark @@ and tested the ROUGE scores. The ROUGE scores are ROUGE-1 F1=37.2 and ROUGE-2 F1=18.8.

    I think I have missed something important here. Could you please instruct me how to correctly fine-tune the model?

    opened by dcn2020 9
  • File Error When Finetuning Gigawords

    Hi,

    I downloaded the pre-trained monolingual model for text summarization and preprocessed Gigawords using get-data-gigaword.sh. Then I tried to fine-tune it following https://github.com/microsoft/MASS#fine-tuning-2. However, I got a file-related error:

    Traceback (most recent call last):
      File "train.py", line 345, in <module>
        check_data_params(params)
      File "/workspace/MASS/code/MASS/MASS/src/data/loader.py", line 359, in check_data_params
        assert all([all([os.path.isfile(p1) and os.path.isfile(p2) for p1, p2 in paths.values()]) for paths in params.para_dataset.values()])
    AssertionError
    

    I changed the data path to ./data/processed/giga/ since in the get-data-gigaword.sh file the output folder is giga instead of summarization (https://github.com/microsoft/MASS/blob/438be84c5a82b01a63208f4c42073f13a30d88fd/MASS/get-data-gigaword.sh#L46).

    Could you please help with this?

    Thanks.

    opened by magic282 9
  • How to reload checkpoint for UNMT?

    Hello, I am pre-training a model for the UNMT task as per the instructions given. Could you please tell me how to load a checkpoint? For XLM we could simply use --reload_checkpoint; could you tell me how to do the same here? Apparently --reload_checkpoint doesn't work. TIA

    opened by him-mah10 8
  • MASS Fairseq - A request

    Hi, it's exciting that you have added the task to fairseq. One request would be to use fairseq as a library (pip installed) and register the tasks, etc., instead of forking.

    If there is some inflexibility in fairseq that prevents you from doing that, I will try to help fix it in fairseq.

    opened by sai-prasanna 6
  • sentence size keeps exceeding max_tokens limit

    Hi, thank you so much for this library.

    I'm running the provided scripts for MASS-supNMT -- I have prepared the data and successfully ran the generate_enzh_data.sh script. Now, I'm trying to run the pretraining script, run_mass_enzh.sh

    I keep getting an assertion error similar to: AssertionError: sentence at index 5198756 of size 2170 exceeds max_tokens limit of 2048!

    I have gone through my mono and para data and deleted long sentences (over 175 tokens), and this error appeared again. Then, I changed the max tokens limit to 4096, and I got the assertion error:

    AssertionError: sentence at index 5198763 of size 5972 exceeds max_tokens limit of 4096!

    The sentence size has increased to exceed the size of the new limit.

    How do I resolve this issue?

    opened by moyid 5
  • Error during Fine-Tuning

    Hi

    I run into errors when trying the new MASS-summarization code based on fairseq. I've downloaded the uncased base model from https://modelrelease.blob.core.windows.net/mass/mass-base-uncased.tar.gz and the binarized cnndm data from https://modelrelease.blob.core.windows.net/mass/cnndm.tar.gz

    When fine-tuning I immediately end up with the following stack trace. I'm using pytorch 1.2.0 - might this be the problem?

    File "/home/goto/anaconda3/envs/mas/bin/fairseq-train", line 10, in <module>
        sys.exit(cli_main())
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq_cli/train.py", line 321, in cli_main
        main(args)
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq_cli/train.py", line 68, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 126, in load_checkpoint
        epoch_itr = trainer.get_train_iterator(epoch=0)
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/trainer.py", line 216, in get_train_iterator
        epoch=epoch,
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 153, in get_batch_iterator
        epoch=epoch,
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/data/iterators.py", line 150, in __init__
        self.frozen_batches = tuple(batch_sampler)
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/data/data_utils.py", line 216, in batch_by_size
        for idx in indices:
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/data/data_utils.py", line 165, in filter_by_size
        for idx in itr:
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/data/data_utils.py", line 119, in collect_filtered
        if function(el):
      File "/home/goto/anaconda3/envs/mas/lib/python3.7/site-packages/fairseq/data/data_utils.py", line 139, in check_size
        return size_fn(idx) <= max_positions
    TypeError: '<=' not supported between instances of 'tuple' and 'int'

    opened by hokkaido 5
  • Baseline Implementations

    Hi, thanks for sharing this work! Would really appreciate if you could point me out to the baseline implementations for unsupervised NMT -- specifically the BERT+LM and DAE methods with perhaps more details on the experimental parameters for the baseline tasks -- since I couldn't find them in the paper. Thanks.

    opened by bakszero 5
  • MASS-SUM mult-gpu question

    Hi, I am trying MASS-SUM, and I moved those three files into fairseq, but I get another error:

    Traceback (most recent call last):
      File "/home/user/.local/bin/fairseq-train", line 11, in <module>
        load_entry_point('fairseq==0.8.0', 'console_scripts', 'fairseq-train')()
      File "/home/user/.local/lib/python3.7/site-packages/fairseq_cli/train.py", line 288, in cli_main
        parser = options.get_training_parser()
      File "/home/user/.local/lib/python3.7/site-packages/fairseq/options.py", line 22, in get_training_parser
        parser = get_parser('Trainer', default_task)
      File "/home/user/.local/lib/python3.7/site-packages/fairseq/options.py", line 164, in get_parser
        utils.import_user_module(usr_args)
      File "/home/user/.local/lib/python3.7/site-packages/fairseq/utils.py", line 288, in import_user_module
        importlib.import_module(module_name)
      File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/home/user/MASS/MASS-summarization/mass/__init__.py", line 1, in <module>
        from . import masked_s2s
      File "/home/user/MASS/MASS-summarization/mass/masked_s2s.py", line 19, in <module>
        from .masked_dataset import MaskedLanguagePairDataset
    ModuleNotFoundError: No module named 'mass.masked_dataset'

    Since the files were moved into fairseq, the original directory (mass) no longer contains bert_dictionary.py, masked_dataset.py, and learned_positional_embedding.py; maybe that is the reason. But if I keep these three files in the original directory (mass), I get ModuleNotFoundError: No module named 'mass' again. Am I doing something wrong? Any help is appreciated, thanks.

    opened by jimmyhsia 4
  • Does mass implement the translate method?

    I am using the MASS translation model based on fairseq in a Django server. Therefore, in the translation phase, I try to use the translate method, following https://github.com/pytorch/fairseq/tree/main/examples/translation#example-usage-torchhub

    from fairseq.models.xtransformer import XTransformerModel
    zh2en = XTransformerModel.from_pretrained(
        '/work/qymeng5/MASS_test/model',
        checkpoint_file='model.pt',
        data_name_or_path='/work/qymeng5/MASS_test/vocabulary',   # vocabulary path
        bpe='subword_nmt',
        bpe_codes='/work/qymeng5/MASS_test/bpe/all.zh.bpe.codes'  # Chinese BPE codes path
    )

    ff = zh2en.translate('你好 世界')
    print('ff', ff)

    But running this code will report an error. Is it because mass doesn't implement the translate method, or is it my code problem?

    opened by mqy9787 1
  • how can you get the data for  MASS supNMT?

    Following the MASS-supNMT README section on Data Ready: how can I get the dataset for MASS supNMT? Can you provide a download link or a script like get-data-nmt.sh for MASS unsupNMT?

    opened by xiuzhilu 0
  • supNMT pre-train problem with multi gpus

    The pre-training script from supNMT only runs on a single GPU. When I use multiple GPUs to pre-train supNMT, I get the problem below. Has anyone encountered the same situation?

    Traceback (most recent call last):
      File "/search/odin/txguo/anaconda3/envs/mass/bin/fairseq-train", line 8, in <module>
        sys.exit(cli_main())
      File "/search/odin/txguo/anaconda3/envs/mass/lib/python3.6/site-packages/fairseq_cli/train.py", line 298, in cli_main
        nprocs=args.distributed_world_size,
      File "/search/odin/txguo/anaconda3/envs/mass/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
        while not spawn_context.join():
      File "/search/odin/txguo/anaconda3/envs/mass/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 103, in join
        (error_index, name)
    Exception: process 0 terminated with signal SIGKILL

    opened by Andrewlesson 1
  • How does MASS supervised machine translation perform preprocessing?

    Hello, I want to use the MASS model for supervised machine translation tasks (EN-DE), so how do I prepare the data before binarization? For example, what is the monolingual data? How do I apply BPE? How do I tokenize? You only provide a directory layout for the EN-ZH translation. Can you provide a script for the processing?

    Looking forward to your reply, thank you very much! @StillKeepTry

    opened by IdaBetsy 0
  • Translation results on Zh-En pre-trained model

    Dear author,

    The provided pre-trained model produces strange translation results for single/short word inputs, for example:

    hi    --> 1 。 Hi ,你好
    hey   --> 嘿,嘿,嘿,嘿
    dog   --> 1 。一条经过训练的狗
    apple --> 如 : (1) 苹果。

    In that case, the model is not really usable if it fails to translate common words accurately. Is there any way to eliminate this problem via training/fine-tuning?

    opened by riddlehk 0