Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

Overview

ProphetNet-X

  1. This repo provides the code for reproducing the experiments in ProphetNet. In the paper, we propose a new pre-trained language model called ProphetNet for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction.

  2. We have released the ProphetNet baselines for the GLGE benchmark (A New General Language Generation Evaluation Benchmark) here. Have a try! :)

  3. We provide the ProphetNet-X family of models for Chinese (ProphetNet-Zh), multi-lingual (ProphetNet-Multi), English open-domain dialog (ProphetNet-Dialog), Chinese open-domain dialog (ProphetNet-Dialog-Zh), and code generation (ProphetNet-Code). The details are described in the ProphetNet-X paper.

This repo is still under development; feel free to report bugs and we will fix them.

What's new

ProphetNet-X models are released!

Try new ProphetNet pretrained models for Chinese, English Dialog, Chinese Dialog, Multi-lingual, and Code Generation.

The only difference between the ProphetNet-X models is the vocabulary file. Simply modify one model file and you can evaluate your idea with all the pretrained models and finetuning scripts!
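As a rough illustration (the vocab_zh.txt file name below is a hypothetical placeholder, and some ProphetNet-X models ship tokenizers other than BERT word pieces), swapping the vocabulary in the tokenization step might look like this:

from transformers import BertTokenizer

# ProphetNet-En uses the standard bert-base-uncased word-piece vocabulary.
tok_en = BertTokenizer.from_pretrained('bert-base-uncased')

# Hypothetical: point the tokenizer at the vocabulary file shipped with
# another ProphetNet-X checkpoint instead (the path is a placeholder).
tok_zh = BertTokenizer(vocab_file='vocab_zh.txt')

print(tok_en.tokenize("ProphetNet predicts future n-grams."))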

Future updates

  1. ProphetNet pretrained models for bio-medical text.
  2. ProphetNet pretrained models for protein.
  3. New ProphetNet models for long document modeling.
  4. New algorithms for Transformer/ProphetNet to reduce inference latency without hurting the results.
  5. New ProphetNet models for non-auto-regressive generation.
  6. New ProphetNet models for natural language understanding tasks.

Dependency

  • pip install torch==1.3.0
  • pip install fairseq==0.9.0
  • pip install tensorboardX==1.7

Pre-trained Models

We have released the following checkpoints for the pre-trained models described in the ProphetNet-X paper (to appear soon).

ProphetNet-X is based on ProphetNet, which also serves as the ProphetNet-En model.

Recommended Checkpoints:

Expired Checkpoints:

How to use

The procedure includes 1) tokenize, 2) binarize, 3) fine-tune, and 4) inference.
ProphetNet is implemented on top of Fairseq; you can refer to the Fairseq Manual for details.

For all the ProphetNet-X models, the only difference is the dictionary, which means a different tokenizer should be used for each model.

We take ProphetNet-En as an example:

Tokenize. Prepare your train.src, train.tgt, valid, and test sets. The input and output of each sample are placed on one line of the .src and .tgt files, respectively.
Use the bert-uncased tokenizer to tokenize your data into word pieces.

from transformers import BertTokenizer


def bert_uncased_tokenize(fin, fout):
    # Tokenize each line of `fin` into BERT word pieces and write them,
    # space-separated, to `fout`.
    tok = BertTokenizer.from_pretrained('bert-base-uncased')
    with open(fin, 'r', encoding='utf-8') as f_in, open(fout, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            word_pieces = tok.tokenize(line.strip())
            f_out.write('{}\n'.format(' '.join(word_pieces)))


bert_uncased_tokenize('train.src', 'tokenized_train.src')
bert_uncased_tokenize('train.tgt', 'tokenized_train.tgt')
bert_uncased_tokenize('valid.src', 'tokenized_valid.src')
bert_uncased_tokenize('valid.tgt', 'tokenized_valid.tgt')
bert_uncased_tokenize('test.src', 'tokenized_test.src')
bert_uncased_tokenize('test.tgt', 'tokenized_test.tgt')

Binarize it with fairseq-preprocess.

fairseq-preprocess \
--user-dir prophetnet \
--task translation_prophetnet \
--source-lang src --target-lang tgt \
--trainpref tokenized_train --validpref tokenized_valid --testpref tokenized_test \
--destdir processed --srcdict vocab.txt --tgtdict vocab.txt \
--workers 20

Fine-tune with fairseq-train.
--disable-ngram-loss: only keep the loss of the first (next) future token.
--ngram: number of future tokens to predict. The provided pretrained checkpoints predict 2 future tokens, so you should set it to 2 for consistency.
If your device does not support float16, remove --fp16. An illustrative sketch of how this objective combines the per-token losses follows the training command below.

DATA_DIR=processed
USER_DIR=./prophetnet
ARCH=ngram_transformer_prophet_large
CRITERION=ngram_language_loss
SAVE_DIR=./model
TENSORBOARD_LOGDIR=./logs
PRETRAINED_MODEL=pretrained_checkpoints/prophetnet_en.pt

fairseq-train \
--fp16 \
--user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
--optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 \
--lr 0.00001 --min-lr 1e-09 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 1000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--criterion $CRITERION --label-smoothing 0.1 \
--update-freq 1  --max-tokens 1400 --max-sentences 7 \
--num-workers 4 \
--load-from-pretrained-model $PRETRAINED_MODEL \
--ddp-backend=no_c10d --max-epoch 10 \
--max-source-positions 512 --max-target-positions 512 \
--skip-invalid-size-inputs-valid-test \
--save-dir $SAVE_DIR \
--keep-last-epochs 10 \
--tensorboard-logdir $TENSORBOARD_LOGDIR \
$DATA_DIR
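For intuition, here is a minimal, illustrative sketch of how a future n-gram objective with --ngram 2 could combine the per-stream losses. It is not the ngram_language_loss criterion itself (which also handles label smoothing and masking inside fairseq), and all names below are assumptions for illustration.

import torch.nn.functional as F


def future_ngram_loss(stream_logits, stream_targets, disable_ngram_loss=False):
    # stream_logits[i] / stream_targets[i]: logits and gold tokens for the
    # (i+1)-th future token; stream 0 is ordinary next-token prediction.
    losses = []
    for i, (logits, target) in enumerate(zip(stream_logits, stream_targets)):
        if disable_ngram_loss and i > 0:
            break  # --disable-ngram-loss keeps only the next-token stream
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      target.reshape(-1)))
    return sum(losses) / len(losses)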

Inference with fairseq-generate to generate targets for the processed test files. Alternatively, you can use fairseq-interactive to generate answers for typed-in text (which should also be tokenized).

BEAM=5
LENPEN=1.5
CHECK_POINT=./model/checkpoint5.pt
TEMP_FILE=fairseq_outputs.txt
OUTPUT_FILE=sorted_outputs.txt

fairseq-generate processed --path $CHECK_POINT --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --no-repeat-ngram-size 3 --lenpen $LENPEN 2>&1 > $TEMP_FILE
grep ^H $TEMP_FILE | cut -c 3- | sort -n | cut -f3- | sed "s/ ##//g" > $OUTPUT_FILE
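If you prefer Python, the shell post-processing above can be mirrored with a short sketch; it assumes the standard fairseq-generate hypothesis lines of the form H-<id><TAB><score><TAB><tokens> and is only meant as a reference, not part of the official pipeline.

def sort_and_merge_wordpieces(temp_file, output_file):
    # Keep hypothesis lines ("H-<id>\t<score>\t<tokens>"), sort them by sample
    # id, and merge BERT word pieces ("wo ##rd" -> "word") back into words.
    hyps = []
    with open(temp_file, encoding='utf-8') as f_in:
        for line in f_in:
            if line.startswith('H-'):
                sample_id, _score, tokens = line.rstrip('\n').split('\t', 2)
                hyps.append((int(sample_id[2:]), tokens.replace(' ##', '')))
    with open(output_file, 'w', encoding='utf-8') as f_out:
        for _, text in sorted(hyps):
            f_out.write(text + '\n')


sort_and_merge_wordpieces('fairseq_outputs.txt', 'sorted_outputs.txt')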

TIPS:

If you meet problems running fairseq-preprocess, fairseq-train, or other commands, or if you want to modify the workflow/inference pipeline, a good choice is to download the fairseq git repo, check out v0.9.0, and merge our code.
Then modify their preprocess.py, train.py, or generate.py to run your new pipeline.

Repo Reference

This repo partially builds on Fairseq v0.9.0 and MASS.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@inproceedings{qi2020prophetnet,
  title={ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training},
  author={Qi, Weizhen and Yan, Yu and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings},
  pages={2401--2410},
  year={2020}
}
@article{qi2021prophetnet,
  title={ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation},
  author={Qi, Weizhen and Gong, Yeyun and Yan, Yu and Xu, Can and Yao, Bolun and Zhou, Bartuer and Cheng, Biao and Jiang, Daxin and Chen, Jiusheng and Zhang, Ruofei and others},
  journal={arXiv preprint arXiv:2104.08006},
  year={2021}
}

Microsoft Open Source Code of Conduct

Comments
  • Abstractive Summarization using ProphetNet

    Abstractive Summarization using ProphetNet

    I'm following these steps to summarize my document -

    1. download the CNN/DM fine-tuned checkpoint
    2. preprocess your text with BERT-tokenization, and you can refer to our preprocess scripts
    3. use fairseq-generate or fairseq-interactive to generate summarization for your given text. For fairseq-generate, you can refer to our generate scripts. For fairseq-interactive, you can easily generate summarization for a typed-in text interactively. Detailed instructions can be found in fairseq manual

    What is the --task argument for summarization?

    Also, would this be sufficient if my processed input is in 2.txt?

    fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task summarization_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

    opened by harshithbelagur 10
  • KeyError: "best loss", when loading checkpoint as Fairseq Model

    KeyError: "best loss", when loading checkpoint as Fairseq Model

    Hi guys,

    Thank you for the incredible work.

    I tried to load this model from the larger checkpoint in the following manner:

    from fairseq.models.transformer import TransformerModel
    
    model = TransformerModel.from_pretrained(model_name_or_path=MODEL_DIR,  \
                                             checkpoint_file='prophetnet_large_pretrained_160G_14epoch_model.pt')
    

    but was presented with a key error:

    KeyError                                  Traceback (most recent call last)
    <ipython-input-13-782ea15f21fd> in <module>()
          1 MODEL_DIR = '/content/drive/My Drive/src/models/'
    ----> 2 model = TransformerModel.from_pretrained(model_name_or_path=MODEL_DIR,                                         checkpoint_file='prophetnet_large_pretrained_160G_14epoch_model.pt')
    
    4 frames
    /usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py in _upgrade_state_dict(state)
        298     if "optimizer_history" not in state:
        299         state["optimizer_history"] = [
    --> 300             {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
        301         ]
        302         state["last_optimizer_state"] = state["optimizer"]
    
    KeyError: 'best_loss'
    

    Versions fairseq==0.9.0 torch==1.4.0

    Any advice on how to proceed would be greatly appreciated, I wish to load ProphetNet into a fairseq model so I can adapt the architecture to a custom task.

    opened by chrisdoyleIE 7
  • Increasing --max-source-positions --max-target-positions

    Increasing --max-source-positions --max-target-positions

    Hi again,

    I was finetuning some data with --max-source-positions 1024 --max-target-positions 1024.

    But it paused at epoch 001 (8%) and showed: WARNING: overflow detected, setting loss scale to: 64.0. Is there any upper limit for --max-source-positions and --max-target-positions?

    I am training with 4 Tesla T4 GPUs.

    Please help.

    opened by ShoubhikBanerjee 6
  • Is it possible to run prophetnet on 11G memory GPUs?

    Is it possible to run prophetnet on 11G memory GPUs?

    I tried to run ProphetNet on a 2080 Ti (11G memory) with the question generation task. However, even if I set max-sentences to 1, it still runs out of memory. So I wonder whether it is possible to run this model on an 11G-memory GPU, because it has a similar structure and size to other pretrained models like BERT and UniLM, which I can run on 11G-memory GPUs.

    opened by Brandonnogithub 5
  • Provide generated outputs

    Provide generated outputs

    Hi all, Thanks for sharing the code and models. Is it possible to directly provide the generated outputs of the model? I am specifically interested in the summarization task and would like to just have the outputs instead of decoding them myself using the pretrained model. I understand Gigaword might be subject to license issues, but the CNN/DailyMail outputs would suffice.

    Thanks!

    opened by shahbazsyed 5
  • RuntimeError: unexpected EOF. Corrupted File?

    RuntimeError: unexpected EOF. Corrupted File?

    Hello,

    I performed the following:

    1. Clone prophetnet repository
    2. Installed torch and fairseq
    3. Download ProphetNet-large-160GB pre-trained model
    4. Download CNN/DM data
    5. Preprocess CNN/DM data via preprocess_cnn_dm.py
    6. Use fairseq-preprocess to generate binaries

    When I run fairseq-train or fairseq-generate for inference, I get the following errors:

    Train

    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-train", line 11, in <module>
        sys.exit(cli_main())
      File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 333, in cli_main
        main(args)
      File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/train.py", line 51, in main
        model = task.build_model(args)
      File "/usr/local/lib/python3.6/dist-packages/fairseq/tasks/fairseq_task.py", line 185, in build_model
        return models.build_model(args, self)
      File "/usr/local/lib/python3.6/dist-packages/fairseq/models/__init__.py", line 48, in build_model
        return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
      File "/workspace/ProphetNet/src/prophetnet/ngram_s2s_model.py", line 147, in build_model
        states = torch.load(args.load_from_pretrained_model, map_location='cpu')
      File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 529, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 709, in _legacy_load
        deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
    RuntimeError: unexpected EOF, expected 1092436 more bytes. The file might be corrupted.
    

    Inference

    Traceback (most recent call last):  File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 151, in load_checkpoint_to_cpu    from fairseq.fb_pathmgr import fb_pathmgr
    ModuleNotFoundError: No module named 'fairseq.fb_pathmgr'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-generate", line 11, in <module>
        sys.exit(cli_main())
      File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 199, in cli_main
        main(args)
      File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 47, in main
        task=task,
      File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 179, in load_model_ensemble
        ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
      File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 190, in load_model_ensemble_and_task
        state = load_checkpoint_to_cpu(filename, arg_overrides)
      File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 160, in load_checkpoint_to_cpu
        path, map_location=lambda s, l: default_restore_location(s, "cpu")
      File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 529, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 709, in _legacy_load
        deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
    RuntimeError: unexpected EOF, expected 5239485 more bytes. The file might be corrupted.
    

    Inputs:

    Train

    fairseq-train \
    --fp16 \
    --user-dir ./prophetnet --task translation_prophetnet --arch ngram_transformer_prophet_large \
    --optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 \
    --lr 0.0001 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 1000 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --criterion ngram_language_loss --label-smoothing 0.1 \
    --update-freq 32  --max-sentences 2 \
    --num-workers 4 \
    --load-from-pretrained-model ../prophetnet_large_pretrained_160G_14epoch_model.pt \
    --load-sep \
    --ddp-backend=no_c10d --max-epoch 10 \
    --max-source-positions 512 --max-target-positions 512 \
    --skip-invalid-size-inputs-valid-test \
    --seed 1 \
    --save-dir ./cnndm/finetune_cnndm_checkpoints \
    --keep-last-epochs 10 \
    --tensorboard-logdir ./cnndm/finetune_cnndm_tensorboard \
    ./cnndm/processed
    

    Inference

    fairseq-generate \
    ./cnndm/processed \
    --path ../prophetnet_large_pretrained_16G_64epoch_model.pt \
    --user-dir prophetnet \
    --task translation_prophetnet \
    --batch-size 32 \
    --gen-subset test \
    --beam 5 \
    --num-workers 4 \
    --min-len 45 \
    --max-len-b 110 \
    --no-repeat-ngram-size 3 --lenpen 1.2 2>&1 > ../logs.output
    

    Any idea how to handle this? Thank you.

    opened by gouldju1 4
  • Assertion Error in fine-tuning of Gigaword

    Assertion Error in fine-tuning of Gigaword

    Hi, thank you for distributing your code! I tried to fine-tune the pre-trained ProphetNet (160G) on the English Gigaword summarization dataset. I performed the pre-processing described in the README and then tried fine-tuning, but faced the following assertion error:

      File "~/anaconda3/envs/py36pytorch14/bin/fairseq-train", line 8, in <module>
        sys.exit(cli_main())
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq_cli/train.py", line 333, in cli_main
        main(args)
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq_cli/train.py", line 86, in main
        train(args, trainer, task, epoch_itr)
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq_cli/train.py", line 126, in train
        for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/tqdm/std.py", line 1127, in __iter__
        for obj in iterable:
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/iterators.py", line 314, in __next__
        chunk.append(next(self.itr))
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/iterators.py", line 43, in __next__
        return next(self.itr)
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/iterators.py", line 36, in __iter__
        for x in self.iterable:
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
        data = self._next_data()
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
        return self._process_data(data)
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
        data.reraise()
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
        raise self.exc_type(msg)
    AssertionError: Caught AssertionError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
        data = fetcher.fetch(index)
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/language_pair_dataset.py", line 252, in collater
        input_feeding=self.input_feeding,
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/language_pair_dataset.py", line 69, in collate
        move_eos_to_beginning=True,
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/language_pair_dataset.py", line 22, in merge
        pad_idx, eos_idx, left_pad, move_eos_to_beginning,
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/data_utils.py", line 44, in collate_tokens
        copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
      File "~/anaconda3/envs/py36pytorch14/lib/python3.6/site-packages/fairseq/data/data_utils.py", line 37, in copy_tensor
        assert src[-1] == eos_idx
    AssertionError
    

    pytorch version == 1.4.0

    fairseq version == 0.9.0

    In addition, when I tried to train the original Transformer (--arch transformer_wmt_en_de) with label_smoothed_cross_entropy, the training succeeded.

    Do you have any idea how to solve the above error?

    opened by takase 2
  • Train new model

    Train new model

    Hi, thanks for your awesome model. Could I ask how to train a whole new model for a specific task in another language, such as summarizing Vietnamese articles?

    Would you mind providing some instructions on that?

    Edited: I have successfully fine-tuned the pretrained ProphetNet-X model to create a Vietnamese model. However, I also want to create a new model from scratch.

    @qiweizhen @yuyan2do @dayihengliu

    opened by stoicity 1
  • Evaluating causes a "could not infer language pair" error on the pretrained cnndm model

    Evaluating causes a "could not infer language pair" error on the pretrained cnndm model

    fairseq-generate cnndm/processed --path /e/workspace/ProphetNet/a.pt --user-dir prophetnet --task translation_prophetnet --batch-size 32 --gen-subset test --beam 5 --num-workers 4 --min-len 45 --max-len-b 110  --no-repeat-ngram-size 3 --lenpen 1.2 2>&1 > cnndm/output-ck9-pelt1.2-test-beam5.txt
    

    Using the above command for inference and evaluation causes an error with the pre-trained model for CNN/DailyMail:

    Traceback (most recent call last):
      File "D:\windows_program\conda\envs\p\Scripts\fairseq-generate-script.py", line 33, in <module>
        sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
      File "e:\fairseq\fairseq_cli\generate.py", line 270, in cli_main
        main(args)
      File "e:\fairseq\fairseq_cli\generate.py", line 36, in main
        return _main(args, sys.stdout)
      File "e:\fairseq\fairseq_cli\generate.py", line 57, in _main
        task = tasks.setup_task(args)
      File "e:\fairseq\fairseq\tasks\__init__.py", line 17, in setup_task
        return TASK_REGISTRY[args.task].setup_task(args, **kwargs)
      File "e:\fairseq\fairseq\tasks\translation.py", line 226, in setup_task
        raise Exception('Could not infer language pair, please provide it explicitly')
    Exception: Could not infer language pair, please provide it explicitly
    

    But in the docs there are no such arguments for fairseq-generate.

    fairseq 0.9.0, torch 1.5.1, model prophetnet_large_160G_cnndm_model.pt

    opened by amiyamandal-dev 1
  • fixed error in tokenization code for other datasets in readme

    fixed error in tokenization code for other datasets in readme

    In the code sample in the README that helps users tokenize their own data, there is a variable new used that is not defined in the scope of the function. I believe the intended variable is word_pieces.

    opened by ManavR123 1
  • Loading .pt into fairseq model for customisation

    Loading .pt into fairseq model for customisation

    Hi guys,

    really incredible work, thank you.

    May I please ask how to load the available checkpoints into a fairseq model, so that one can build upon your architecture?

    Specifically, the "bpe" and "bpe_codes" arguments as below are what I'm trying to identify.

    (screenshot omitted)
    opened by chrisdoyleIE 1
  • Selecting additional scoring methods for fine-tuning

    Selecting additional scoring methods for fine-tuning

    We have started to fine-tune the ProphetNet model on a custom dataset, using fairseq v0.9.0. Currently, only perplexity is supported during training; however, we would also like to validate the trained model on the BLEU-4, METEOR, and ROUGE metrics. Can anyone provide any insights on this? The "--scoring" parameter is not supported in fairseq v0.9.

    opened by mtsourma 0
  • fix-bug: fix attn transpose bug

    fix-bug: fix attn transpose bug

    Hi, I seem to have found a bug in the code.

    In the extract_features function of NgramTransformerDecoder, a transpose operation is applied to attn, which is an output of NgramTransformerDecoderLayer. The code snippet is as follows:

    class NgramTransformerDecoder(FairseqIncrementalDecoder):
        def extract_features(self, prev_output_tokens, encoder_out=None, incremental_state=None, **unused):
            # ......
            # decoder layers
            for layer in self.layers:
                x, attn = layer(
                    x,
                    encoder_out['encoder_out'] if encoder_out is not None else None,
                    encoder_out['encoder_padding_mask'] if encoder_out is not None else None,
                    incremental_state,
                    self_attn_mask=self_attn_mask,
                    ngram_mask_matrix=ngram_mask_matrix,
                    i_buckets_main_stream=i_buckets_main_stream,
                    i_bucket_relative_stream=i_bucket_relative_stream,
                    real_positions=real_positions
                )
                inner_states.append(x)
            # TODO [(1+ngram)*T, B, C] -> [B, (1+ngram)*T, C]
            x_list = x.transpose(0, 1).chunk(1 + self.ngram, 1)
            if attn is not None:
                attn_list = attn.transpose(0, 1).chunk(1 + self.ngram, 1)
            else:
                attn_list = None
    
            return x_list, {'attn': attn_list}
    

    As can be seen from the code comments, its purpose is to change the dims from [(1+ngram)*T, B, C] to [B, (1+ngram)*T, C]. The variable attn, from NgramTransformerDecoderLayer, is the second result returned by its encoder_attn (fairseq.modules.MultiheadAttention).

    In fairseq v0.9.0, the code snippet of MultiheadAttention's forward function is as follows:

    class MultiheadAttention(nn.Module):
        def forward(
            self,
            # ...
        ):
            # ......
            if need_weights:
                attn_weights = attn_weights_float.view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
                if not need_head_weights:
                    # average attention weights over heads
                    attn_weights = attn_weights.mean(dim=0)
            else:
                attn_weights = None
    
            return attn, attn_weights
    

    It can be seen that the second result of the forward function (attn_weights) originally has the shape (bsz, self.num_heads, tgt_len, src_len). After the transpose and mean operations, it has the shape (bsz, tgt_len, src_len), which is the actual shape of attn in extract_features, rather than the [(1+ngram)*T, B, C] described in the comment. By the way, the shape and transpose of x in extract_features are correct, and attn is not actually used during training or inference, which I guess is why this has not been found for 2 years.

    But anyone who, like me, wants to make some modifications and needs to use the variable attn will find that it has a confusing shape caused by the transpose operation, and it did take me some time to find the bug.
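    A minimal sketch of one possible fix (not necessarily the exact patch in this PR; it assumes attn arrives as [bsz, (1+ngram)*T, src_len], as described above):

    # In NgramTransformerDecoder.extract_features: x still needs the
    # [(1+ngram)*T, B, C] -> [B, (1+ngram)*T, C] transpose, but attn is already
    # batch-first, so it only needs to be chunked along the target dimension.
    x_list = x.transpose(0, 1).chunk(1 + self.ngram, 1)
    if attn is not None:
        attn_list = attn.chunk(1 + self.ngram, 1)
    else:
        attn_list = None

    return x_list, {'attn': attn_list}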

    I hope the PR can be merged.

    opened by tqnwhz 1
  • About ProphetNet-Dialog-En in dialogue dataset

    About ProphetNet-Dialog-En in dialogue dataset

    I want to know whether the PersonaChat dataset is used under the few-shot setting during model tuning, but the data preprocessing code seems to use all the files. Thanks.

    opened by xiang-xiang-zhu 0
  • Wrong Tokenization in SquadQG Evaluation Scripts

    Wrong Tokenization in SquadQG Evaluation Scripts

    Thanks for the great work.

    I am reproducing the results reported in GLGE but find that the SquadQG evaluation script seems to use the wrong tokenization.

    In /script/evaluate/qg/eval_on_unilm_qg.py, the generated text is post-processed by fix_tokenization:

    https://github.com/microsoft/ProphetNet/blob/0a1b59cb95783319b7b58ede65b768587dc49daf/GLGE_baselines/script/script/evaluate/qg/eval_on_unilm_qg.py#L40-L117

    For example, it turns . . . to ..., " to '', 1 , 000 to 1,000.

    However, the original data does not look like the sentences produced by fix_tokenization. Here are some samples from the test set:

    What did Harff define as " short - lived outbursts by mobs . . . ? "
    Who sang " Girls Love Beyoncé " in 2013 ?
    What city in Montana has over 100 , 000 people ?
    

    Moreover, I reproduced MASS-base and found that the results are higher if we disable fix_tokenization:

    |                                               | BLEU  | METEOR | ROUGE-L |
    |-----------------------------------------------|-------|--------|---------|
    | MASS-base reported in GLGE                    | 20.1  | 24.4   | 49.4    |
    | MASS-base reproduce with fix_tokenization     | 20.69 | 24.92  | 49.21   |
    | MASS-base reproduce without fix_tokenization  | 22.54 | 25.03  | 50.27   |

    I wonder whether I am missing something or the reported results use the wrong tokenization. I also hope that, if possible, the model outputs can be released to support fair and detailed comparisons.

    Looking forward to your reply

    opened by hzhwcmhf 0
  • KeyError during inference with dialog-en model

    KeyError during inference with dialog-en model

    Hi,

    Using the fairseq CLI, I ran the preprocessing on the test files only to generate binaries, and then tried running inference with the prophetnet-dialog-en model.

    Here is my code:

    fairseq-preprocess \
    --user-dir prophetnet \
    --task translation_prophetnet \
    --source-lang src --target-lang tgt \
    --testpref tokenized_test \
    --destdir processed --srcdict vocab.txt --tgtdict vocab.txt \
    --workers 20

    BEAM=5
    LENPEN=1.5
    CHECK_POINT=prophetnet-dialog-en.pt
    TEMP_FILE=fairseq_outputs.txt
    OUTPUT_FILE=sorted_outputs.txt

    fairseq-generate processed --path $CHECK_POINT --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --no-repeat-ngram-size 3 --lenpen $LENPEN 2>&1 > $TEMP_FILE
    grep ^H $TEMP_FILE | cut -c 3- | sort -n | cut -f3- | sed "s/ ##//g" > $OUTPUT_FILE

    I got the following error. Would appreciate any advice on this. Thank you!

    /usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py in _upgrade_state_dict(state)
    --> 300     {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
    KeyError: 'best_loss'

    opened by raviteja5 1