Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

Overview

PLBART

Code pre-release of our work, Unified Pre-training for Program Understanding and Generation, accepted at NAACL 2021.

Note: Detailed documentation is coming soon.

Pre-training data

PLBART is pre-trained on Java and Python functions and natural language descriptions collected from GitHub and Stack Overflow.
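
For illustration, here is a minimal sketch (not the repository's own pipeline) of how a single function could be segmented into subword pieces with SentencePiece before fairseq binarization, assuming the released sentencepiece.bpe.model is available in the working directory:

    import sentencepiece as spm

    # Assumption: sentencepiece.bpe.model is the model shipped with this repository.
    sp = spm.SentencePieceProcessor()
    sp.Load("sentencepiece.bpe.model")

    function = "public int add(int a, int b) { return a + b; }"
    pieces = sp.EncodeAsPieces(function)
    print(" ".join(pieces))  # space-joined subword tokens, ready for fairseq-preprocess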

Evaluation tasks

We evaluated PLBART on five tasks.

  • Code summarization [REF]
  • Code generation [REF]
  • Code translation [REF]
  • Clone detection [REF]
  • Vulnerability detection [REF]

Notes

  • We will publish the pretrained PLBART checkpoint soon.
  • We list all the files in this repository here.

Acknowledgement

PLBART uses Fairseq, CodeXGLUE, and TransCoder; we thank the authors of these works for their contributions.

Citation

@inproceedings{ahmad2020summarization,
    author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
    booktitle = {Proceedings of the 2021 Conference of the North {A}merican Chapter of the Association for Computational Linguistics},
    title = {Unified Pre-training for Program Understanding and Generation},
    year = {2021}
}
Comments
  • Multilingual `prepare.sh` throws an error after downloading

    While running prepare.sh, the following errors are thrown for all the languages in the multilingual directory:

    FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/valid.php-en_XX.php'
    Traceback (most recent call last):
      File "encode.py", line 92, in <module>
        main()
      File "encode.py", line 88, in main
        process(args)
      File "encode.py", line 49, in process
        with open(args.input_source, 'r', encoding='utf-8') as f1, \
    FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/test.php-en_XX.php'
    

    Any help on this?

    bug 
    opened by gchhablani 8
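
    As a side note, here is a minimal, hypothetical sketch (not code from this repository) of a guard that could wrap the open() call in encode.py so that a missing processed split fails with a more actionable message than the bare FileNotFoundError above:

    import os
    import sys

    def open_checked(path):
        # Fail early with a hint instead of a bare FileNotFoundError.
        if not os.path.exists(path):
            sys.exit(f"Missing processed file '{path}'. "
                     "Re-run the download/extraction steps in prepare.sh before encoding.")
        return open(path, 'r', encoding='utf-8')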
  • questions about dict.txt and how data samples are specified

    1. Which step generates the dict.txt? It seems to be generated during "fairseq-preprocess", but "fairseq-preprocess" also has a parameter "--srcdict $DICT_FILE".
    2. How does the model know where a data sample (a function in this case) ends? It seems that you use "\n", but functions also contain "\n", so I am confused about this.
    3. Also, will fairseq-train work if I pass FILENAME_OF_FIRST_DATASAMPLE:FILENAME_OF_SECOND_DATASAMPLE:FILENAME_OF_THIRD_DATASAMPLE:.....:FILENAME_OF_NTH_DATASAMPLE?

    I am new to this. Thanks.

    opened by freexxxyyy 6
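
    Regarding question 2 above, a common convention in fairseq-style preprocessing (a sketch of the general idea, not necessarily the exact code this repository uses) is to serialize each function onto a single line, so that "\n" only ever separates samples:

    import re

    def serialize_function(code: str) -> str:
        # Collapse all internal whitespace, including newlines, onto one line,
        # leaving "\n" free to act purely as the sample delimiter.
        return re.sub(r"\s+", " ", code).strip()

    functions = [
        "def add(a, b):\n    return a + b\n",
        "def sub(a, b):\n    return a - b\n",
    ]

    with open("train.functions.python", "w", encoding="utf-8") as f:  # hypothetical output file
        for fn in functions:
            f.write(serialize_function(fn) + "\n")  # one sample per line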
  • Mismatch when loading the checkpoints

    Hi, thanks for your great work!

    When I tried to load the pre-trained checkpoints and fine-tune, I came across a size mismatch problem. It seems that the dict.txt you provided does not match the checkpoints.

    Here is the error message:

    size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).
    size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).

    This is the script I used to get the checkpoints: https://github.com/wasiahmad/PLBART/blob/main/pretrain/download.sh

    This is the dict.txt I used: https://github.com/wasiahmad/PLBART/blob/main/sentencepiece/dict.txt

    Here is the command I used to fine-tune:

    fairseq-train $PATH_2_DATA \
        --user-dir $USER_DIR --truncate-source \
        --arch mbart_base --layernorm-embedding \
        --task translation \
        --source-lang $SOURCE --target-lang $TARGET \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --batch-size $BATCH_SIZE --update-freq $UPDATE_FREQ --max-epoch 30 \
        --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
        --lr-scheduler polynomial_decay --lr 5e-05 --min-lr -1 \
        --warmup-updates 500 --max-update 100000 \
        --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.0 \
        --seed 1234 --log-format json --log-interval 100 \
        ${restore} \
        --eval-bleu --eval-bleu-detok space --eval-tokenized-bleu \
        --eval-bleu-remove-bpe sentencepiece --eval-bleu-args '{"beam": 5}' \
        --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
        --no-epoch-checkpoints --patience 5 \
        --ddp-backend no_c10d --save-dir $SAVE_DIR 2>&1 | tee ${OUTPUT_FILE};

    question 
    opened by JiyangZhang 6
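
    A small, hypothetical diagnostic for a mismatch like the one above: compare the vocabulary size stored in the checkpoint with the number of entries in dict.txt, keeping in mind that fairseq adds four special symbols (<s>, <pad>, </s>, <unk>) on top of the dictionary file and that tasks may add further symbols such as language tags or <mask>. This assumes the checkpoint keeps its weights under the 'model' key, as fairseq checkpoints normally do; the file names are placeholders:

    import torch

    ckpt = torch.load("plbart_base.pt", map_location="cpu")  # placeholder checkpoint path
    emb = ckpt["model"]["encoder.embed_tokens.weight"]
    print("vocabulary size in checkpoint:", emb.shape[0])

    with open("dict.txt", encoding="utf-8") as f:
        n_dict = sum(1 for _ in f)
    print("dict.txt entries + 4 fairseq specials:", n_dict + 4)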
  • AssertionError while evaluating 'translation'

    Hi @wasiahmad ,

    I am trying out the 'translation' capabilities of PLBART and started fine-tuning as described, but I'm getting the error below during evaluation:

    File "calc_code_bleu.py", line 34, in <module>
        assert len(hypothesis) == len(pre_references[i])
    AssertionError
    

    Here is a bit detailed traceback -

    2021-07-15 13:57:58 | INFO | fairseq_cli.train | early stop since valid performance hasn't improved for last 10 runs
    2021-07-15 13:57:58 | INFO | fairseq_cli.train | begin save checkpoint
    2021-07-15 13:58:19 | INFO | fairseq.checkpoint_utils | saved checkpoint /content/PLBART/scripts/code_to_code/translation/java_cs/checkpoint_last.pt (epoch 22 @ 14168 updates, score 80.08) (writing took 20.417050701000335 seconds)
    2021-07-15 13:58:19 | INFO | fairseq_cli.train | end of epoch 22 (average epoch stats below)
    2021-07-15 13:58:19 | INFO | train | {"epoch": 22, "train_loss": "2.08", "train_nll_loss": "0.177", "train_ppl": "1.13", "train_wps": "1562.9", "train_ups": "1.76", "train_wpb": "890.1", "train_bsz": "16", "train_num_updates": "14168", "train_lr": "4.93409e-05", "train_gnorm": "0.534", "train_train_wall": "255", "train_wall": "8414"}
    2021-07-15 13:58:19 | INFO | fairseq_cli.train | done training in 8412.8 seconds
      0% 0/250 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py:172: UserWarning: --sacrebleu is deprecated. Please use --scoring sacrebleu instead.
      scorer = scoring.build_scorer(args, tgt_dict)
    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-generate", line 8, in <module>
        sys.exit(cli_main())
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 379, in cli_main
        main(args)
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 41, in main
        return _main(args, sys.stdout)
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 172, in _main
        scorer = scoring.build_scorer(args, tgt_dict)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/scoring/__init__.py", line 54, in build_scorer
        return _build_scorer(args)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/registry.py", line 54, in build_x
        return builder(args, *extra_args, **extra_kwargs)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/scoring/bleu.py", line 40, in __init__
        character_tokenization=self.args.sacrebleu_char_level,
    AttributeError: 'Namespace' object has no attribute 'sacrebleu_char_level'
    Traceback (most recent call last):
      File "/content/PLBART/evaluation/evaluator.py", line 36, in <module>
        main()
      File "/content/PLBART/evaluation/evaluator.py", line 20, in main
        assert len(refs) == len(pres)
    AssertionError
    Traceback (most recent call last):
      File "calc_code_bleu.py", line 34, in <module>
        assert len(hypothesis) == len(pre_references[i])
    AssertionError
    

    Could you please suggest how to proceed further?

    help wanted question 
    opened by saichandrapandraju 6
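
    As an illustrative sanity check (not part of the repository): the assertion above fires when the hypothesis and reference files have different numbers of lines, and since fairseq-generate crashed earlier in the log, the hypothesis file was probably empty or incomplete. Counting lines makes that visible; the file names below are placeholders:

    def count_lines(path):
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    hyp, ref = "output.hyp", "test.target"  # placeholder paths
    print(f"hypotheses: {count_lines(hyp)}, references: {count_lines(ref)}")
    # If the counts differ (e.g. zero hypotheses), fix the generation step before scoring.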
  • AttributeError: module 'sentencepiece' has no attribute 'SentencePieceProcessor'

    Firstly, I would like to thank you for your release and documentation. I am fine-tuning the text-to-code model, and when I run scripts/text_to_code/prepare.sh I get the following error at line 30 of scripts/text_to_code/encode.py:

    AttributeError: module 'sentencepiece' has no attribute 'SentencePieceProcessor'

    Any ideas, please?

    question 
    opened by tmarrakchi 5
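
    An AttributeError like the one above often means that a different module named sentencepiece is being imported (for example, a local file or directory shadowing the installed package) or that a very old version is installed. A quick, hypothetical check:

    import sentencepiece

    # If __file__ points inside the project rather than site-packages, a local
    # "sentencepiece" file or folder is shadowing the real package.
    print(sentencepiece.__file__)
    print(getattr(sentencepiece, "__version__", "unknown version"))
    print(hasattr(sentencepiece, "SentencePieceProcessor"))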
  • Script for training PLBART

    Hi,

    Thanks for the great contribution to code generation. We would like to explore your PLBART model. Could you share with us the script for training PLBART for code generation?

    Thank you so much

    documentation help wanted 
    opened by tmarrakchi 5
  • Size of sample is invalid since max_positions=(1024, 1024)

    Hi @wasiahmad, I trained PLBART for Java -> Python translation, but while testing I got the error below:

    2021-07-21 05:31:11 | INFO | train | {"epoch": 30, "train_loss": "2.69", "train_nll_loss": "0.723", "train_ppl": "1.65", "train_wps": "7795.4", "train_ups": "0.35", "train_wpb": "22402", "train_bsz": "58.2", "train_num_updates": "240", "train_lr": "1.2e-05", "train_gnorm": "0.607", "train_train_wall": "5", "train_wall": "638"}
    2021-07-21 05:31:11 | INFO | fairseq_cli.train | done training in 637.0 seconds
    Traceback (most recent call last):
      File "/home/jovyan/.local/bin/fairseq-generate", line 8, in <module>
        sys.exit(cli_main())
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 379, in cli_main
        main(args)
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 41, in main
        return _main(args, sys.stdout)
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 132, in _main
        itr = task.get_batch_iterator(
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 227, in get_batch_iterator
        indices = self.filter_indices_by_size(
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 137, in filter_indices_by_size
        raise Exception(
    Exception: Size of sample #81 is invalid (=(1024, 1045)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test
    

    I don't understand what (1024, 1045) and (1024, 1024) mean. I'm using the default 510 for training and 9999 for testing, as below:

    if [[ $SPLIT == 'test' ]]; then
        MAX_LEN=9999 # we do not truncate test sequences
    else
        MAX_LEN=510
    fi
    

    Could you please suggest how to proceed further?

    bug help wanted 
    opened by saichandrapandraju 4
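
    A hypothetical way to see which test samples trip the 1024-position limit reported above is to tokenize each line of the test split with the released SentencePiece model and count how many exceed the model's maximum source length; the file paths below are placeholders:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("sentencepiece.bpe.model")  # placeholder path to the released model

    too_long = 0
    with open("test.java-python.java", encoding="utf-8") as f:  # placeholder split file
        for i, line in enumerate(f):
            n = len(sp.EncodeAsPieces(line.strip()))
            if n > 1020:  # leave headroom for special and language tokens
                too_long += 1
                print(f"sample #{i}: {n} subword tokens")
    print("samples over the limit:", too_long)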
  • Confused about the "max-sentences" in pretraining

    Hi,

    In the pretraining script, you set max-sentences to 32. Since max-sentences is per GPU, PER_GPU_TRAIN_BATCH_SIZE would be 32. But max-tokens is 2048 and tokens-per-sample is 512, so PER_GPU_TRAIN_BATCH_SIZE would be 4. Why do these two parameters conflict?

    Thanks

    question 
    opened by freexxxyyy 3
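
    For context on the question above (a general note about fairseq batching, not an authoritative statement about this repository's configuration): fairseq treats --max-sentences and --max-tokens as two independent caps and stops filling a batch when either one is reached, so with full-length 512-token samples the effective per-GPU batch is the smaller of the two:

    max_sentences = 32       # --max-sentences
    max_tokens = 2048        # --max-tokens
    tokens_per_sample = 512  # --tokens-per-sample

    effective_batch = min(max_sentences, max_tokens // tokens_per_sample)
    print(effective_batch)  # 4 when every sample is packed to 512 tokens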
  • Missing "java" token in Hugging Face Tokenizer

    Hi,

    I am trying to replicate the results of PLBART for the code refinement fine-tuning task using Hugging Face. When I tokenize methods that contain the "java" token and then decode them, the "java" token is strangely removed! Here is my code:

    code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
    tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
    model_inputs = tokenizer([code])
    print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
    # The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
    

    Also, is there any Hugging Face implementation of the code refinement task using PLBART? My implementation does not achieve the EM and BLEU reported for the test set. I executed the existing fairseq implementation and got EM: 17.67, whereas my Hugging Face implementation gets EM: 5.62! What important factors should I check?

    bug 
    opened by Ahmadreza-SY 3
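
    One way to investigate the behaviour above (an illustrative check, not an official explanation): if "java" is registered as a special token in this tokenizer, which has been the case for the base language codes in some versions, then skip_special_tokens=True will strip it during decoding:

    from transformers import PLBartTokenizer

    tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
    print(tokenizer.all_special_tokens)            # strings the decoder treats as special
    print("java" in tokenizer.all_special_tokens)  # True would explain the dropped token
    # If it is special, decode with skip_special_tokens=False and remove only the
    # tokens you actually want stripped (e.g. <s>, </s>, <pad>).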
  • Can we add structural information in PLBART?

    I know this is not the perfect question to add to this repo. I understand PLBART is based on a transformer architecture and deals with sequence-based text streams.

    But I would be curious to know whether we can embed structural information, like AST-level edges, in PLBART.

    opened by smith-co 3
  • unzip unsuccessful when running download.sh

    I tried to fine-tune on the code refinement task from the PLBART paper. I set up the conda environment with bash install_env.sh, then downloaded the checkpoints. However, when I ran bash download.sh under data/codeXglue, I got the output below, which suggests that either the unzip or the download was unsuccessful.

    (screenshot of the download.sh output)

    Am I missing some steps in the setup?

    opened by chungen04 3
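
    A quick, hypothetical way to tell whether a download like the one above actually produced a valid archive (a failed download often saves an HTML error page under the expected file name); the file name below is a placeholder:

    import zipfile

    path = "data.zip"  # placeholder for the downloaded archive
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # returns the first corrupt member, or None
            print("archive OK" if bad is None else f"corrupt member: {bad}")
    else:
        print("not a valid zip archive; the download probably failed, so re-run download.sh")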