Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART

Code pre-release of our work, Unified Pre-training for Program Understanding and Generation, accepted at NAACL 2021.

Note: Detailed documentation is coming soon.

Pre-training data

PLBART is pre-trained on Java and Python functions and natural language descriptions collected from GitHub and StackOverflow.

Evaluation tasks

We evaluated PLBART on five tasks.

  • Code summarization [REF]
  • Code generation [REF]
  • Code translation [REF]
  • Clone detection [REF]
  • Vulnerability detection [REF]

Notes

  • We will publish the pretrained PLBART checkpoint soon.
  • We list all the files in this repository here.

Acknowledgement

PLBART builds on Fairseq, CodeXGLUE, and TransCoder, and we thank the authors of these works for their contributions.

Citation

@inproceedings{ahmad2020summarization,
    author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
    booktitle = {Proceedings of the 2021 Conference of the North {A}merican Chapter of the Association for Computational Linguistics},
    title = {Unified Pre-training for Program Understanding and Generation},
    year = {2021}
}
Comments
  • Multilingual `prepare.sh` throws an error after downloading

    While running prepare.sh, the following errors are thrown for all the languages in the multilingual directory:

    FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/valid.php-en_XX.php'
    Traceback (most recent call last):
      File "encode.py", line 92, in <module>
        main()
      File "encode.py", line 88, in main
        process(args)
      File "encode.py", line 49, in process
        with open(args.input_source, 'r', encoding='utf-8') as f1, \
    FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/test.php-en_XX.php'
    

    Any help on this?
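
    A minimal sanity check (a sketch only; the directory layout and language list below are assumptions inferred from the error messages) that reports which expected split files are missing before encode.py is invoked:

    import os

    # Assumed layout, inferred from the paths in the traceback above.
    DATA_DIR = "multilingual/data/processed"
    LANGUAGES = ["java", "python", "php"]  # hypothetical subset; adjust to your setup
    SPLITS = ["train", "valid", "test"]

    for lang in LANGUAGES:
        for split in SPLITS:
            path = os.path.join(DATA_DIR, f"{split}.{lang}-en_XX.{lang}")
            if not os.path.exists(path):
                print(f"missing: {path} (the download/extraction step likely failed)")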

    bug 
    opened by gchhablani 8
  • questions about dict.txt and data samples specify methods

    1. Which step generates the dict.txt? It seems to be generated during "fairseq-preprocess", but "fairseq-preprocess" also has a parameter "--srcdict $DICT_FILE".
    2. How does the model know where a data sample (a function, in this case) ends? It seems that you use "\n", but functions also contain "\n", so I am confused about this (see the sketch below).
    3. Also, will it work if I run fairseq-train FILENAME_OF_FIRST_DATASAMPLE:FILENAME_OF_SECOND_DATASAMPLE:FILENAME_OF_THIRD_DATASAMPLE:.....:FILENAME_OF_NTH_DATASAMPLE?

    I am new to this. Thanks.
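
    On question 2, the usual convention is to serialize each function onto a single line before tokenization, so that "\n" only ever separates samples. A minimal sketch of that idea (generic preprocessing, not necessarily PLBART's exact pipeline; file and function names here are placeholders):

    def flatten_function(code: str) -> str:
        """Collapse a multi-line function into a single whitespace-separated line."""
        # Replace internal newlines/tabs with single spaces so that the only
        # remaining '\n' characters are the ones separating data samples.
        return " ".join(code.split())

    samples = [
        "def add(a, b):\n    return a + b",
        "def sub(a, b):\n    return a - b",
    ]

    # One sample per line in the resulting file.
    with open("train.functions", "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(flatten_function(sample) + "\n")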

    opened by freexxxyyy 6
  • Mismatch when loading the checkpoints

    Hi, thanks for your great work!

    When I tried to load the pre-trained checkpoints and fine-tune, I ran into a size mismatch problem. It seems that the dict.txt you provided does not match the checkpoints.

    Here is the error message:

    size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).
    size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).

    This is the script I used to get the checkpoints: https://github.com/wasiahmad/PLBART/blob/main/pretrain/download.sh

    This is the dict.txt I used: https://github.com/wasiahmad/PLBART/blob/main/sentencepiece/dict.txt

    Here is the command I used to fine-tune:

    fairseq-train $PATH_2_DATA \
        --user-dir $USER_DIR --truncate-source \
        --arch mbart_base --layernorm-embedding \
        --task translation \
        --source-lang $SOURCE --target-lang $TARGET \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --batch-size $BATCH_SIZE --update-freq $UPDATE_FREQ --max-epoch 30 \
        --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
        --lr-scheduler polynomial_decay --lr 5e-05 --min-lr -1 \
        --warmup-updates 500 --max-update 100000 \
        --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.0 \
        --seed 1234 --log-format json --log-interval 100 \
        ${restore} \
        --eval-bleu --eval-bleu-detok space --eval-tokenized-bleu \
        --eval-bleu-remove-bpe sentencepiece --eval-bleu-args '{"beam": 5}' \
        --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
        --no-epoch-checkpoints --patience 5 \
        --ddp-backend no_c10d --save-dir $SAVE_DIR 2>&1 | tee ${OUTPUT_FILE};
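
    A quick way to see where the 4-token gap comes from (a sketch assuming the standard fairseq checkpoint layout; the checkpoint file name is a placeholder): fairseq builds its dictionary from dict.txt plus four built-in specials, and a pre-training task can append further symbols (e.g. language ids, <mask>) on top of that.

    import torch

    # Placeholder paths; adjust to your local setup.
    CHECKPOINT = "plbart_base.pt"          # hypothetical name for the downloaded checkpoint
    DICT_FILE = "sentencepiece/dict.txt"

    # Standard fairseq layout: model weights live under the 'model' key.
    state = torch.load(CHECKPOINT, map_location="cpu")
    print("checkpoint vocab size:", state["model"]["encoder.embed_tokens.weight"].shape[0])

    # dict.txt entries + 4 built-in specials (<s>, <pad>, </s>, <unk>);
    # any extra task symbols come on top of this count.
    with open(DICT_FILE, encoding="utf-8") as f:
        print("dict.txt entries + 4 specials:", sum(1 for _ in f) + 4)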

    question 
    opened by JiyangZhang 6
  • AssertionError while evaluating 'translation'

    Hi @wasiahmad ,

    I am trying out the 'translation' capabilities of PLBART and started fine-tuning as described. But I'm getting the error below during evaluation:

    File "calc_code_bleu.py", line 34, in <module>
        assert len(hypothesis) == len(pre_references[i])
    AssertionError
    

    Here is a more detailed traceback:

    2021-07-15 13:57:58 | INFO | fairseq_cli.train | early stop since valid performance hasn't improved for last 10 runs
    2021-07-15 13:57:58 | INFO | fairseq_cli.train | begin save checkpoint
    2021-07-15 13:58:19 | INFO | fairseq.checkpoint_utils | saved checkpoint /content/PLBART/scripts/code_to_code/translation/java_cs/checkpoint_last.pt (epoch 22 @ 14168 updates, score 80.08) (writing took 20.417050701000335 seconds)
    2021-07-15 13:58:19 | INFO | fairseq_cli.train | end of epoch 22 (average epoch stats below)
    2021-07-15 13:58:19 | INFO | train | {"epoch": 22, "train_loss": "2.08", "train_nll_loss": "0.177", "train_ppl": "1.13", "train_wps": "1562.9", "train_ups": "1.76", "train_wpb": "890.1", "train_bsz": "16", "train_num_updates": "14168", "train_lr": "4.93409e-05", "train_gnorm": "0.534", "train_train_wall": "255", "train_wall": "8414"}
    2021-07-15 13:58:19 | INFO | fairseq_cli.train | done training in 8412.8 seconds
      0% 0/250 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py:172: UserWarning: --sacrebleu is deprecated. Please use --scoring sacrebleu instead.
      scorer = scoring.build_scorer(args, tgt_dict)
    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-generate", line 8, in <module>
        sys.exit(cli_main())
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 379, in cli_main
        main(args)
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 41, in main
        return _main(args, sys.stdout)
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 172, in _main
        scorer = scoring.build_scorer(args, tgt_dict)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/scoring/__init__.py", line 54, in build_scorer
        return _build_scorer(args)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/registry.py", line 54, in build_x
        return builder(args, *extra_args, **extra_kwargs)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/scoring/bleu.py", line 40, in __init__
        character_tokenization=self.args.sacrebleu_char_level,
    AttributeError: 'Namespace' object has no attribute 'sacrebleu_char_level'
    Traceback (most recent call last):
      File "/content/PLBART/evaluation/evaluator.py", line 36, in <module>
        main()
      File "/content/PLBART/evaluation/evaluator.py", line 20, in main
        assert len(refs) == len(pres)
    AssertionError
    Traceback (most recent call last):
      File "calc_code_bleu.py", line 34, in <module>
        assert len(hypothesis) == len(pre_references[i])
    AssertionError
    

    Could you please suggest how to proceed further?
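
    The --sacrebleu crash above means fairseq-generate exits before writing a complete hypothesis file, so the downstream evaluator and calc_code_bleu.py see mismatched lengths. A quick pre-check before scoring (a generic sketch; the file names are placeholders for whatever your evaluation run produces):

    # Placeholder file names; substitute the hypothesis/reference files
    # extracted from your translation evaluation run.
    HYP_FILE = "hypotheses.txt"
    REF_FILE = "references.txt"

    with open(HYP_FILE, encoding="utf-8") as f:
        hyps = f.read().splitlines()
    with open(REF_FILE, encoding="utf-8") as f:
        refs = f.read().splitlines()

    print(len(hyps), "hypotheses vs", len(refs), "references")
    if len(hyps) != len(refs):
        print("Generation failed or was truncated; fix the fairseq-generate step "
              "(e.g. use a fairseq version whose scoring flags match) before "
              "computing BLEU/CodeBLEU.")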

    help wanted question 
    opened by saichandrapandraju 6
  • AttributeError: module 'sentencepiece' has no attribute 'SentencePieceProcessor'

    Firstly, I would like to thank you for your release and documentation. I am fine-tuning the text-to-code model, and when I run "scripts/text_to_code/prepare.sh" I get the following error at "scripts/text_to_code/encode.py", line 30:

    AttributeError: module 'sentencepiece' has no attribute 'SentencePieceProcessor'

    Any idea, please?
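
    This is usually an environment problem rather than a PLBART bug: either a local file or directory named sentencepiece shadows the installed package, or the install is old or broken. A small diagnostic (a sketch; run it from the same directory as prepare.sh):

    import sentencepiece

    # If this prints a path inside your project instead of site-packages,
    # a local "sentencepiece" file or folder is shadowing the library.
    print(sentencepiece.__file__)

    # An outdated or broken install can also lack SentencePieceProcessor;
    # `pip install --upgrade sentencepiece` usually fixes that.
    print(getattr(sentencepiece, "__version__", "unknown"))
    print(hasattr(sentencepiece, "SentencePieceProcessor"))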

    question 
    opened by tmarrakchi 5
  • Script for training PLBART

    Hi,

    Thanks for the great contribution to code generation. We would like to explore your PLBART model. Could you share with us the script for training PLBART for code generation?

    Thank you so much

    documentation help wanted 
    opened by tmarrakchi 5
  • Size of sample is invalid since max_positions=(1024, 1024)

    Hi @wasiahmad, I trained PLBART for Java -> Python translation, but while testing I got the error below:

    2021-07-21 05:31:11 | INFO | train | {"epoch": 30, "train_loss": "2.69", "train_nll_loss": "0.723", "train_ppl": "1.65", "train_wps": "7795.4", "train_ups": "0.35", "train_wpb": "22402", "train_bsz": "58.2", "train_num_updates": "240", "train_lr": "1.2e-05", "train_gnorm": "0.607", "train_train_wall": "5", "train_wall": "638"}
    2021-07-21 05:31:11 | INFO | fairseq_cli.train | done training in 637.0 seconds
    Traceback (most recent call last):
      File "/home/jovyan/.local/bin/fairseq-generate", line 8, in <module>
        sys.exit(cli_main())
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 379, in cli_main
        main(args)
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 41, in main
        return _main(args, sys.stdout)
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 132, in _main
        itr = task.get_batch_iterator(
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 227, in get_batch_iterator
        indices = self.filter_indices_by_size(
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 137, in filter_indices_by_size
        raise Exception(
    Exception: Size of sample #81 is invalid (=(1024, 1045)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test
    

    I didn't understand what (1024, 1045) and (1024, 1024) mean. I'm using the default 510 for training and 9999 for testing, as below:

    if [[ $SPLIT == 'test' ]]; then
        MAX_LEN=9999 # we do not truncate test sequences
    else
        MAX_LEN=510
    fi
    

    Could you please suggest how to proceed further?
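
    For context: (1024, 1045) is the (source, target) token length of test example #81, and (1024, 1024) is the model's (max_source_positions, max_target_positions). Since MAX_LEN=9999 leaves test sequences untruncated, any example longer than the model limit trips this check; it can be skipped with --skip-invalid-size-inputs-valid-test, at the cost of dropping those examples. A rough way to count the offenders (a sketch; the file names are placeholders for the SPM-encoded test split):

    MAX_POSITIONS = 1024  # model limit for both source and target

    # Placeholder file names for the SPM-encoded test split produced by prepare.sh.
    for path in ["test.spm.java", "test.spm.python"]:
        too_long = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                # After sentencepiece encoding, tokens are whitespace-separated;
                # +2 leaves room for the boundary tokens fairseq appends.
                if len(line.split()) + 2 > MAX_POSITIONS:
                    too_long += 1
        print(path, "->", too_long, "examples exceed the model limit")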

    bug help wanted 
    opened by saichandrapandraju 4
  • Confused about the "max-sentences" in pretraining

    Hi,

    In the pretraining script, you set max-sentences to 32. Max-sentences is per GPU, so PER_GPU_TRAIN_BATCH_SIZE should be 32. But max-tokens is 2048 and tokens-per-sample is 512, which implies PER_GPU_TRAIN_BATCH_SIZE is 4. Why do these two parameters conflict?

    Thanks
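
    For what it's worth, fairseq applies both limits at once: a batch is capped by whichever of --max-sentences and --max-tokens is hit first, so with fixed 512-token samples the token budget dominates. A back-of-the-envelope sketch (not the actual fairseq batching code):

    max_sentences = 32        # --max-sentences
    max_tokens = 2048         # --max-tokens
    tokens_per_sample = 512   # --tokens-per-sample

    # With fixed-length 512-token samples, the token budget is the binding limit.
    effective_batch = min(max_sentences, max_tokens // tokens_per_sample)
    print(effective_batch)  # -> 4 sentences per GPU per step (before --update-freq)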

    question 
    opened by freexxxyyy 3
  • Missing "java" token in Hugging Face Tokenizer

    Hi,

    I am trying to replicate the results of PLBART on the code refinement fine-tuning task using Hugging Face. When I tokenize methods that contain the "java" token and then decode them, the "java" token is unexpectedly removed! Here is my code:

    from transformers import PLBartTokenizer as model_tokenizer_class  # import added; assuming this is the tokenizer class used below

    code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
    tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
    model_inputs = tokenizer([code])
    print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
    # The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
    

    Also, is there any Hugging Face implementation of the code refinement task using PLBART? My implementation does not achieve the EM and BLEU reported for the test set. I ran the existing fairseq implementation and got EM 17.67, but my Hugging Face implementation gets EM 5.62! What important factors should I check?
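
    One plausible explanation for the dropped token (worth verifying against your transformers version) is that "java" is registered as a language-code special token in the PLBART tokenizer, so skip_special_tokens=True strips it from the decoded text. A quick check, assuming model_tokenizer_class above is transformers.PLBartTokenizer:

    from transformers import PLBartTokenizer

    tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")

    # If "java" appears here, skip_special_tokens=True will drop it when decoding.
    print(tokenizer.additional_special_tokens)

    ids = tokenizer("throws java.lang.Exception")["input_ids"]
    print(tokenizer.decode(ids, skip_special_tokens=True))   # "java" may be stripped
    print(tokenizer.decode(ids, skip_special_tokens=False))  # language codes retained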

    bug 
    opened by Ahmadreza-SY 3
  • Can we add structural information in PLBART?

    I know this is not the perfect question to ask in this repo. I understand PLBART is based on a Transformer architecture and deals with sequence-based text streams.

    But I would be curious to know whether we can embed structural information, like AST-level edges, in PLBART.

    opened by smith-co 3
  • unzip unsuccessful when running download.sh

    I tried to fine-tune the code refinement task from the PLBART paper. I set up the conda environment with bash install_env.sh and then downloaded the checkpoints. However, when I run bash download.sh under data/codeXglue, I get the output below, which suggests that either the unzip or the download was unsuccessful.

    (screenshot of the error output)

    Am I missing some steps in the setup?
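
    A quick way to tell whether the download or the extraction is at fault (a sketch; the archive name is a placeholder for whatever download.sh fetches under data/codeXglue):

    import os
    import zipfile

    archive = "data.zip"  # hypothetical name; substitute the file download.sh produces

    if not os.path.exists(archive):
        print("Archive not found: the download step itself failed.")
    elif not zipfile.is_zipfile(archive):
        print(f"{archive} exists ({os.path.getsize(archive)} bytes) but is not a valid zip; "
              "the download was probably truncated. Delete it and re-run download.sh.")
    else:
        print("Archive looks fine; check that `unzip` is installed in this environment.")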

    opened by chungen04 3