Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

Overview

PLBART

Code pre-release of our work, Unified Pre-training for Program Understanding and Generation, accepted at NAACL 2021.

Note: Detailed documentation is coming soon.

Pre-training data

PLBART is pre-trained on Java and Python functions and natural language descriptions collected from GitHub and Stack Overflow.
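
For illustration, here is a minimal sketch (not the repository's own pipeline) of how a single function could be segmented into subword pieces with SentencePiece before fairseq binarization, assuming the released sentencepiece.bpe.model is available in the working directory:

    import sentencepiece as spm

    # Assumption: sentencepiece.bpe.model is the model shipped with this repository.
    sp = spm.SentencePieceProcessor()
    sp.Load("sentencepiece.bpe.model")

    function = "public int add(int a, int b) { return a + b; }"
    pieces = sp.EncodeAsPieces(function)
    print(" ".join(pieces))  # space-joined subword tokens, ready for fairseq-preprocess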

Evaluation tasks

We evaluated PLBART on five tasks.

  • Code summarization [REF]
  • Code generation [REF]
  • Code translation [REF]
  • Clone detection [REF]
  • Vulnerability detection [REF]

Notes

  • We will publish the pretrained PLBART checkpoint soon.
  • We list all the files in this repository here.

Acknowledgement

PLBART uses Fairseq, CodeXGLUE, and TransCoder; we thank the authors of these works for their contributions.

Citation

@inproceedings{ahmad2020summarization,
    author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
    booktitle = {Proceedings of the 2021 Conference of the North {A}merican Chapter of the Association for Computational Linguistics},
    title = {Unified Pre-training for Program Understanding and Generation},
    year = {2021}
}
Comments
  • Multilingual `prepare.sh` throws an error after downloading

    While running prepare.sh, the following errors are thrown for all the languages in the multilingual directory:

    FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/valid.php-en_XX.php'
    Traceback (most recent call last):
      File "encode.py", line 92, in <module>
        main()
      File "encode.py", line 88, in main
        process(args)
      File "encode.py", line 49, in process
        with open(args.input_source, 'r', encoding='utf-8') as f1, \
    FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/test.php-en_XX.php'
    

    Any help on this?

    bug 
    opened by gchhablani 8
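
    As a side note, here is a minimal, hypothetical sketch (not code from this repository) of a guard that could wrap the open() call in encode.py so that a missing processed split fails with a more actionable message than the bare FileNotFoundError above:

    import os
    import sys

    def open_checked(path):
        # Fail early with a hint instead of a bare FileNotFoundError.
        if not os.path.exists(path):
            sys.exit(f"Missing processed file '{path}'. "
                     "Re-run the download/extraction steps in prepare.sh before encoding.")
        return open(path, 'r', encoding='utf-8')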
  • questions about dict.txt and how data samples are specified

    1. Which step generates the dict.txt? It seems to be generated during "fairseq-preprocess", but "fairseq-preprocess" also has a parameter "--srcdict $DICT_FILE".
    2. How does the model know where a data sample (a function in this case) ends? It seems that you use "\n", but functions also contain "\n", so I am confused about this.
    3. Also, will fairseq-train work if I pass FILENAME_OF_FIRST_DATASAMPLE:FILENAME_OF_SECOND_DATASAMPLE:FILENAME_OF_THIRD_DATASAMPLE:.....:FILENAME_OF_NTH_DATASAMPLE?

    I am new to this. Thanks.

    opened by freexxxyyy 6
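
    Regarding question 2 above, a common convention in fairseq-style preprocessing (a sketch of the general idea, not necessarily the exact code this repository uses) is to serialize each function onto a single line, so that "\n" only ever separates samples:

    import re

    def serialize_function(code: str) -> str:
        # Collapse all internal whitespace, including newlines, onto one line,
        # leaving "\n" free to act purely as the sample delimiter.
        return re.sub(r"\s+", " ", code).strip()

    functions = [
        "def add(a, b):\n    return a + b\n",
        "def sub(a, b):\n    return a - b\n",
    ]

    with open("train.functions.python", "w", encoding="utf-8") as f:  # hypothetical output file
        for fn in functions:
            f.write(serialize_function(fn) + "\n")  # one sample per line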
  • Mismatch when loading the checkpoints

    Hi, thanks for your great work!

    When I tried to load the pre-trained checkpoints and fine-tune, I came across a size mismatch problem. It seems that the dict.txt you provided does not match the checkpoints.

    Here is the error message:

    size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).
    size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).
    size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([50005, 768]) from checkpoint, the shape in current model is torch.Size([50001, 768]).

    This is the script I used to get the checkpoints: https://github.com/wasiahmad/PLBART/blob/main/pretrain/download.sh

    This is the dict.txt I used: https://github.com/wasiahmad/PLBART/blob/main/sentencepiece/dict.txt

    Here is the command I used to fine-tune:

    fairseq-train $PATH_2_DATA \
        --user-dir $USER_DIR --truncate-source \
        --arch mbart_base --layernorm-embedding \
        --task translation \
        --source-lang $SOURCE --target-lang $TARGET \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --batch-size $BATCH_SIZE --update-freq $UPDATE_FREQ --max-epoch 30 \
        --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
        --lr-scheduler polynomial_decay --lr 5e-05 --min-lr -1 \
        --warmup-updates 500 --max-update 100000 \
        --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.0 \
        --seed 1234 --log-format json --log-interval 100 \
        ${restore} \
        --eval-bleu --eval-bleu-detok space --eval-tokenized-bleu \
        --eval-bleu-remove-bpe sentencepiece --eval-bleu-args '{"beam": 5}' \
        --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
        --no-epoch-checkpoints --patience 5 \
        --ddp-backend no_c10d --save-dir $SAVE_DIR 2>&1 | tee ${OUTPUT_FILE};

    question 
    opened by JiyangZhang 6
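
    A small, hypothetical diagnostic for a mismatch like the one above: compare the vocabulary size stored in the checkpoint with the number of entries in dict.txt, keeping in mind that fairseq adds four special symbols (<s>, <pad>, </s>, <unk>) on top of the dictionary file and that tasks may add further symbols such as language tags or <mask>. This assumes the checkpoint keeps its weights under the 'model' key, as fairseq checkpoints normally do; the file names are placeholders:

    import torch

    ckpt = torch.load("plbart_base.pt", map_location="cpu")  # placeholder checkpoint path
    emb = ckpt["model"]["encoder.embed_tokens.weight"]
    print("vocabulary size in checkpoint:", emb.shape[0])

    with open("dict.txt", encoding="utf-8") as f:
        n_dict = sum(1 for _ in f)
    print("dict.txt entries + 4 fairseq specials:", n_dict + 4)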
  • AssertionError while evaluating 'translation'

    Hi @wasiahmad ,

    I am trying out the 'translation' capabilities of PLBART and started fine-tuning as described, but I'm getting the error below during evaluation:

    File "calc_code_bleu.py", line 34, in <module>
        assert len(hypothesis) == len(pre_references[i])
    AssertionError
    

    Here is a bit detailed traceback -

    2021-07-15 13:57:58 | INFO | fairseq_cli.train | early stop since valid performance hasn't improved for last 10 runs
    2021-07-15 13:57:58 | INFO | fairseq_cli.train | begin save checkpoint
    2021-07-15 13:58:19 | INFO | fairseq.checkpoint_utils | saved checkpoint /content/PLBART/scripts/code_to_code/translation/java_cs/checkpoint_last.pt (epoch 22 @ 14168 updates, score 80.08) (writing took 20.417050701000335 seconds)
    2021-07-15 13:58:19 | INFO | fairseq_cli.train | end of epoch 22 (average epoch stats below)
    2021-07-15 13:58:19 | INFO | train | {"epoch": 22, "train_loss": "2.08", "train_nll_loss": "0.177", "train_ppl": "1.13", "train_wps": "1562.9", "train_ups": "1.76", "train_wpb": "890.1", "train_bsz": "16", "train_num_updates": "14168", "train_lr": "4.93409e-05", "train_gnorm": "0.534", "train_train_wall": "255", "train_wall": "8414"}
    2021-07-15 13:58:19 | INFO | fairseq_cli.train | done training in 8412.8 seconds
      0% 0/250 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py:172: UserWarning: --sacrebleu is deprecated. Please use --scoring sacrebleu instead.
      scorer = scoring.build_scorer(args, tgt_dict)
    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-generate", line 8, in <module>
        sys.exit(cli_main())
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 379, in cli_main
        main(args)
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 41, in main
        return _main(args, sys.stdout)
      File "/usr/local/lib/python3.7/dist-packages/fairseq_cli/generate.py", line 172, in _main
        scorer = scoring.build_scorer(args, tgt_dict)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/scoring/__init__.py", line 54, in build_scorer
        return _build_scorer(args)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/registry.py", line 54, in build_x
        return builder(args, *extra_args, **extra_kwargs)
      File "/usr/local/lib/python3.7/dist-packages/fairseq/scoring/bleu.py", line 40, in __init__
        character_tokenization=self.args.sacrebleu_char_level,
    AttributeError: 'Namespace' object has no attribute 'sacrebleu_char_level'
    Traceback (most recent call last):
      File "/content/PLBART/evaluation/evaluator.py", line 36, in <module>
        main()
      File "/content/PLBART/evaluation/evaluator.py", line 20, in main
        assert len(refs) == len(pres)
    AssertionError
    Traceback (most recent call last):
      File "calc_code_bleu.py", line 34, in <module>
        assert len(hypothesis) == len(pre_references[i])
    AssertionError
    

    Could you please suggest how to proceed further?

    help wanted question 
    opened by saichandrapandraju 6
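
    As an illustrative sanity check (not part of the repository): the assertion above fires when the hypothesis and reference files have different numbers of lines, and since fairseq-generate crashed earlier in the log, the hypothesis file was probably empty or incomplete. Counting lines makes that visible; the file names below are placeholders:

    def count_lines(path):
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    hyp, ref = "output.hyp", "test.target"  # placeholder paths
    print(f"hypotheses: {count_lines(hyp)}, references: {count_lines(ref)}")
    # If the counts differ (e.g. zero hypotheses), fix the generation step before scoring.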
  • AttributeError: module 'sentencepiece' has no attribute 'SentencePieceProcessor'

    Firstly, I would like to thank you for your release and documentation. I am fine-tuning the text-to-code model, and when I run scripts/text_to_code/prepare.sh I get the following error at line 30 of scripts/text_to_code/encode.py:

    AttributeError: module 'sentencepiece' has no attribute 'SentencePieceProcessor'

    Any ideas, please?

    question 
    opened by tmarrakchi 5
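
    An AttributeError like the one above often means that a different module named sentencepiece is being imported (for example, a local file or directory shadowing the installed package) or that a very old version is installed. A quick, hypothetical check:

    import sentencepiece

    # If __file__ points inside the project rather than site-packages, a local
    # "sentencepiece" file or folder is shadowing the real package.
    print(sentencepiece.__file__)
    print(getattr(sentencepiece, "__version__", "unknown version"))
    print(hasattr(sentencepiece, "SentencePieceProcessor"))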
  • Script for training PLBART

    Hi,

    Thanks for the great contribution to code generation. We would like to explore your PLBART model. Could you share with us the script for training PLBART for code generation?

    Thank you so much

    documentation help wanted 
    opened by tmarrakchi 5
  • Size of sample is invalid since max_positions=(1024, 1024)

    Hi @wasiahmad, I trained PLBART for Java -> Python translation, but while testing I got the error below:

    2021-07-21 05:31:11 | INFO | train | {"epoch": 30, "train_loss": "2.69", "train_nll_loss": "0.723", "train_ppl": "1.65", "train_wps": "7795.4", "train_ups": "0.35", "train_wpb": "22402", "train_bsz": "58.2", "train_num_updates": "240", "train_lr": "1.2e-05", "train_gnorm": "0.607", "train_train_wall": "5", "train_wall": "638"}
    2021-07-21 05:31:11 | INFO | fairseq_cli.train | done training in 637.0 seconds
    Traceback (most recent call last):
      File "/home/jovyan/.local/bin/fairseq-generate", line 8, in <module>
        sys.exit(cli_main())
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 379, in cli_main
        main(args)
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 41, in main
        return _main(args, sys.stdout)
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 132, in _main
        itr = task.get_batch_iterator(
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 227, in get_batch_iterator
        indices = self.filter_indices_by_size(
      File "/home/jovyan/.local/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 137, in filter_indices_by_size
        raise Exception(
    Exception: Size of sample #81 is invalid (=(1024, 1045)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test
    

    I don't understand what (1024, 1045) and (1024, 1024) mean. I'm using the default 510 for training and 9999 for testing, as below:

    if [[ $SPLIT == 'test' ]]; then
        MAX_LEN=9999 # we do not truncate test sequences
    else
        MAX_LEN=510
    fi
    

    Could you please suggest how to proceed further?

    bug help wanted 
    opened by saichandrapandraju 4
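
    A hypothetical way to see which test samples trip the 1024-position limit reported above is to tokenize each line of the test split with the released SentencePiece model and count how many exceed the model's maximum source length; the file paths below are placeholders:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("sentencepiece.bpe.model")  # placeholder path to the released model

    too_long = 0
    with open("test.java-python.java", encoding="utf-8") as f:  # placeholder split file
        for i, line in enumerate(f):
            n = len(sp.EncodeAsPieces(line.strip()))
            if n > 1020:  # leave headroom for special and language tokens
                too_long += 1
                print(f"sample #{i}: {n} subword tokens")
    print("samples over the limit:", too_long)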
  • Confused about the "max-sentences" in pretraining

    Hi,

    In the pretraining script, you set max-sentences to 32. Since max-sentences is per GPU, PER_GPU_TRAIN_BATCH_SIZE would be 32. But max-tokens is 2048 and tokens-per-sample is 512, so PER_GPU_TRAIN_BATCH_SIZE would be 4. Why do these two parameters conflict?

    Thanks

    question 
    opened by freexxxyyy 3
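
    For context on the question above (a general note about fairseq batching, not an authoritative statement about this repository's configuration): fairseq treats --max-sentences and --max-tokens as two independent caps and stops filling a batch when either one is reached, so with full-length 512-token samples the effective per-GPU batch is the smaller of the two:

    max_sentences = 32       # --max-sentences
    max_tokens = 2048        # --max-tokens
    tokens_per_sample = 512  # --tokens-per-sample

    effective_batch = min(max_sentences, max_tokens // tokens_per_sample)
    print(effective_batch)  # 4 when every sample is packed to 512 tokens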
  • Missing "java" token in Hugging Face Tokenizer

    Hi,

    I am trying to replicate the results of PLBART for the code refinement fine-tuning task using Hugging Face. When I tokenize methods that contain the "java" token and then decode them, the "java" token is strangely removed! Here is my code:

    code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
    tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
    model_inputs = tokenizer([code])
    print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
    # The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
    

    Also, is there any Hugging Face implementation of the code refinement task using PLBART? My implementation does not achieve the EM and BLEU reported for the test set. I executed the existing fairseq implementation and got EM: 17.67, whereas my Hugging Face implementation gets EM: 5.62! What important factors should I check?

    bug 
    opened by Ahmadreza-SY 3
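
    One way to investigate the behaviour above (an illustrative check, not an official explanation): if "java" is registered as a special token in this tokenizer, which has been the case for the base language codes in some versions, then skip_special_tokens=True will strip it during decoding:

    from transformers import PLBartTokenizer

    tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
    print(tokenizer.all_special_tokens)            # strings the decoder treats as special
    print("java" in tokenizer.all_special_tokens)  # True would explain the dropped token
    # If it is special, decode with skip_special_tokens=False and remove only the
    # tokens you actually want stripped (e.g. <s>, </s>, <pad>).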
  • Can we add structural information in PLBART?

    I know this is not the perfect question to add to this repo. I understand PLBART is based on a transformer architecture and deals with sequence-based text streams.

    But I would be curious to know whether we can embed structural information, like AST-level edges, in PLBART.

    opened by smith-co 3
  • unzip unsuccessful when running download.sh

    I tried to fine-tune on the code refinement task from the PLBART paper. I set up the conda environment with bash install_env.sh, then downloaded the checkpoints. However, when I ran bash download.sh under data/codeXglue, I got the output below, which suggests that either the unzip or the download was unsuccessful.

    (screenshot of the download.sh output)

    Am I missing some steps in the setup?

    opened by chungen04 3
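
    A quick, hypothetical way to tell whether a download like the one above actually produced a valid archive (a failed download often saves an HTML error page under the expected file name); the file name below is a placeholder:

    import zipfile

    path = "data.zip"  # placeholder for the downloaded archive
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # returns the first corrupt member, or None
            print("archive OK" if bad is None else f"corrupt member: {bad}")
    else:
        print("not a valid zip archive; the download probably failed, so re-run download.sh")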