AVATAR
- Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
- AVATAR stands for jAVA-pyThon progrAm tRanslation.
- AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
- Supervised fine-tuning and evaluation in terms of Computational Accuracy; see details here.
Table of Contents
- Dataset
- Models
- Training & Evaluation
- Benchmarks
- License
- Citation
Dataset
We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.
- CodeForces
- AtCoder
- CodeJam
- GeeksforGeeks
- LeetCode
- ProjectEuler
The collected data can be downloaded by running:
cd data
bash download.sh
To prepare the data, we perform the following steps (a minimal sketch of the filtering and de-duplication logic follows the commands below).
- Remove docstrings, comments, etc.
- Tokenize with the baseline models' tokenizers.
- Filter out examples longer than the length threshold (~512 tokens).
- De-duplicate (remove duplicate examples).
To perform the preparation, run:
cd data
bash prepare.sh
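For illustration only, here is a minimal Python sketch of the length-filtering and de-duplication steps listed above. It assumes each example is a dict holding already-tokenized "java" and "python" token lists; the field names and helper functions are hypothetical, and the actual logic lives in prepare.sh and the scripts it invokes.

```python
# Illustrative sketch only -- the real pipeline is driven by data/prepare.sh.
# Assumes each example is a dict with already-tokenized "java" and "python"
# token lists; names and structure here are hypothetical.

MAX_LEN = 512  # approximate length threshold mentioned above


def keep(example):
    """Drop examples whose Java or Python side exceeds the length threshold."""
    return len(example["java"]) <= MAX_LEN and len(example["python"]) <= MAX_LEN


def deduplicate(examples):
    """Remove exact duplicates, keyed on the (Java, Python) token sequences."""
    seen = set()
    unique = []
    for ex in examples:
        key = (" ".join(ex["java"]), " ".join(ex["python"]))
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def prepare(examples):
    """Apply length filtering, then de-duplication."""
    return deduplicate([ex for ex in examples if keep(ex)])
```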
Models
We studied 8 models for program translation.
Models trained from scratch
- Seq2Seq+Attn. [1Lx512H]
- Transformer [6Lx512H]
Pre-trained models
- CodeGPT
- CodeGPT-adapted
- CodeBERT
- GraphCodeBERT
- PLBART
- TransCoder (unsupervised approach)
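The bracketed sizes above presumably denote layers x hidden dimension. As a point of reference, the sketch below shows a PyTorch encoder-decoder at the "[6Lx512H]" Transformer scale; this is an illustrative assumption, not the repository's implementation, which lives under seq2seq/.

```python
# Illustrative sketch only: a 6-layer, 512-dimensional encoder-decoder at the
# scale denoted by "[6Lx512H]". Positional encodings are omitted for brevity.
import torch.nn as nn


class TranslationTransformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, nlayers=6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=8,
            num_encoder_layers=nlayers,
            num_decoder_layers=nlayers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (batch, src_len), tgt_ids: (batch, tgt_len)
        src = self.src_embed(src_ids)
        tgt = self.tgt_embed(tgt_ids)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.out(hidden)  # (batch, tgt_len, tgt_vocab) logits
```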
Training & Evaluation
To train and evaluate a model, go to the corresponding model directory and execute the run.sh script.
# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2
# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2
# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT
# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2
# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2
# GraphCodeBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2
# PLBART
cd plbart
# fine-tuning either for Java->Python or Python->Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID
# Naive Copy
cd naivecopy
bash run.sh
- Here, LANG1 LANG2 = Java Python or LANG1 LANG2 = Python Java. For example, bash run.sh 0 Java Python would fine-tune a model for Java-to-Python translation on GPU 0.
- Download the pre-trained PLBART, GraphCodeBERT, and TransCoder model files by running the download.sh script.
- We trained the models on GeForce RTX 2080 Ti GPUs (11019 MiB).
Benchmarks
We evaluate the models' performance on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). We report the results below.
Java to Python

Training | Models | CA | BLEU | SM | DM | CB | EM
---|---|---|---|---|---|---|---
None | Naive Copy | - | 23.4 | - | - | - | 0.0
None | TransCoder | 76.9 | 36.8 | 31.0 | 17.1 | 29.1 | 0.1
None | TC-DOBF | 77.7 | 43.4 | 29.7 | 33.9 | 34.8 | 0.0
From Scratch | Seq2Seq+Attn. | 66.5 | 56.3 | 39.1 | 18.4 | 37.9 | 1.0
From Scratch | Transformer | 61.5 | 38.9 | 34.2 | 16.5 | 29.1 | 0.0
Pre-trained | CodeGPT | 47.3 | 38.2 | 32.5 | 11.5 | 26.1 | 1.1
Pre-trained | CodeGPT-adapted | 48.1 | 38.2 | 32.5 | 12.1 | 26.2 | 1.2
Pre-trained | CodeBERT | 62.3 | 59.3 | 37.7 | 16.2 | 36.7 | 0.5
Pre-trained | GraphCodeBERT | 65.7 | 59.7 | 38.9 | 16.4 | 37.1 | 0.7
Pre-trained | PLBARTmono | 76.4 | 67.1 | 42.6 | 19.3 | 43.3 | 2.4
Pre-trained | PLBARTmulti | 70.4 | 67.1 | 42.0 | 17.6 | 42.4 | 2.4

Python to Java

Training | Models | CA | BLEU | SM | DM | CB | EM
---|---|---|---|---|---|---|---
None | Naive Copy | - | 26.9 | - | - | - | 0.0
None | TransCoder | 100 | 49.4 | 37.6 | 18.5 | 31.9 | 0.0
None | TC-DOBF | 100 | 46.1 | 36.0 | 12.6 | 28.8 | 0.0
From Scratch | Seq2Seq+Attn. | 71.8 | 62.7 | 46.6 | 28.5 | 43.0 | 0.8
From Scratch | Transformer | 67.4 | 45.6 | 45.7 | 26.4 | 37.4 | 0.1
Pre-trained | CodeGPT | 71.2 | 44.0 | 38.8 | 26.7 | 33.8 | 0.1
Pre-trained | CodeGPT-adapted | 68.6 | 42.4 | 37.2 | 27.2 | 33.1 | 0.5
Pre-trained | CodeBERT | 74.7 | 55.3 | 38.4 | 22.5 | 36.1 | 0.6
Pre-trained | GraphCodeBERT | 57.2 | 60.6 | 48.4 | 20.6 | 40.1 | 0.4
Pre-trained | PLBARTmono | 34.4 | 69.1 | 57.1 | 34.0 | 51.4 | 1.2
Pre-trained | PLBARTmulti | 30.8 | 69.4 | 56.6 | 34.5 | 51.8 | 1.0
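For orientation, the sketch below shows roughly how the Exact Match and BLEU columns can be computed from detokenized hypothesis/reference strings. It uses sacrebleu as an assumption for BLEU; the repository's own evaluation scripts (including CodeBLEU with its syntax and dataflow match components) should be used to reproduce the numbers above.

```python
# Rough sketch of two of the reported metrics; not the repository's
# evaluation code. sacrebleu is an assumption, not necessarily what was used.
import sacrebleu


def exact_match(hypotheses, references):
    """Percentage of translations identical to their reference (EM)."""
    matches = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return 100.0 * matches / len(references)


def corpus_bleu(hypotheses, references):
    """Corpus-level BLEU over detokenized strings."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


if __name__ == "__main__":
    hyps = ["print ( a + b )"]
    refs = ["print ( a + b )"]
    print(exact_match(hyps, refs), corpus_bleu(hyps, refs))
```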
License
This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.
Citation
@article{ahmad-etal-2021-avatar,
title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
journal={arXiv preprint arXiv:2108.11590},
year={2021}
}