AVATAR
- Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
- AVATAR stands for jAVA-pyThon progrAm tRanslation.
- AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
- Supervised fine-tuning and evaluation in terms of Computational Accuracy; see details here.
Table of Contents
- Dataset
- Models
- Training & Evaluation
- Benchmarks
- License
- Citation
Dataset
We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.
- CodeForces
- AtCoder
- CodeJam
- GeeksforGeeks
- LeetCode
- ProjectEuler
The collected data can be downloaded by running:
cd data
bash download.sh
To prepare the data, we perform the following steps (a minimal sketch of the filtering and de-duplication logic follows the commands below).
- Remove docstrings, comments, etc.
- Tokenize with the baseline models' tokenizers.
- Filter out examples longer than the length threshold (~512 tokens).
- De-duplicate (remove duplicate examples).
To perform the preparation, run:
cd data
bash prepare.sh
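For illustration only, here is a minimal Python sketch of the length-filtering and de-duplication steps listed above. It assumes each example is a dict holding already-tokenized "java" and "python" token lists; the field names and helper functions are hypothetical, and the actual logic lives in prepare.sh and the scripts it invokes.

```python
# Illustrative sketch only -- the real pipeline is driven by data/prepare.sh.
# Assumes each example is a dict with already-tokenized "java" and "python"
# token lists; names and structure here are hypothetical.

MAX_LEN = 512  # approximate length threshold mentioned above


def keep(example):
    """Drop examples whose Java or Python side exceeds the length threshold."""
    return len(example["java"]) <= MAX_LEN and len(example["python"]) <= MAX_LEN


def deduplicate(examples):
    """Remove exact duplicates, keyed on the (Java, Python) token sequences."""
    seen = set()
    unique = []
    for ex in examples:
        key = (" ".join(ex["java"]), " ".join(ex["python"]))
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def prepare(examples):
    """Apply length filtering, then de-duplication."""
    return deduplicate([ex for ex in examples if keep(ex)])
```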
Models
We studied 8 models for program translation.
Models trained from scratch
- Seq2Seq+Attn. [1Lx512H]
- Transformer [6Lx512H]
Pre-trained models
- CodeGPT
- CodeGPT-adapted
- CodeBERT
- GraphCodeBERT
- PLBART
- TransCoder (unsupervised approach)
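The bracketed sizes above presumably denote layers x hidden dimension. As a point of reference, the sketch below shows a PyTorch encoder-decoder at the "[6Lx512H]" Transformer scale; this is an illustrative assumption, not the repository's implementation, which lives under seq2seq/.

```python
# Illustrative sketch only: a 6-layer, 512-dimensional encoder-decoder at the
# scale denoted by "[6Lx512H]". Positional encodings are omitted for brevity.
import torch.nn as nn


class TranslationTransformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, nlayers=6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=8,
            num_encoder_layers=nlayers,
            num_decoder_layers=nlayers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (batch, src_len), tgt_ids: (batch, tgt_len)
        src = self.src_embed(src_ids)
        tgt = self.tgt_embed(tgt_ids)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.out(hidden)  # (batch, tgt_len, tgt_vocab) logits
```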
Training & Evaluation
To train and evaluate a model, go to the corresponding model directory and execute the run.sh script.
# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2
# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2
# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT
# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2
# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2
# GraphCodeBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2
# PLBART
cd plbart
# fine-tuning either for Java->Python or Python->Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID
# Naive Copy
cd naivecopy
bash run.sh
- Here, LANG1 LANG2 = Java Python or LANG1 LANG2 = Python Java. For example, bash run.sh 0 Java Python would fine-tune a model for Java-to-Python translation on GPU 0.
- Download the pre-trained PLBART, GraphCodeBERT, and TransCoder model files by running the download.sh script.
- We trained the models on GeForce RTX 2080 Ti GPUs (11019 MiB).
Benchmarks
We evaluate the models' performance on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). We report the results below.
Java to Python

Training | Models | CA | BLEU | SM | DM | CB | EM
---|---|---|---|---|---|---|---
None | Naive Copy | - | 23.4 | - | - | - | 0.0
None | TransCoder | 76.9 | 36.8 | 31.0 | 17.1 | 29.1 | 0.1
None | TC-DOBF | 77.7 | 43.4 | 29.7 | 33.9 | 34.8 | 0.0
From Scratch | Seq2Seq+Attn. | 66.5 | 56.3 | 39.1 | 18.4 | 37.9 | 1.0
From Scratch | Transformer | 61.5 | 38.9 | 34.2 | 16.5 | 29.1 | 0.0
Pre-trained | CodeGPT | 47.3 | 38.2 | 32.5 | 11.5 | 26.1 | 1.1
Pre-trained | CodeGPT-adapted | 48.1 | 38.2 | 32.5 | 12.1 | 26.2 | 1.2
Pre-trained | CodeBERT | 62.3 | 59.3 | 37.7 | 16.2 | 36.7 | 0.5
Pre-trained | GraphCodeBERT | 65.7 | 59.7 | 38.9 | 16.4 | 37.1 | 0.7
Pre-trained | PLBARTmono | 76.4 | 67.1 | 42.6 | 19.3 | 43.3 | 2.4
Pre-trained | PLBARTmulti | 70.4 | 67.1 | 42.0 | 17.6 | 42.4 | 2.4

Python to Java

Training | Models | CA | BLEU | SM | DM | CB | EM
---|---|---|---|---|---|---|---
None | Naive Copy | - | 26.9 | - | - | - | 0.0
None | TransCoder | 100 | 49.4 | 37.6 | 18.5 | 31.9 | 0.0
None | TC-DOBF | 100 | 46.1 | 36.0 | 12.6 | 28.8 | 0.0
From Scratch | Seq2Seq+Attn. | 71.8 | 62.7 | 46.6 | 28.5 | 43.0 | 0.8
From Scratch | Transformer | 67.4 | 45.6 | 45.7 | 26.4 | 37.4 | 0.1
Pre-trained | CodeGPT | 71.2 | 44.0 | 38.8 | 26.7 | 33.8 | 0.1
Pre-trained | CodeGPT-adapted | 68.6 | 42.4 | 37.2 | 27.2 | 33.1 | 0.5
Pre-trained | CodeBERT | 74.7 | 55.3 | 38.4 | 22.5 | 36.1 | 0.6
Pre-trained | GraphCodeBERT | 57.2 | 60.6 | 48.4 | 20.6 | 40.1 | 0.4
Pre-trained | PLBARTmono | 34.4 | 69.1 | 57.1 | 34.0 | 51.4 | 1.2
Pre-trained | PLBARTmulti | 30.8 | 69.4 | 56.6 | 34.5 | 51.8 | 1.0
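For orientation, the sketch below shows roughly how the Exact Match and BLEU columns can be computed from detokenized hypothesis/reference strings. It uses sacrebleu as an assumption for BLEU; the repository's own evaluation scripts (including CodeBLEU with its syntax and dataflow match components) should be used to reproduce the numbers above.

```python
# Rough sketch of two of the reported metrics; not the repository's
# evaluation code. sacrebleu is an assumption, not necessarily what was used.
import sacrebleu


def exact_match(hypotheses, references):
    """Percentage of translations identical to their reference (EM)."""
    matches = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return 100.0 * matches / len(references)


def corpus_bleu(hypotheses, references):
    """Corpus-level BLEU over detokenized strings."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


if __name__ == "__main__":
    hyps = ["print ( a + b )"]
    refs = ["print ( a + b )"]
    print(exact_match(hyps, refs), corpus_bleu(hyps, refs))
```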
License
This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.
Citation
@article{ahmad-etal-2021-avatar,
title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
journal={arXiv preprint arXiv:2108.11590},
year={2021}
}