Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

AVATAR

  • AVATAR stands for jAVA-pyThon progrAm tRanslation.
  • AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
  • The repository covers supervised fine-tuning and evaluation in terms of Computational Accuracy; see details here.

Dataset

We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.

  • CodeForces
  • AtCoder
  • CodeJam
  • GeeksforGeeks
  • LeetCode
  • ProjectEuler

The collected data can be downloaded by running:

cd data
bash download.sh

To prepare the data, we perform the following steps.

  • Remove docstrings, comments, etc.
  • Tokenize the programs with the baseline models' tokenizers.
  • Filter out examples that exceed the length threshold (~512 tokens).
  • De-duplicate the data (remove examples that are duplicates of others).

To perform the preparation, run:

cd data
bash prepare.sh
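
As a rough illustration of the last two preparation steps (length filtering and de-duplication), here is a minimal sketch. It is not the logic of prepare.sh: the function names are hypothetical, whitespace splitting stands in for the baseline models' tokenizers, and duplicates are detected by comparing whitespace-normalized (Java, Python) pairs, which is only one possible definition of a duplicate.

```python
from typing import Iterable, List, Tuple

MAX_LEN = 512  # approximate length threshold mentioned in the steps above


def count_tokens(code: str) -> int:
    # Assumption: whitespace splitting as a stand-in for a model tokenizer.
    return len(code.split())


def filter_and_dedup(
    pairs: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> List[Tuple[str, str]]:
    """Keep (java, python) pairs within the length limit, dropping duplicates."""
    seen = set()
    kept = []
    for java_src, py_src in pairs:
        # Drop the pair if either side is longer than the threshold.
        if count_tokens(java_src) > max_len or count_tokens(py_src) > max_len:
            continue
        # Normalize whitespace only for the duplicate check (hypothetical choice);
        # the stored programs keep their original formatting.
        key = (" ".join(java_src.split()), " ".join(py_src.split()))
        if key in seen:
            continue
        seen.add(key)
        kept.append((java_src, py_src))
    return kept
```

A real pipeline would count tokens with each model's own tokenizer, since sub-word tokenizers usually produce more tokens than whitespace splitting.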

Models

We studied 8 models for program translation.

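Models trained from scratch

  • Seq2Seq+Attn.
  • Transformer

Pre-trained models

  • CodeGPT
  • CodeGPT-adapted
  • CodeBERT
  • GraphCodeBERT
  • PLBART
  • TransCoder (evaluated without fine-tuning)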

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute its script (run.sh for most models), passing the GPU ID and the source and target languages.

# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2

# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2

# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT

# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2

# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2

# GraphCodeBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2

# PLBART
cd plbart
# fine-tuning for either Java->Python or Python->Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID

# Naive Copy
cd naivecopy
bash run.sh
  • Here, LANG1 LANG2 is either Java Python or Python Java.
  • Download the pre-trained PLBART, GraphCodeBERT, and TransCoder model files by running the download.sh script.
  • We trained the models on GeForce RTX 2080 Ti GPUs (11019MiB).

Benchmarks

We evaluate model performance on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). The results are reported below.

Training     | Models          | Java to Python                      | Python to Java
             |                 |    CA  BLEU    SM    DM    CB    EM |    CA  BLEU    SM    DM    CB    EM
None         | Naive Copy      |     -  23.4     -     -     -   0.0 |     -  26.9     -     -     -   0.0
             | TransCoder      |  76.9  36.8  31.0  17.1  29.1   0.1 |   100  49.4  37.6  18.5  31.9   0.0
             | TC-DOBF         |  77.7  43.4  29.7  33.9  34.8   0.0 |   100  46.1  36.0  12.6  28.8   0.0
From Scratch | Seq2Seq+Attn.   |  66.5  56.3  39.1  18.4  37.9   1.0 |  71.8  62.7  46.6  28.5  43.0   0.8
             | Transformer     |  61.5  38.9  34.2  16.5  29.1   0.0 |  67.4  45.6  45.7  26.4  37.4   0.1
Pre-trained  | CodeGPT         |  47.3  38.2  32.5  11.5  26.1   1.1 |  71.2  44.0  38.8  26.7  33.8   0.1
             | CodeGPT-adapted |  48.1  38.2  32.5  12.1  26.2   1.2 |  68.6  42.4  37.2  27.2  33.1   0.5
             | CodeBERT        |  62.3  59.3  37.7  16.2  36.7   0.5 |  74.7  55.3  38.4  22.5  36.1   0.6
             | GraphCodeBERT   |  65.7  59.7  38.9  16.4  37.1   0.7 |  57.2  60.6  48.4  20.6  40.1   0.4
             | PLBART (mono)   |  76.4  67.1  42.6  19.3  43.3   2.4 |  34.4  69.1  57.1  34.0  51.4   1.2
             | PLBART (multi)  |  70.4  67.1  42.0  17.6  42.4   2.4 |  30.8  69.4  56.6  34.5  51.8   1.0
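
To make two of the reported metrics concrete, the following minimal sketch computes Exact Match and a compilation rate for Python outputs. It is an illustration under assumptions, not the repository's evaluation code: the function names are hypothetical, Exact Match is assumed to count a hit when the hypothesis equals any reference after stripping surrounding whitespace, and compilation is checked with Python's built-in compile(), which only covers translations into Python (Java outputs would need a Java compiler instead).

```python
from typing import List


def exact_match(hypotheses: List[str], references: List[List[str]]) -> float:
    """Percentage of hypotheses identical to at least one of their references."""
    if not hypotheses:
        return 0.0
    hits = 0
    for hyp, refs in zip(hypotheses, references):
        # Assumption: only leading/trailing whitespace is ignored.
        if hyp.strip() in {ref.strip() for ref in refs}:
            hits += 1
    return 100.0 * hits / len(hypotheses)


def python_compile_rate(programs: List[str]) -> float:
    """Percentage of Python programs that the built-in compile() accepts."""
    if not programs:
        return 0.0
    ok = 0
    for src in programs:
        try:
            # Compiles to bytecode without executing the program.
            compile(src, "<translation>", "exec")
            ok += 1
        except (SyntaxError, ValueError):
            pass
    return 100.0 * ok / len(programs)
```

For example, python_compile_rate(["x = 1", "def f(:"]) counts only the first program as compilable.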

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}
Comments
  • Two questions about the AVATAR paper

    Hello,

    Thank you very much for the great work!

    In order to evaluate some of the Code-ML models, I have two questions regarding the AVATAR paper (https://arxiv.org/pdf/2108.11590.pdf):

    (1) In Section 2: "To train models, we chose a maximum of k (we set k to 3 based on validation performances) solutions in each language to form a maximum of k^2 training examples and consider all the accepted solutions as reference translations for validation and testing." I am wondering how exactly these k^2 pairs are selected during training. Assume there are 5 candidates in each language (Java/Python) for each problem, so there are 25 pairs to choose from. With k set to 3, we ought to choose 9 of these 25 pairs, but I don't quite understand how exactly this is done, as there are many ways to choose 9 pairs out of 25. Could you point to where in the code it is done? Also, for validation and testing, does it mean that as long as the translated code matches any one of the Python reference targets (there could be as many as 5), it is counted as correct?

    (2) The caption of Table 2 reads: CA stands for Computational Accuracy. Does CA actually stand for Compilation Accuracy?

    Thanks!

    Wei

    question 
    opened by weidotwisc 6
  • eval_bleu with pretrained gpt model

    Hi @wasiahmad, I'm trying to evaluate a GPT-2 model with your code. I run run.py with microsoft/CodeGPT-small-py as the pretrain_dir parameter and with do_infer. In the eval_bleu script, outputs is equal to model(inputs)[1], i.e. the hidden states of the pretrained GPT: a tuple of 12 elements (n_layers), each consisting of 2 elements of shape [1, 12, 48, 64]. At the line past_hidden = [x[:, i:i + 1].expand(-1, beam_size, -1, -1, -1) for x in outputs], an error occurs: TypeError: tuple indices must be integers or slices, not tuple. The line also implies that each element in outputs should have 5 dimensions. Which corrections should be made in this case?

    question 
    opened by Gugutse 5
  • evaluating TransCoder on AVATAR test set and bug in compile.py?

    Hi @wasiahmad, I am using https://github.com/wasiahmad/AVATAR/blob/main/transcoder/run.sh to evaluate TransCoder on the AVATAR test data (test.java-python.java, test.java-python.java and test.jsonl), 1699 samples. The scores do not match the paper exactly. Could you please tell me if I am missing anything?

    Also, I think the success and error variables are swapped in the line below, in the format function. Because of this, the current output for Python to Java is Success - 1699, Errors - 0 instead of Success - 0, Errors - 1699. In that case the CA is 0 for Python to Java. Please correct me if I am wrong. https://github.com/wasiahmad/AVATAR/blob/7bcabc9c7ff82190e878ed0fd3d93e9eb86fea2e/evaluation/compile.py#L90

    Thanks and Regards, Kunal Pagarey

    bug 
    opened by kunalpagarey 3
  • How did you select the samples in g4g-functions?

    Hi! Sorry to bother you again. I found that there are 5132 geeksforgeeks samples in the whole AVATAR dataset, but only 3411 samples in g4g_functions. How did you select these 3411 samples? Did you filter out the problems in TransCoder-Eval?

    Thank you!

    opened by Veronicium 2
  • Absent new_lines and indentation in python data

    Hi!

    I downloaded the data from AVATAR/data/data.zip and also via the script AVATAR/data/download.sh, and it seems that a lot of Python functions in the dataset are missing newlines and indentation. For example, CodeForces/421/A/solution1.py:

    n, a, b = map(int, input().split())athur = map(int, input().split())alex = map(int, input().split()) total = [1] * n for i in alex:    total[i-1] = 2 print(*total)
    

    or CodeForces/981/A/solution1.py:

    s=input()c=len(s)for i in range(len(s)-1,0,-1):    k=s[0:i+1]    if(k!=k[::-1]):        print(c)        exit()    c-=1if(c==1):    print("0")
    

    According to my simple heuristic calculation, about 50% of the Python functions look like this.

    Is there a way to fix it? Thanks in advance for your help!

    bug 
    opened by nadiinchi 2
  • Unable to reproduce CodeT5 results

    Edit: I was able to reproduce the results using a different batch size / learning rate.

    Hi, I was trying to finetune CodeT5 for java-python translation but cannot reproduce the results. I got BLEU 63.77 / EM 1.88, which is much lower than your reported results: BLEU 67.0 / EM 2.8. The hyper-params are:

    TRAIN_BATCH_SIZE=2 GRAD_ACCUM_STEP=16 LR=5e-5 NUM_TRAIN_EPOCHS=20 tokenizer_path=codet5/bpe; source_length=510 target_length=510 --do_eval_bleu

    All other hyper-params are as default in https://github.com/wasiahmad/AVATAR/blob/main/codet5/run.sh

    Could you give me some suggestions on how to reproduce your results? Thank you!

    opened by Veronicium 1
  • download error of the pretrained ckpts you shared

    Hi! I was trying to download the pretrained CodeT5-base and PLBART models by running download.sh, but I got the awk: fatal: cannot open file './cookie' for reading (No such file or directory) error. I also tried to directly open the download link in my web browser but it shows "404 file not found". Are you using the same pretrained model as in huggingface (i.e., Salesforce/codet5-base)?

    Also, will you release the finetuned ckpts of those models?

    Thank you!

    opened by Veronicium 1
  • Typo Bug: Undefined 'tag' in DFG_java and DFG_csharp

    https://github.com/wasiahmad/AVATAR/blob/bc9257091eb5125376eec6fb305faac1b71f9f5c/evaluation/CodeBLEU/parser/DFG.py#L254

    Hi, I am working on a project that makes use of your DFG_java code in the linked file, and I found a 'referenced before defined' error in this code. There should be another line defining tag=False after flag=False.

    I also found that other versions of DFG_java in this repository have the tag=False line, so I think this is just a typo. (Also, I didn't make use of DFG_csharp, but the DFG_csharp in this file, evaluation/CodeBLEU/parser/DFG.py, seems to be missing the tag initialization as well.)

    Thanks for the good work!

    Best

    opened by ziwenyd 1
  • Preprocessing

    Hi, can you please let me know how you preprocessed your data to run on the Transcoder-DOBF model? A few brief steps would really be helpful.

    Thank you in advance!

    help wanted 
    opened by SwathiDoddaballapurNagaraja 1
  • Could you please help provide the trained model parameters?

    Hi @wasiahmad, I wonder if it would be possible for you to share the trained models? I urgently need to evaluate these models on datasets such as AVATAR and G4G, including the specific code they produce after translation, but my GPU resources are in short supply. Thank you in advance!

    help wanted 
    opened by zxniubihhh 1
Owner
Wasi Ahmad
I am a Ph.D. student in CS at UCLA.