Transformers for variable misuse, function naming and code completion tasks
The official PyTorch implementation of:
- Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
- A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)
The repository also contains code for resplitting the Python150k and JavaScript150k datasets: the new train / val / test splits are made by repository, duplicates are removed, and the redistributable version of Py150k is used.
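As a rough illustration of what splitting by repository with duplicate removal means, here is a minimal sketch. It is not the code from data_utils: the function names, the split fractions, the assumed path layout (data/<owner>/<repo>/...), and the use of exact content hashes for deduplication are all assumptions made for this example.

```python
import hashlib


def repo_key(path, depth=3):
    # Assumption: paths look like "data/<owner>/<repo>/...", so the first
    # three components identify the repository.
    return "/".join(path.split("/")[:depth])


def assign_split(repo, val_frac=0.1, test_frac=0.2):
    # Hash the repository name to a number in [0, 1) so that the assignment
    # is deterministic and every file of a repository gets the same split.
    digest = hashlib.sha256(repo.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def split_by_repo(files):
    """files: dict mapping file path -> source text (hypothetical input format)."""
    splits = {"train": [], "val": [], "test": []}
    seen_hashes = set()
    for path in sorted(files):
        # Simplification: only exact duplicates are dropped here; a
        # near-duplicate index would catch more.
        content_hash = hashlib.sha256(files[path].encode("utf-8")).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)
        splits[assign_split(repo_key(path))].append(path)
    return splits
```

Because the split is chosen per repository rather than per file, no repository contributes files to both the training and the test set, which is the point of resplitting.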
Repository structure
- data_utils: scripts for downloading Python150k and JavaScript150k datasets and obtaining new train / val / test splits (with splitting by repository, removing duplicates and the redistributable version of Py150k)
- vm_fn: code for Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
- cc: code for Code Completion (CC) task (additional preprocessing, models, training, etc.)
See README in each directory for details.
Run
The code was tested on a system with Linux 3.10.0. Experiments were run using a Tesla V100 GPU. Required libraries are listed in requirments.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.
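To check the PyTorch>=1.5 requirement before running anything, a small script along the following lines may help. This is only a sketch: the helper name meets_minimum is made up for this example and is not part of the repository.

```python
import re


def meets_minimum(version, minimum=(1, 5)):
    # Compare only the major and minor components, so local suffixes such as
    # "+cu101" in "1.5.0+cu101" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old"
        print(f"PyTorch {torch.__version__}: {status}")
```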
Running experiments:
- Download and resplit data, see data_utils for details;
- Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
- Run the experiment you are interested in, see vm_fn or cc for details.
Attribution
Parts of this code are based on the following repositories:
- A Transformer-based Approach for Source Code Summarization
- Code Completion by Feeding Trees to Transformers
- A redistributable subset of the ETH Py150 corpus
- Deduplication index for big code datasets
- OpenNMT
- DrQA
Citation
If you find this code useful, please cite our papers:
@misc{chirkova2020empirical,
title={Empirical Study of Transformers for Source Code},
author={Nadezhda Chirkova and Sergey Troshin},
year={2020},
eprint={2010.07987},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@inproceedings{chirkova2020simple,
title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code},
author={Nadezhda Chirkova and Sergey Troshin},
booktitle={North American Chapter of the Association for Computational Linguistics},
year={2021}
}