Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

  • Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
  • A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting Python150k and JavaScript150k datasets (with splitting by repository, removing duplicates and the redistributable version of Py150k).

Repository structure

  • data_utils: scripts for downloading Python150k and JavaScript150k datasets and obtaining new train / val / test splits (with splitting by repository, removing duplicates and the redistributable version of Py150k)
  • vm_fn: code for Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training etc)
  • cc: code for Code Completion (CC) task (additional preprocessing, models, training etc)

See README in each directory for details.

Run

The code was tested on a system running Linux (kernel 3.10.0). Experiments were run on a Tesla V100 GPU. Required libraries are listed in requirements.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.

Running experiments:

  1. Download and resplit data, see data_utils for details;
  2. Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
  3. Run the experiment you are interested in, see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics},
      year={2021}
}
Comments
  • Get stuck at preprocessing CC's JS data

    Hi, it's a very interesting and solid study, and I'm trying to reproduce it. But when I run "bash src/scripts/preprocess_data.sh" in the "cc/main" folder, I get stuck; the process has been hanging at this point for two days (screenshot attached).

    opened by gystar 4
  • bug in generating data

    I think line 112 (in function separate_dps) of cc/main/src/utils/utils.py should be

            aug_asts.append([ast[i : i + max_len], i + half_len])
    

    instead of

            aug_asts.append([ast[i : i + max_len], half_len])
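
    For readers without the code at hand: judging from the snippet above, separate_dps splits a long AST token sequence into overlapping windows of max_len tokens and stores, next to each window, a value telling downstream code how much of the window is context already covered by the previous window. The sketch below only illustrates that windowing pattern under these assumptions and is not the repository's code; it records the relative overlap, i.e. the value the report above argues should instead be the absolute position i + half_len.

        def split_into_windows(seq, max_len):
            # Split `seq` into overlapping windows of `max_len` tokens.
            # Each entry is (window, n_context), where n_context counts the leading
            # tokens already covered by the previous window. Illustrative sketch only.
            half_len = max_len // 2
            if len(seq) <= max_len:
                return [(seq, 0)]
            windows = [(seq[:max_len], 0)]
            last_end = max_len
            for start in range(half_len, len(seq) - max_len, half_len):
                windows.append((seq[start:start + max_len], half_len))
                last_end = start + max_len
            # final window covers the tail; everything before last_end is overlap
            tail_start = len(seq) - max_len
            windows.append((seq[tail_start:], last_end - tail_start))
            return windows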
    
    opened by jxzhn 4
  • Can the current token see the next token in the encoder for code completion?

    Hello, I have not found a mask matrix in the encoder for code completion, so does every node of the AST participate in the self-attention computation?
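
    The question is whether self-attention in the code completion model is restricted to earlier positions. Below is a generic sketch of how a causal (autoregressive) mask is typically applied in scaled dot-product attention in PyTorch; it only illustrates the mechanism being asked about and does not claim this is how the repository implements its encoder.

        import torch

        def causal_self_attention(q, k, v):
            # q, k, v: (batch, seq_len, dim); position i may only attend to j <= i.
            seq_len, dim = q.size(1), q.size(-1)
            scores = q @ k.transpose(-2, -1) / dim ** 0.5             # (batch, seq, seq)
            future = torch.triu(torch.ones(seq_len, seq_len, device=q.device),
                                diagonal=1).bool()
            scores = scores.masked_fill(future, float('-inf'))        # hide future tokens
            return torch.softmax(scores, dim=-1) @ v

        # Without the masked_fill line, every position (every AST node) attends to
        # all others, which is exactly the behaviour the question asks about.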

    opened by mf1832146 3
  • Hi, where can I find the supplementary materials PDF?

    Thanks. Another question: the default setting of eval_part for run_all.py is 'test', so does that mean the test set is used to evaluate checkpoints during model training? That seems a bit weird. Best wishes.

    opened by AwdHanPeng 2
  • OOV handle

    Greetings! Congrats on the great work! Is the anonymize operation applied only to user-defined variable names? Are API names anonymized as well? If so, will the semantics be changed?

    opened by yingweima2022 1
  • OOV problem

    Hello, your work is very good. I'd like to ask where the code for the paper "A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code" is. Thank you!

    We also suggest adding flag --anonymize order to the generated command for the VM task, in order to use our anonymization of the out-of-vocabulary identifiers. This simple technique will increase the test quality by several percent.

    But I do not find --anonymize in run_all.py.
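
    For context, the anonymization referred to above replaces identifiers that fall outside the vocabulary with per-example placeholder tokens, so repeated occurrences of the same unknown name map to the same placeholder. The sketch below is a minimal illustration of that idea under these assumptions; the function name, placeholder format, and the question of whether library API names are also anonymized are not taken from the repository.

        def anonymize_oov_identifiers(tokens, vocab, max_placeholders=100):
            # Replace out-of-vocabulary identifiers with consistent per-example
            # placeholders (<var1>, <var2>, ...); in-vocabulary tokens are kept.
            # Illustrative sketch, not the repository's implementation.
            mapping, result = {}, []
            for tok in tokens:
                if tok in vocab:
                    result.append(tok)
                elif tok in mapping:
                    result.append(mapping[tok])
                elif len(mapping) < max_placeholders:
                    mapping[tok] = f"<var{len(mapping) + 1}>"
                    result.append(mapping[tok])
                else:
                    result.append("<unk>")
            return result

        # e.g. anonymize_oov_identifiers(["my_counter", "=", "my_counter", "+", "1"],
        #                                vocab={"=", "+", "1"})
        # -> ["<var1>", "=", "<var1>", "+", "1"]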

    opened by yingweima2022 1
  • Questions about calculating MRR and predicting the code block

    From my understanding, mrr function uses the variable where to filter out tokens in memory (e.g., the first half tokens for intermediate blocks as explained in #4 ) , unknown tokens, and padding tokens. Then _mrr function sorts the top-10 predicted tokens for each position and uses the indexes of correctly predicted tokens of each position to calculate MRR.

    My question is: are all positions in this code block predicted at the same time, or are they all predicted when calculating MRR (i.e., in the _step() function of the LightningModel class)? Let's say this code block is [range(250, 750), 250]; then range(250, 500) is the memory, and range(500, 750) is to be predicted. So when we calculate the MRR of this block, is it over all the tokens in range(500, 750)?

    How does it complete the code block? Does it predict the 501st token based on range(250, 500) and then the 502nd token based on range(250, 501)? Could you explain more and point out the related code?

    Besides, how can I use a trained model for inference and map predicted indices back to vocabulary tokens? Here is the script I wrote; I'm not sure about it. Please let me know what you think. Thanks for your help!

    ...
    # imports assumed for this snippet
    import torch
    from torch.utils.data import DataLoader

    args.load = 'path to the trained model.ckpt'
    
    model = LightningModel(args)
    model.eval()
    
    # use eval_dataset for example
    eval_dataset = model.eval_dataset
    eval_dataloader = DataLoader(eval_dataset, 
        num_workers=args.num_workers, collate_fn=model.collate_fn, drop_last=False, pin_memory=True, shuffle=False)
    
    # get the vocab of tensor idx
    def idx2vocab_func(batch, idx2vocab): # assuming batch = 1
        vocab_len = len(idx2vocab)
        batch = batch.view(-1).tolist()
        input_v = list()
        for i in batch:
            if i >= vocab_len:
                input_v.append('UNK')
            else:
                input_v.append(idx2vocab[i])
        return input_v
    
    def get_pred(batch):
        y = batch['input_seq']['values']
        y_pred_types, y_pred_values = model(batch['input_seq'], rel=batch['rel_mask'], positions=batch['positions'])
        
        # Build a boolean mask that keeps only the positions this window is
        # responsible for (indices at or past the "extended" context length),
        # as described in the question above.
        ext = batch['extended'].unsqueeze(-1).repeat(1, y.size(-1))
        ext_ids = torch.arange(y.size(-1), device=ext.device).view(1, -1).repeat(*(y.size()[:-1]+(1,)))
        where = ext_ids >= ext
        where = where.view(-1)
    
        y_pred_values = y_pred_values.view(-1, y_pred_values.size(-1))[where]
    
        _, y_pred_values = torch.topk(y_pred_values, k=1, dim=-1) # choose the top1 predicted token
        return y_pred_values.view(-1)
    
    with torch.no_grad():
        for i, sample in enumerate(eval_dataloader):
            print('---------------Input----------------')
            input_v = sample['input_seq']['values']
            print(input_v.shape)
            print('Vocab:', ' '.join(idx2vocab_func(input_v, idx2vocab_value)))
    
            print('---------------Target----------------')
            target_v = sample['target_seq']['values']
            print(target_v.shape)
            print('Vocab:', ' '.join(idx2vocab_func(target_v, idx2vocab_value)))
    
            print('---------------Predicted----------------')
            pred_v = get_pred(sample)
            print(pred_v.shape)
            print('Vocab:', ' '.join(idx2vocab_func(pred_v, idx2vocab_value)))
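
    As a side note on the MRR computation described above, here is a generic sketch of mean reciprocal rank over the top-10 predictions at the positions selected by a boolean mask (the role the where variable plays). It mirrors the textual description rather than the repository's mrr/_mrr functions.

        import torch

        def mrr_at_k(logits, targets, keep_mask, k=10):
            # logits: (N, vocab_size), targets: (N,), keep_mask: (N,) bool.
            # Reciprocal rank of the target among the top-k predictions,
            # averaged over the kept positions only. Illustrative sketch.
            topk = logits.topk(k, dim=-1).indices                  # (N, k)
            hits = topk == targets.unsqueeze(-1)                   # (N, k)
            ranks = hits.float().argmax(dim=-1) + 1                # 1-based rank of the hit
            rr = torch.where(hits.any(dim=-1), 1.0 / ranks,
                             torch.zeros_like(ranks, dtype=torch.float))
            rr = rr[keep_mask]
            return rr.mean().item() if rr.numel() > 0 else 0.0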
    
    opened by Zero0one1 1
  • Instructions for reproducing the experiments from the OOV anonymization paper

    Greetings! Congrats on the great work! My question is: when will you post the instructions for reproducing the experiments from the OOV anonymization paper? I believe that would be of great help to those who follow this work. It is a very interesting work, and I think it deserves more attention. Thanks in advance!

    opened by aboutzack 1
Owner
Bayesian Methods Research Group