Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

  • Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
  • A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting Python150k and JavaScript150k datasets (with splitting by repository, removing duplicates and the redistributable version of Py150k).

Repository structure

  • data_utils: scripts for downloading Python150k and JavaScript150k datasets and obtaining new train / val / test splits (with splitting by repository, removing duplicates and the redistributable version of Py150k)
  • vm_fn: code for Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training etc)
  • cc: code for Code Completion (CC) task (additional preprocessing, models, training etc)

See README in each directory for details.

Run

The code was tested on a system running Linux (kernel 3.10.0). Experiments were run on a Tesla V100 GPU. Required libraries are listed in requirements.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.

Running experiments:

  1. Download and resplit data, see data_utils for details;
  2. Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
  3. Run the experiment you are interested in, see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics},
      year={2021}
}
Comments
  • Get stuck at preprocessing CC's JS data

    Hi, it's a very interesting and solid study, and I'm trying to reproduce it. But when I run "bash src/scripts/preprocess_data.sh" in the "cc/main" folder, I get stuck; the process has been hanging at this point for two days (screenshot attached).

    opened by gystar 4
  • bug in generating data

    I think line 112 (in function separate_dps) of cc/main/src/utils/utils.py should be

            aug_asts.append([ast[i : i + max_len], i + half_len])
    

    instead of

            aug_asts.append([ast[i : i + max_len], half_len])
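
    For readers without the code at hand: judging from the snippet above, separate_dps splits a long AST token sequence into overlapping windows of max_len tokens and stores, next to each window, a value telling downstream code how much of the window is context already covered by the previous window. The sketch below only illustrates that windowing pattern under these assumptions and is not the repository's code; it records the relative overlap, i.e. the value the report above argues should instead be the absolute position i + half_len.

        def split_into_windows(seq, max_len):
            # Split `seq` into overlapping windows of `max_len` tokens.
            # Each entry is (window, n_context), where n_context counts the leading
            # tokens already covered by the previous window. Illustrative sketch only.
            half_len = max_len // 2
            if len(seq) <= max_len:
                return [(seq, 0)]
            windows = [(seq[:max_len], 0)]
            last_end = max_len
            for start in range(half_len, len(seq) - max_len, half_len):
                windows.append((seq[start:start + max_len], half_len))
                last_end = start + max_len
            # final window covers the tail; everything before last_end is overlap
            tail_start = len(seq) - max_len
            windows.append((seq[tail_start:], last_end - tail_start))
            return windows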
    
    opened by jxzhn 4
  • Can the current token see the next token in the encoder for code completion?

    Hello, I have not found a mask matrix in the encoder for code completion, so does every node of the AST participate in the self-attention computation?
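
    The question is whether self-attention in the code completion model is restricted to earlier positions. Below is a generic sketch of how a causal (autoregressive) mask is typically applied in scaled dot-product attention in PyTorch; it only illustrates the mechanism being asked about and does not claim this is how the repository implements its encoder.

        import torch

        def causal_self_attention(q, k, v):
            # q, k, v: (batch, seq_len, dim); position i may only attend to j <= i.
            seq_len, dim = q.size(1), q.size(-1)
            scores = q @ k.transpose(-2, -1) / dim ** 0.5             # (batch, seq, seq)
            future = torch.triu(torch.ones(seq_len, seq_len, device=q.device),
                                diagonal=1).bool()
            scores = scores.masked_fill(future, float('-inf'))        # hide future tokens
            return torch.softmax(scores, dim=-1) @ v

        # Without the masked_fill line, every position (every AST node) attends to
        # all others, which is exactly the behaviour the question asks about.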

    opened by mf1832146 3
  • Hi, where can I find the supplementary materials PDF?

    Thanks. Another question: the default setting of eval_part for run_all.py is 'test', so does that mean the test set is used to evaluate checkpoints during model training? That seems a bit weird. Best wishes.

    opened by AwdHanPeng 2
  • OOV handle

    Greetings! Congrats on the great work! Is the anonymize operation applied only to user-defined variable names? Are API names anonymized as well? If so, will the semantics be changed?

    opened by yingweima2022 1
  • OOV problem

    Hello, your work is very good. I'd like to ask where the code for the paper "A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code" is. Thank you!

    We also suggest adding flag --anonymize order to the generated command for the VM task, in order to use our anonymization of the out-of-vocabulary identifiers. This simple technique will increase the test quality by several percent.

    But I do not find --anonymize in run_all.py.
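
    For context, the anonymization referred to above replaces identifiers that fall outside the vocabulary with per-example placeholder tokens, so repeated occurrences of the same unknown name map to the same placeholder. The sketch below is a minimal illustration of that idea under these assumptions; the function name, placeholder format, and the question of whether library API names are also anonymized are not taken from the repository.

        def anonymize_oov_identifiers(tokens, vocab, max_placeholders=100):
            # Replace out-of-vocabulary identifiers with consistent per-example
            # placeholders (<var1>, <var2>, ...); in-vocabulary tokens are kept.
            # Illustrative sketch, not the repository's implementation.
            mapping, result = {}, []
            for tok in tokens:
                if tok in vocab:
                    result.append(tok)
                elif tok in mapping:
                    result.append(mapping[tok])
                elif len(mapping) < max_placeholders:
                    mapping[tok] = f"<var{len(mapping) + 1}>"
                    result.append(mapping[tok])
                else:
                    result.append("<unk>")
            return result

        # e.g. anonymize_oov_identifiers(["my_counter", "=", "my_counter", "+", "1"],
        #                                vocab={"=", "+", "1"})
        # -> ["<var1>", "=", "<var1>", "+", "1"]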

    opened by yingweima2022 1
  • Questions about calculating MRR and predicting the code block

    From my understanding, mrr function uses the variable where to filter out tokens in memory (e.g., the first half tokens for intermediate blocks as explained in #4 ) , unknown tokens, and padding tokens. Then _mrr function sorts the top-10 predicted tokens for each position and uses the indexes of correctly predicted tokens of each position to calculate MRR.

    My question is: are all positions in this code block predicted at the same time, or are they all predicted when calculating MRR (i.e., in the _step() function of the LightningModel class)? Let's say this code block is [range(250, 750), 250]; then range(250, 500) is the memory, and range(500, 750) is to be predicted. So when we calculate the MRR of this block, is it over all the tokens in range(500, 750)?

    How does it complete the code block? Does it predict the 501st token based on range(250, 500) and then the 502nd token based on range(250, 501)? Could you explain more and point out the related code?

    Besides, how can I use a trained model for inference and map predicted indices back to vocabulary tokens? Here is the script I wrote; I'm not sure about it. Please let me know what you think. Thanks for your help!

    ...
    # imports assumed for this snippet
    import torch
    from torch.utils.data import DataLoader

    args.load = 'path to the trained model.ckpt'
    
    model = LightningModel(args)
    model.eval()
    
    # use eval_dataset for example
    eval_dataset = model.eval_dataset
    eval_dataloader = DataLoader(eval_dataset, 
        num_workers=args.num_workers, collate_fn=model.collate_fn, drop_last=False, pin_memory=True, shuffle=False)
    
    # get the vocab of tensor idx
    def idx2vocab_func(batch, idx2vocab): # assuming batch = 1
        vocab_len = len(idx2vocab)
        batch = batch.view(-1).tolist()
        input_v = list()
        for i in batch:
            if i >= vocab_len:
                input_v.append('UNK')
            else:
                input_v.append(idx2vocab[i])
        return input_v
    
    def get_pred(batch):
        y = batch['input_seq']['values']
        y_pred_types, y_pred_values = model(batch['input_seq'], rel=batch['rel_mask'], positions=batch['positions'])
        
        # Build a boolean mask that keeps only the positions this window is
        # responsible for (indices at or past the "extended" context length),
        # as described in the question above.
        ext = batch['extended'].unsqueeze(-1).repeat(1, y.size(-1))
        ext_ids = torch.arange(y.size(-1), device=ext.device).view(1, -1).repeat(*(y.size()[:-1]+(1,)))
        where = ext_ids >= ext
        where = where.view(-1)
    
        y_pred_values = y_pred_values.view(-1, y_pred_values.size(-1))[where]
    
        _, y_pred_values = torch.topk(y_pred_values, k=1, dim=-1) # choose the top1 predicted token
        return y_pred_values.view(-1)
    
    with torch.no_grad():
        for i, sample in enumerate(eval_dataloader):
            print('---------------Input----------------')
            input_v = sample['input_seq']['values']
            print(input_v.shape)
            print('Vocab:', ' '.join(idx2vocab_func(input_v, idx2vocab_value)))
    
            print('---------------Target----------------')
            target_v = sample['target_seq']['values']
            print(target_v.shape)
            print('Vocab:', ' '.join(idx2vocab_func(target_v, idx2vocab_value)))
    
            print('---------------Predicted----------------')
            pred_v = get_pred(sample)
            print(pred_v.shape)
            print('Vocab:', ' '.join(idx2vocab_func(pred_v, idx2vocab_value)))
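
    As a side note on the MRR computation described above, here is a generic sketch of mean reciprocal rank over the top-10 predictions at the positions selected by a boolean mask (the role the where variable plays). It mirrors the textual description rather than the repository's mrr/_mrr functions.

        import torch

        def mrr_at_k(logits, targets, keep_mask, k=10):
            # logits: (N, vocab_size), targets: (N,), keep_mask: (N,) bool.
            # Reciprocal rank of the target among the top-k predictions,
            # averaged over the kept positions only. Illustrative sketch.
            topk = logits.topk(k, dim=-1).indices                  # (N, k)
            hits = topk == targets.unsqueeze(-1)                   # (N, k)
            ranks = hits.float().argmax(dim=-1) + 1                # 1-based rank of the hit
            rr = torch.where(hits.any(dim=-1), 1.0 / ranks,
                             torch.zeros_like(ranks, dtype=torch.float))
            rr = rr[keep_mask]
            return rr.mean().item() if rr.numel() > 0 else 0.0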
    
    opened by Zero0one1 1
  • Instructions for reproducing the experiments from the OOV anonymization paper

    Greetings! Congrats on the great work! My question is: when will you post the instructions for reproducing the experiments from the OOV anonymization paper? I believe that would be of great help to those who follow this work. It is a very interesting work, and I think it deserves more attention. Thanks in advance!

    opened by aboutzack 1
Owner
Bayesian Methods Research Group