Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

Overview

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi

CodeT5 demo

Updates

Oct 29, 2021

We release fine-tuned checkpoints for all the downstream tasks covered in the paper.

Oct 25, 2021

We release a CodeT5-base fine-tuned checkpoint (Salesforce/codet5-base-multi-sum) for multilingual code summarization. Below is how to use this model:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

if __name__ == '__main__':
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
        painter = QtGui.QPainter(image)
        renderer.render(painter)
    return image"""

    input_ids = tokenizer(text, return_tensors="pt").input_ids

    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    # this prints: "Convert a SVG string to a QImage."

Oct 18, 2021

We add a model card for CodeT5! Please reach out if you have any questions about it.

Sep 24, 2021

CodeT5 is now on Hugging Face!

You can simply load the model (CodeT5-small and CodeT5-base) and do the inference:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"

Introduction

This repo provides the code for reproducing the experiments in CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. CodeT5 is a new pre-trained encoder-decoder model for programming languages, pre-trained on 8.35M functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on 14 sub-tasks of the CodeXGLUE code intelligence benchmark.

Paper link: https://arxiv.org/abs/2109.00859

Blog link: https://blog.einstein.ai/codet5/

The code currently includes two pre-trained checkpoints (CodeT5-small and CodeT5-base) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE. We also provide the fine-tuned checkpoints to facilitate easy replication of our paper.

In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:

  • Text-to-code generation: generate code based on a natural language description.
  • Code autocompletion: complete a whole function given the target function name.
  • Code summarization: generate a natural language summary of a function.

Table of Contents

  1. Citation
  2. License
  3. Dependency
  4. Download
  5. Fine-tuning
  6. Get Involved

Citation

If you find this code to be useful for your research, please consider citing.

@inproceedings{wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
    author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
    year={2021},
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

  • violence, hate, and division,
  • environmental destruction,
  • abuse of human rights, or
  • the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing [email protected], and to use appropriate documentation when developing high-stakes applications of this model.

Dependency

  • PyTorch 1.7.1
  • tensorboard 2.4.1
  • transformers 4.6.1
  • tree-sitter 0.2.2

Download

Instructions to download:

# pip install gsutil
cd your-cloned-codet5-path

gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
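
If you only want to run inference with one of the downloaded fine-tuned generation checkpoints, below is a minimal sketch. It assumes the file is a plain PyTorch state dict for T5ForConditionalGeneration; the path under finetuned_models/ is hypothetical, so adjust it to whatever you actually downloaded.

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

# hypothetical checkpoint path; pick the file matching your task from finetuned_models/
state_dict = torch.load('finetuned_models/summarize_python_codet5_base.bin', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()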

Fine-tuning

Go to the sh folder and set WORKDIR in exp_with_args.sh to your cloned CodeT5 repository path.

You can use run_exp.py to run a broad set of experiments by simply passing the model_tag, task, and sub_task arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use sub_task to specify which specific dataset to fine-tune on. Below is the full list:

--task      --sub_task                            Description
summarize   ruby/javascript/go/python/java/php    code summarization task on CodeSearchNet data with six PLs
concode     none                                  text-to-code generation on Concode data
translate   java-cs/cs-java                       code-to-code translation between Java and C#
refine      small/medium                          code refinement on code repair data with small/medium functions
defect      none                                  code defect detection in C/C++ data
clone       none                                  code clone detection in Java data

For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:

python run_exp.py --model_tag codet5_base --task summarize --sub_task python

In addition, you can specify the following arguments (see the example command after this list):

  • model_dir: where to save fine-tuning checkpoints
  • res_dir: where to save the performance results
  • summary_dir: where to save the training curves
  • data_num: how many data instances to use; the default -1 uses the full data
  • gpu: the index of the GPU to use in the cluster
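
For example, a variant of the command above (the flag names are those listed above; the values are purely illustrative) that fine-tunes on only 100 Python summarization instances using GPU 0:

python run_exp.py --model_tag codet5_base --task summarize --sub_task python --data_num 100 --gpu 0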

You can also revise the suggested arguments here or directly customize the exp_with_args.sh bash file. Please refer to the argument flags in configs.py for the full list of available options. The saved training curves in summary_dir can be visualized with tensorboard. Note that we employ one A100 GPU for all fine-tuning experiments.
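
To inspect the curves, a typical invocation is to point tensorboard at that directory, e.g.:

tensorboard --logdir <your summary_dir>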

How to fine-tune on your own task and dataset?

If you want to fine-tune on your own dataset, you can add your task and sub_task in configs.py (here) and add your data path and the function to read it in utils.py (here and here). The read function can be implemented in _utils.py similar to this one. If the task you add is a generation task, you can simply reuse or customize run_gen.py. For understanding tasks, please refer to run_defect.py and run_clone.py.
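
For concreteness, below is a minimal sketch of what such a read function could look like. The JSONL field names ('code', 'summary') and the Example container are illustrative assumptions; adapt them to your own data format and to the existing read functions in _utils.py.

import json

class Example(object):
    """Simple container mirroring the (idx, source, target) layout of the existing read functions."""
    def __init__(self, idx, source, target):
        self.idx = idx
        self.source = source
        self.target = target

def read_my_task_examples(filename, data_num):
    """Read a JSONL file where each line holds a 'code' field and a 'summary' field (assumed names)."""
    examples = []
    with open(filename, encoding='utf-8') as f:
        for idx, line in enumerate(f):
            js = json.loads(line.strip())
            examples.append(Example(idx=idx,
                                    source=' '.join(js['code'].split()),
                                    target=' '.join(js['summary'].split())))
            if idx + 1 == data_num:  # data_num == -1 means "use the full data"
                break
    return examples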

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

Comments
  • 'tuple' object has no attribute 'loss'

    Hi, I want to run CodeT5-base on the code generation task. I ran the command: python run_exp.py --model_tag codet5_base --task concode --sub_task none

    There is an error: 'tuple' object has no attribute 'loss'.

    I tried to change outputs = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask) to outputs, _ = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask)

    There is an error: too many values to unpack (expected 2)

    What should I do?

    opened by skye95git 15
  • Does the released pre-trained model include the dual generation pre-training?

    Dear authors,

    I noticed in the paper that you pre-train the T5 model with identifier-aware denoising for 100 epochs and further pre-train with bimodal generation for 50 epochs. I was wondering whether the released model includes only the first 100 epochs or the whole 150 epochs?

    Thanks in advance for your clarification

    opened by Robin-Y-Ding 4
  • Inference for java code summarization

    Is it possible to do code summarization on raw Java code?

    I can't find an example of inference for code summarization. Could you please provide one? E.g., I expect something like the following code:

    from transformers import RobertaTokenizer,  WHICH_MODELTO_USE
    
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = WHICH_MODELTO_USE.from_pretrained('Salesforce/codet5-base')
    
    java_code = 'int i = 0; ++i;  int b = runSomeFunction(i); extract(b);'
    code_summarization = model.predict(java_code)
    print(code_summarization)
    

    The expected result is the following: 'Extracts and returns max value'

    Is it possible to make such a prediction? The problem is that I can't understand how you translate the code into the vector that will be used to predict the summarization without the pre-training procedures.

    Could you please provide an example?

    opened by lyriccoder 4
  • How to get embedding for javascript and python code snippet?

    I have a couple of questions:

    a) How can I use CodeT5 to extract embeddings for JavaScript and Python code?
    b) Can I feed an incomplete JavaScript or Python code snippet to extract an embedding, or does the code snippet need to be complete?
    c) Has anyone used CodeT5 to perform code-to-code search?

    opened by smith-co 3
  • Fine-tuned checkpoints -> Code clone detection

    Hi,

    I am hoping to reproduce the results on the code clone detection task. This might seem like a silly question, but the released fine-tuned checkpoints don't include the RobertaClassificationHead parameters, right? I am able to load only the T5ForConditionalGeneration model using the provided checkpoints for the task.

    So, how do I go about loading the entire CloneModel?

    opened by Chinmayrane16 3
  • Can the CodeT5 model do code autocompletion without fine-tuning?

    The readme mentions that this is used for code autocompletion in VS Code. I wonder how to use CodeT5 without fine-tuning, as a language model, to complete code given some previous context?

    opened by frankxu2004 3
  • Regarding Code Generation task.

    Can I use it for code generation? For example, if I give a query like "Add two numbers", it should generate the code for that. If yes, can you please suggest how I can prepare the dataset for this task, or can I use the dataset you mentioned?

    Thank you

    opened by BakingBrains 3
  • Pretrained model for prediction

    Can you kindly elaborate on how we can use the fine-tuned checkpoints to generate predictions on new data in the concode task? Say this is my prediction data: {"code": "public integer sum(Integer arg0,Integer arg1) {return result;}", "nl": "Add two integers. concode_field_sep int sum concode_field_sep int result"}

    If I understand correctly, concode is supposed to complete these functions. However, I am not sure how to generate a prediction on this sample data. I tried replacing the test file containing the original test data with this sample test data and then ran this command: python run_exp.py --model_tag codet5_small --task concode --sub_task none

    This command starts with training, then evaluation, and finally testing. However, I am interested only in prediction. Isn't there any way to directly generate predictions from the fine-tuned concode model? Kindly let me know if I am doing something wrong.

    opened by surtantheta 3
  • About the AI coding assistant demo

    Hi, the newly added AI coding assistant demo is cool! I have a few questions about it:

    1. Did you make the CodeT5 model into a VS Code plugin? How did you do that?

    2. When demonstrating code generation, editing the comment generates the corresponding code snippet. Aren't the inputs to the code generation model the natural language description and the class environment?

    The format of the data in the Concode dataset is

    {
        "code": "int function ( double [ ] arg0 , double [ ] arg1 ) { int loc0 = arg0 . length - arg1 . length ; outer : for ( int loc1 = 0 ; loc1 <= loc0 ; loc1 ++ ) { for ( int loc2 = 0 ; loc2 < arg1 . length ; loc2 ++ ) { if ( ne ( arg0 [ loc1 + loc2 ] , arg1 [ loc2 ] ) ) { continue outer ; } } return ( loc1 ) ; } return ( - 1 ) ; }",
        "nl": "searches for the first subsequence of a that matches sub elementwise . elements of sub are considered to match elements of a if they pass the #eq test . concode_field_sep double max_ratio concode_elem_sep double min_ratio concode_elem_sep boolean off concode_field_sep boolean isElemMatch concode_elem_sep int compare concode_elem_sep boolean isSubset concode_elem_sep boolean ne concode_elem_sep boolean lt concode_elem_sep boolean gte concode_elem_sep void set_rel_diff concode_elem_sep boolean eq concode_elem_sep boolean lte concode_elem_sep boolean gt"
    }
    

    Is it possible to generate an accurate code snippet by typing only comments, without the class environment? Doesn't the loss of context information affect the quality of the generated code?

    opened by skye95git 3
  • Compilable code

    I've fine-tuned CodeT5-large on a small Python dataset (~1700 data points). I see that the results are more or less correct, but the code is not always compilable (due to inconsistent spacing and newline characters). Any idea on fixing this? And how does CodeBLEU work if the code generated by the model isn't compilable? The model might generate non-compilable code during the initial phases of training, right?

    opened by Debdeep1998 2
  • Minimal snips to run CodeT5 on each task

    Hello, I have spent quite some time trying to run the fine-tuned model you kindly provided on the clone detection task, with no luck. Could you provide minimal Python scripts to load and run the model on each of the downstream tasks, as was done for NTP and summarization on 🤗?

    Thanks a lot!

    opened by rocco-fortuna 2
  • Questions about data preprocessing

    Hi.

    My two colleagues and I are interested in replicating the results of CodeT5-base on the code generation task with our own dataset. However, we're having a few hiccups preprocessing the data, and we hope you don't mind a few questions.

    Mainly, we're wondering how you dealt with the Google BigQuery data alongside CodeSearchNet. From what we know, the CodeSearchNet data consists of nicely isolated function-level code blocks, while the BigQuery data and the other extra C/C# data from open-source GitHub repositories, as far as we can guess, aren't presented in such a convenient manner.

    Our own data is in a similar position: none of the code is isolated into function-level blocks; rather, each sample is mostly a complete file by itself. We were wondering if you did any preprocessing of your own so that the extra C/C# data would match CodeSearchNet, or if you just used it raw.

    And if you did use it raw, did it affect the performance compared to when the model was trained only with CodeSearchNet data? Thank you in advance.

    P.S. My colleagues are also wondering how you dealt with whitespace, arguing that the paper wasn't very clear on that. One argues that you discarded whitespace altogether, while the other argues that you collapsed runs of whitespace into a single instance, e.g., A. '\s\s\s' --> '' B. '\s\s\s' --> '\s'.

    opened by qkim2525 0