Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

Overview

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi

CodeT5 demo

Updates

Sep 24, 2021

CodeT5 is now on Hugging Face!

You can simply load the models (CodeT5-small and CodeT5-base) and run inference:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"

Introduction

This repo provides the code for reproducing the experiments in CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. CodeT5 is a new pre-trained encoder-decoder model for programming languages, pre-trained on 8.35M functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on 14 sub-tasks of the CodeXGLUE code intelligence benchmark.

Paper link: https://arxiv.org/abs/2109.00859

Blog link: https://blog.einstein.ai/codet5/

The code currently includes two pre-trained checkpoints (CodeT5-small and CodeT5-base) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE.

In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we built an AI coding assistant demo using CodeT5 as a VS Code plugin that provides three capabilities for Apex developers:

  • Text-to-code generation: generate code based on a natural language description.
  • Code autocompletion: complete an entire function given the target function name.
  • Code summarization: generate a natural language summary of a function.

Table of Contents

  1. Citation
  2. License
  3. Dependency
  4. Download
  5. Fine-tuning
  6. Get Involved

Citation

If you find this code to be useful for your research, please consider citing:

@inproceedings{wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
    author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
    year={2021},
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

  • violence, hate, and division,
  • environmental destruction,
  • abuse of human rights, or
  • the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing [email protected], and to use appropriate documentation when developing high-stakes applications of this model.

Dependency

  • PyTorch 1.7.1
  • tensorboard 2.4.1
  • transformers 4.6.1
  • tree-sitter 0.2.2
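
For convenience, the pinned versions above can be installed with pip; a minimal sketch (assuming a Python 3 environment, and noting that you may prefer the PyTorch build matching your CUDA version):

pip install torch==1.7.1 tensorboard==2.4.1 transformers==4.6.1 tree-sitter==0.2.2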

Download

Instructions to download:

pip install gsutil

gsutil -m cp -r "gs://sfr-codet5-data-research/data/" .

mkdir pretrained_models; cd pretrained_models
gsutil -m cp -r \
  "gs://sfr-codet5-data-research/pretrained_models/codet5_small" \
  "gs://sfr-codet5-data-research/pretrained_models/codet5_base" \
  .

The repository structure will look like the following after the download:

├── CODE_OF_CONDUCT.md
├── README.md
├── SECURITY.md
├── codet5.gif
├── configs.py
├── models.py
├── run_clone.py
├── run_gen.py
├── utils.py
├── _utils.py
├── LICENSE.txt
├── data
│   ├── clone
│   ├── concode
│   ├── defect
│   ├── refine
│   │   ├── medium
│   │   └── small
│   ├── summarize
│   │   ├── go
│   │   ├── java
│   │   ├── javascript
│   │   ├── php
│   │   ├── python
│   │   └── ruby
│   └── translate
├── evaluator
│   ├── bleu.py
│   ├── smooth_bleu.py
│   └── CodeBLEU
├── pretrained_models
│   ├── codet5_base
│   └── codet5_small
├── sh
│   ├── exp_with_args.sh
│   ├── run_exp.py
│   ├── results
│   ├── saved_models
│   └── tensorboard
└── tokenizer
    └── salesforce
        ├── codet5-merges.txt
        └── codet5-vocab.json    
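
The tokenizer directory above holds the BPE vocabulary and merges files for CodeT5. As a rough sketch of how such files can be loaded locally with Hugging Face transformers (the fine-tuning scripts may construct their tokenizer differently, and pre-training special tokens such as the <extra_id_*> sentinels are not added here):

from transformers import RobertaTokenizer

# build a byte-level BPE tokenizer from the released vocab/merges files
tokenizer = RobertaTokenizer(
    vocab_file="tokenizer/salesforce/codet5-vocab.json",
    merges_file="tokenizer/salesforce/codet5-merges.txt",
)
print(tokenizer.tokenize("def add(a, b): return a + b"))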

Fine-tuning

Go to the sh folder and set WORKDIR in exp_with_args.sh to the path of your downloaded CodeT5 repository.
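
For example (the path below is purely illustrative):

cd sh
# inside exp_with_args.sh, point WORKDIR at your local clone, e.g.:
# WORKDIR="/path/to/CodeT5"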

You can use run_exp.py to run a broad set of experiments by simply passing the model_tag, task, and sub_task arguments. In total, we support four models (i.e., ['roberta', 'codebert', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, sub_task specifies which specific dataset to fine-tune on.

For example, if you want to run the CodeT5-base model on the code summarization task for Ruby, you can simply run:

python run_exp.py --model_tag codet5_base --task summarize --sub_task ruby

In addition, you can specify:

  • model_dir: where to save fine-tuning checkpoints
  • res_dir: where to save the performance results
  • summary_dir: where to save the training curves
  • data_num: how many data instances to use; the default -1 uses the full data
  • gpu: the index of the GPU to use in the cluster

You can also revise the suggested arguments here, and refer to the argument flags in configs.py for all available options. The training curves saved in summary_dir can be visualized with TensorBoard.
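
For instance, a sketch that combines several of these options (assuming the option names above map directly onto run_exp.py command-line flags; see configs.py and run_exp.py for the authoritative list):

# fine-tune CodeT5-small on a 1000-instance subset of Python code summarization, using GPU 0
python run_exp.py --model_tag codet5_small --task summarize --sub_task python --data_num 1000 --gpu 0

# visualize the saved training curves (point --logdir at your summary_dir)
tensorboard --logdir tensorboard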

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug reports. We welcome PRs!

Comments
  • 'tuple' object has no attribute 'loss'

    Hi, I want to run CodeT5-base on the code generation task. I ran the command: python run_exp.py --model_tag codet5_base --task concode --sub_task none

    I get the error: 'tuple' object has no attribute 'loss'.

    I tried changing outputs = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask) to outputs, _ = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask)

    Then I get the error: too many values to unpack (expected 2)

    What should I do?

    opened by skye95git 15
  • Does the released pre-trained model include the dual generation pre-training?

    Dear authors,

    I noticed that in the paper you mention pre-training the T5 model with identifier-aware denoising for 100 epochs and further pre-training with bimodal generation for 50 epochs. I was wondering whether the released model includes only the first 100 epochs or the whole 150 epochs?

    Thanks in advance for your clarification

    opened by Robin-Y-Ding 4
  • Inference for Java code summarization

    Is it possible to run code summarization on raw Java code?

    I can't find an example of inference for code summarization. Could you please provide one? E.g., I expect something like the following code:

    from transformers import RobertaTokenizer,  WHICH_MODELTO_USE
    
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = WHICH_MODELTO_USE.from_pretrained('Salesforce/codet5-base')
    
    java_code = 'int i = 0; ++i;  int b = runSomeFunction(i); extract(b);'
    code_summarization = model.predict(java_code)
    print(code_summarization)
    

    The expected result is the following: 'Extracts and returns max value'

    Is it possible to make such a prediction? The problem is that I can't understand how you translate the code into the vector that will be used to predict the summary without the pre-training procedures.

    Could you please provide an example?

    opened by lyriccoder 4
  • How to get embeddings for JavaScript and Python code snippets?

    I have a couple of questions:

    a) How can I use CodeT5 to extract embeddings for JavaScript and Python code?
    b) Can I feed an incomplete JavaScript or Python snippet to extract an embedding, or does the code snippet need to be complete?
    c) Has anyone used CodeT5 to perform code-to-code search?

    opened by smith-co 3
  • Fine-tuned checkpoints -> Code clone detection

    Hi,

    I am hoping to reproduce the results on the code clone detection task. This might seem like a silly question, but the released fine-tuned checkpoints don't include the RobertaClassificationHead parameters, right? I am only able to load the T5ForConditionalGeneration model using the provided checkpoints for the task.

    So, how do I go about loading the entire CloneModel?

    opened by Chinmayrane16 3
  • Can the CodeT5 model do code autocompletion without fine-tuning?

    The README mentions that CodeT5 is used for code autocompletion in VS Code. I wonder how to use CodeT5 without fine-tuning, as a language model that completes code given some previous context?

    opened by frankxu2004 3
  • Regarding the code generation task

    Can I use it for code generation? For example, if I give a query such as "Add two numbers", it should generate the code for that. If yes, can you please suggest how I can prepare the dataset for this task, or can I use the dataset you mentioned?

    Thank you

    opened by BakingBrains 3
  • Pretrained model for prediction

    Can you kindly elaborate on how we can use the fine-tuned checkpoints for prediction on new data in the concode task? Say this is my prediction data: {"code": "public integer sum(Integer arg0,Integer arg1) {return result;}", "nl": "Add two integers. concode_field_sep int sum concode_field_sep int result"} If I understand correctly, concode is supposed to complete these functions. However, I am not sure how to generate a prediction on this sample data. I tried replacing the test file containing the original test data with this sample test data and then ran the command python run_exp.py --model_tag codet5_small --task concode --sub_task none. This command starts with training, then evaluating, and finally testing. However, I am only interested in prediction. Isn't there any way to directly generate predictions from the fine-tuned model on concode? Kindly let me know if I am doing something wrong.

    opened by surtantheta 3
  • About the AI coding assistant demo

    Hi, the newly added AI coding assistant demo is cool! I have a few questions about it:

    1. Did you make the CodeT5 model into a VS Code plugin? How did you do that?

    2. When demonstrating code generation, editing the comment generates the corresponding code snippet. Aren't the inputs to the code generation model the natural language description and the class environment?

    The format of the data in the Concode dataset is:

    {
        "code": "int function ( double [ ] arg0 , double [ ] arg1 ) { int loc0 = arg0 . length - arg1 . length ; outer : for ( int loc1 = 0 ; loc1 <= loc0 ; loc1 ++ ) { for ( int loc2 = 0 ; loc2 < arg1 . length ; loc2 ++ ) { if ( ne ( arg0 [ loc1 + loc2 ] , arg1 [ loc2 ] ) ) { continue outer ; } } return ( loc1 ) ; } return ( - 1 ) ; }",
        "nl": "searches for the first subsequence of a that matches sub elementwise . elements of sub are considered to match elements of a if they pass the #eq test . concode_field_sep double max_ratio concode_elem_sep double min_ratio concode_elem_sep boolean off concode_field_sep boolean isElemMatch concode_elem_sep int compare concode_elem_sep boolean isSubset concode_elem_sep boolean ne concode_elem_sep boolean lt concode_elem_sep boolean gte concode_elem_sep void set_rel_diff concode_elem_sep boolean eq concode_elem_sep boolean lte concode_elem_sep boolean gt"
    }
    

    Is it possible to generate an accurate code snippet by typing only comments, without the class environment? Doesn't the loss of context information affect the quality of the generated code?

    opened by skye95git 3
  • Compilable code

    I've fine-tuned CodeT5-large on a small Python dataset (~1700 data points). I see that the results are more or less correct, but the code is not always compilable (due to inconsistent spacing and newline characters). Any idea how to fix this? Also, how does CodeBLEU work if the code generated by the model isn't compilable? The model might generate non-compilable code during the initial phases of training, right?

    opened by Debdeep1998 2
  • Minimal snippets to run CodeT5 on each task

    Hello, I have spent quite some time trying to run the fine-tuned model you kindly provided on the clone detection task, with no luck. Could you provide minimal Python scripts to load and run the model on each of the downstream tasks, as was done for ntp and summarization on 🤗?

    Thanks a lot!

    opened by rocco-fortuna 2
  • Questions about data preprocessing

    Hi.

    My two colleagues and I are interested in replicating the results of CodeT5-base on the code generation task with our own dataset. However, we're having a few hiccups preprocessing the data, and we hope you don't mind a few questions.

    Mainly, we're wondering how you dealt with the Google BigQuery data alongside CodeSearchNet. To our knowledge, the CodeSearchNet data consists of nicely isolated function-level code blocks, while the BigQuery data and the other extra C/C# data from open-source GitHub repositories, as far as we can guess, isn't presented in such a convenient manner.

    Our own data is in a similar position: none of the code is isolated into function-level blocks; rather, it is mostly complete files. We were wondering whether you did any preprocessing of your own so that the extra C/C# data would match CodeSearchNet, or whether you just used it raw.

    And if you did use it raw, did it affect the performance compared to training the model only with CodeSearchNet data? Thank you in advance.

    P.S. My colleagues are also wondering how you dealt with whitespace, arguing that the paper wasn't so clear on that. One argues that you discarded whitespace altogether, while the other argues that you only collapsed runs of whitespace into a single instance, e.g., A. '\s\s\s' --> '' B. '\s\s\s' --> '\s'

    opened by qkim2525 0