CodeBERT-Implementation
In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
We are interested in evaluating CodeBERT specifically on natural language code search: given a natural language query as input, the objective is to find the most semantically related code from a collection of code snippets.
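For intuition, the sketch below shows how a fine-tuned CodeBERT classifier can be used to rank candidate code snippets for a query: each (query, code) pair is scored, and the candidates are sorted by their relevance probability. This is only an illustration, not part of this repo's pipeline; the query, candidate snippets, and checkpoint path are placeholders, and the raw microsoft/codebert-base checkpoint has a randomly initialized classification head, so meaningful scores require a fine-tuned checkpoint (see the fine-tuning command below).

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Placeholder checkpoint: replace with a fine-tuned one (e.g. ./models/<lang>/checkpoint-best/);
# the raw base model's classification head is randomly initialized.
model_path = "microsoft/codebert-base"
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained(model_path)
model.eval()

query = "reverse a linked list"  # hypothetical natural language query
candidates = [                   # hypothetical pool of code snippets
    "def reverse(head):\n    prev = None\n    ...",
    "def add(a, b):\n    return a + b",
]

scores = []
for code in candidates:
    inputs = tokenizer(query, code, truncation=True, max_length=50, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    scores.append(logits.softmax(dim=-1)[0, 1].item())  # probability the pair is relevant

ranked = sorted(zip(scores, candidates), reverse=True)  # highest score = best match
```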
This code was implemented on a 64-bit Windows system with 8 GB RAM and a GeForce GTX 1650 (4 GB) graphics card.
Due to limited computational power, we trained and evaluated the model on a much smaller dataset than the original, as shown below.
Language | Training data size (original) | Training data size (ours) | Validation data size (original) | Validation data size (ours) | Test data size for batch_0 (original) | Test data size for batch_0 (ours)
---|---|---|---|---|---|---
Ruby | 97580 | 500 | 4417 | 100 | 1000000 | 20000 |
Go | 635653 | 500 | 28483 | 100 | 1000000 | 20000 |
PHP | 1047404 | 500 | 52029 | 100 | 1000000 | 20000 |
Python | 824342 | 500 | 46213 | 100 | 1000000 | 20000 |
Java | 908886 | 500 | 30655 | 100 | 1000000 | 20000 |
JavaScript | 247773 | 500 | 16505 | 100 | 1000000 | 20000 |
Compared to the code in the original repo, the code in this repo can be run directly on a Windows system without any hindrance. We have already provided a subset of pre-processed test data for batch_0 (sizes shown in the table above under test data size) in ./data/codesearch/test/
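If you want to regenerate the smaller training and validation files yourself, one simple way is to keep only the first few examples of the full files. This is a sketch, assuming the pre-processed train.txt / valid.txt files from the original CodeBERT pre-processing are available for the chosen language; the output file names match the --train_file and --dev_file arguments used below.

```python
# Minimal sketch: truncate the full pre-processed files to the subset sizes in the table.
def take_first_n(src_path, dst_path, n):
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i >= n:
                break
            dst.write(line)

take_first_n("train.txt", "train_short.txt", 500)   # training subset
take_first_n("valid.txt", "valid_short.txt", 100)   # validation subset
```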
Fine-tuning the pre-trained CodeBERT model on individual languages
lang = 'go'
cd CodeBERT-Implementation
! python run_classifier.py --model_type roberta --task_name codesearch --do_train --do_eval --eval_all_checkpoints --train_file train_short.txt --dev_file valid_short.txt --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir CodeBERT-Implementation/data/codesearch/train_valid/$lang/ --output_dir ./models/$lang/ --model_name_or_path microsoft/codebert-base
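To fine-tune a model for every language in one go, a small driver along the following lines can replay the command above for each language. This is a convenience sketch, not part of the original repo; it assumes the per-language data directories are named as in the --data_dir path above.

```python
import subprocess

# Convenience sketch: fine-tune one model per language with the same hyper-parameters.
for lang in ["ruby", "go", "php", "python", "java", "javascript"]:
    subprocess.run([
        "python", "run_classifier.py",
        "--model_type", "roberta",
        "--task_name", "codesearch",
        "--do_train", "--do_eval", "--eval_all_checkpoints",
        "--train_file", "train_short.txt",
        "--dev_file", "valid_short.txt",
        "--max_seq_length", "50",
        "--per_gpu_train_batch_size", "8",
        "--per_gpu_eval_batch_size", "8",
        "--learning_rate", "1e-5",
        "--num_train_epochs", "1",
        "--gradient_accumulation_steps", "1",
        "--overwrite_output_dir",
        "--data_dir", f"CodeBERT-Implementation/data/codesearch/train_valid/{lang}/",
        "--output_dir", f"./models/{lang}/",
        "--model_name_or_path", "microsoft/codebert-base",
    ], check=True)
```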
Inference and Evaluation
lang = 'go'
idx = 0
! python run_classifier.py --model_type roberta --model_name_or_path microsoft/codebert-base --task_name codesearch --do_predict --output_dir CodeBERT-Implementation/data/models/$lang --data_dir CodeBERT-Implementation/data/codesearch/test/$lang/ --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --test_file batch_short_${idx}.txt --pred_model_dir ./models/$lang/checkpoint-best/ --test_result_dir ./results/$lang/${idx}_batch_result.txt
! python mrr.py
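mrr.py computes the Mean Reciprocal Rank (MRR) over the prediction files. As a rough illustration of the metric (a sketch, not the actual script), assume each query is scored against a pool of candidates and its ground-truth snippet sits at a known index:

```python
import numpy as np

def mean_reciprocal_rank(score_matrix):
    """score_matrix[i, j] = relevance score of candidate j for query i,
    with the ground-truth code for query i assumed to sit at column i."""
    reciprocal_ranks = []
    for i, scores in enumerate(score_matrix):
        rank = int((scores > scores[i]).sum()) + 1  # 1-based rank of the true snippet
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Toy example: two queries, three candidates each.
print(mean_reciprocal_rank(np.array([[0.9, 0.1, 0.2],     # true snippet ranked 1st -> 1.0
                                     [0.8, 0.3, 0.5]])))  # true snippet ranked 3rd -> 1/3
```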
The Mean Reciprocal Rank (MRR), the evaluation metric, for the subset of data is given as follows:
Language | MRR |
---|---|
Ruby | 0.0037 |
Go | 0.0034 |
PHP | 0.0044 |
Python | 0.0052 |
Java | 0.0033 |
JavaScript | 0.0054 |
The scores are much lower than those reported in the paper, which is expected given the drastically reduced training data. However, the purpose of this repo is to give the user ready-to-use CodeBERT data and code without any heavy downloads. We have also included in this repo the prediction results corresponding to the test data.