CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Microsoft

Last update: Jan 3, 2023

Overview

CodeBERT

This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT is a multi-programming-lingual model pre-trained on NL-PL (natural language and programming language) pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

Dependency

pip install torch
pip install transformers

Quick Tour

We use the huggingface/transformers framework to train the model. You can use our model like the pre-trained RoBERTa base model. Here is an example of how to load the model.

import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)

NL-PL Embeddings

Here is an example of how to obtain embeddings from CodeBERT.

>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
torch.Size([1, 23, 768])
tensor([[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
        [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
        [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
        ...,
        [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
        [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
        [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]],
       grad_fn=<SelectBackward>) 
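
The token list above follows the layout <s> + NL tokens + </s> + code tokens + </s>. As a rough illustration of that assembly step, the sketch below builds the same sequence in plain Python. The helper name build_pair_tokens and the truncation policy (trimming only the code part to fit max_length) are assumptions for illustration, not part of the CodeBERT code.

```python
def build_pair_tokens(nl_tokens, code_tokens, cls_token="<s>", sep_token="</s>",
                      max_length=512):
    """Assemble <s> + NL + </s> + code + </s>, truncating the code part if needed."""
    # Three positions are reserved for the special tokens.
    budget = max_length - 3 - len(nl_tokens)
    if budget < 0:
        raise ValueError("NL tokens alone exceed max_length")
    return [cls_token] + nl_tokens + [sep_token] + code_tokens[:budget] + [sep_token]


nl_tokens = ['return', 'Ġmaximum', 'Ġvalue']
code_tokens = ['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':',
               'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']

tokens = build_pair_tokens(nl_tokens, code_tokens)
print(len(tokens))  # 23, the sequence dimension in torch.Size([1, 23, 768])
```

The length of 23 is 3 NL tokens, 17 code tokens and 3 special tokens.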
 

Probing

As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM) is.

Here is an example of how to use CodeBERT (MLM) for the mask prediction task.

from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

CODE = "if (x is not None) <mask> (x>1)"

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(CODE)
print(outputs)

Results

'and', 'or', 'if', 'then', 'AND'

The detailed outputs are as follows:

{'sequence': ' if (x is not None) and (x>1)', 'score': 0.6049249172210693, 'token': 8}
{'sequence': ' if (x is not None) or (x>1)', 'score': 0.30680200457572937, 'token': 50}
{'sequence': ' if (x is not None) if (x>1)', 'score': 0.02133703976869583, 'token': 114}
{'sequence': ' if (x is not None) then (x>1)', 'score': 0.018607674166560173, 'token': 172}
{'sequence': ' if (x is not None) AND (x>1)', 'score': 0.007619690150022507, 'token': 4248}
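
The detailed outputs are a list of dictionaries with sequence, score and token keys. The sketch below, which uses only the five results printed above, recovers the filled-in word from each sequence and sums the probability mass covered by the top 5. The helper filled_word is a hypothetical name and assumes the mask appears exactly once in the template.

```python
MASK = "<mask>"
TEMPLATE = "if (x is not None) <mask> (x>1)"

# The five results printed above, copied verbatim.
outputs = [
    {'sequence': ' if (x is not None) and (x>1)', 'score': 0.6049249172210693, 'token': 8},
    {'sequence': ' if (x is not None) or (x>1)', 'score': 0.30680200457572937, 'token': 50},
    {'sequence': ' if (x is not None) if (x>1)', 'score': 0.02133703976869583, 'token': 114},
    {'sequence': ' if (x is not None) then (x>1)', 'score': 0.018607674166560173, 'token': 172},
    {'sequence': ' if (x is not None) AND (x>1)', 'score': 0.007619690150022507, 'token': 4248},
]


def filled_word(sequence, template=TEMPLATE, mask=MASK):
    """Return the text that replaced the mask in a filled-in sequence."""
    prefix, suffix = template.split(mask)
    text = sequence.strip()
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("sequence does not match the template")
    return text[len(prefix):len(text) - len(suffix)]


words = [filled_word(o['sequence']) for o in outputs]
top5_mass = sum(o['score'] for o in outputs)
print(words)  # ['and', 'or', 'if', 'then', 'AND']
print(round(top5_mass, 4))
```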

Downstream Tasks

For Code Search and Code Documentation Generation tasks, please refer to the CodeBERT folder.

GraphCodeBERT

This repo also provides the code for reproducing the experiments in GraphCodeBERT: Pre-training Code Representations with Data Flow. GraphCodeBERT is a pre-trained model for programming language that considers the inherent structure of code, i.e. data flow. It is a multi-programming-lingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the GraphCodeBERT folder.

Contact

Feel free to contact Daya Guo ([email protected]), Duyu Tang ([email protected]), Shuai Lu ([email protected]) and Nan Duan ([email protected]) if you have any further questions.

Comments

How to finetune CodeBERT to do a 4 class classification task.

Hi,

Recently I have been experimenting with the clone detection variant of CodeBERT to solve a 4-class classification problem. But it seems the model only predicts 2 classes, even though the task I am training it on has 4 classes in data.jsonl, train.txt, valid.txt, etc. Is it possible to use the provided examples for a multi-class classification problem with CodeBERT, or is it currently only able to solve a binary classification problem out of the box (using the clonedetection folder)?

Thanks a lot

opened by PedroFortunatoEsteves 24
Convert clonedetection example to multitask/multilabel
Hi, right now the GraphCodeBERT clone detection performs binary classification to decide whether 2 pieces of code are semantically equivalent or not.

The problem I am trying to solve is: Given a natural language utterance and two code pieces (A and B) as input to my model, determine whether:

both pieces are correct

piece A is correct and piece B is wrong

piece A is wrong and piece B is correct

both pieces are wrong

I tried solving this problem as a 4-class classification task in https://github.com/microsoft/CodeBERT/issues/53 , but the results were not very good. What I am trying to do now is to turn it into a multi-label/multi-task problem, classifying each input twice:

[0,1] -> Whether A is right or wrong.
[0,1] -> Whether B is right or wrong.

Does anyone have any idea how to accomplish this?

Thanks a lot
opened by PedroFortunatoEsteves 16
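
One way to picture the two-label formulation in the issue above is as a re-encoding of the four classes into a pair of binary targets, one per code piece. The sketch below shows that mapping and a per-label binary cross-entropy in plain Python. The class numbering, the function names and the use of one independent logit per label are assumptions for illustration; the repository's clone detection code does not provide this.

```python
import math

# Assumed class numbering: 0 = both correct, 1 = only A correct,
# 2 = only B correct, 3 = both wrong. Labels are (A is correct, B is correct).
CLASS_TO_LABELS = {0: (1, 1), 1: (1, 0), 2: (0, 1), 3: (0, 0)}
LABELS_TO_CLASS = {labels: cls for cls, labels in CLASS_TO_LABELS.items()}


def multilabel_bce(logits, labels):
    """Mean binary cross-entropy with logits, one independent logit per label."""
    losses = []
    for logit, label in zip(logits, labels):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
        losses.append(max(logit, 0.0) - logit * label
                      + math.log1p(math.exp(-abs(logit))))
    return sum(losses) / len(losses)


def predict_class(logits):
    """Threshold each logit at 0 (probability 0.5) and map back to a class id."""
    labels = tuple(int(logit >= 0.0) for logit in logits)
    return LABELS_TO_CLASS[labels]


targets = CLASS_TO_LABELS[1]           # A correct, B wrong
logits = [2.0, -1.5]                   # hypothetical model outputs
loss = multilabel_bce(logits, targets)
print(targets, round(loss, 4), predict_class(logits))
```

With this encoding the model head needs two outputs instead of four, and each output can be trained and evaluated separately.
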
Results on natural language code retrieval

Why is the MA-AVG of CODEBERT (MLM, INIT=R) about 3% higher than that of PT W/ CODE ONLY (INIT=R)?

Is it because their network structure is different?

But as described in the paper, they use the same network architecture and objective function MLM:

We develop CodeBERT by using exactly the same model architecture as RoBERTa-base. The total number of model parameters is 125M.

Is it because they use different pre-training data? As described in the paper:

In the MLM objective, only bimodal data (i.e. datapoints of NL-PL pairs) is used for training. RoBERTa which is continuously trained with masked language modeling on codes only.

opened by skye95git 15
Code stuck infinitely when performing Fine-Tuning

When running the fine-tune operation, the script gets stuck at the following warning.

Epoch: 0%| | 0/8 [00:00<?, ?it/s]
/home/akash/.local/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:224: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
  warnings.warn("To get the last learning rate computed by the scheduler, "

opened by akash-isu 13
Model Inferencing

Hi, I'm using clone detection in GraphCodeBERT and I'm trying to run inference from a saved checkpoint. I'm trying to load the saved model from a folder called "roberta" that contains all the prerequisites I got from Hugging Face transformers. I run inference with the provided command:

!python run.py --output_dir=saved_models --config_name=roberta --model_name_or_path=roberta --tokenizer_name=roberta --do_eval --train_data_file=dataset/train.txt --eval_data_file=dataset/valid.txt --test_data_file=dataset/test.txt --epoch 1 --code_length 128 --eval_batch_size 4 --learning_rate 2e-5 --max_grad_norm 1.0 --evaluate_during_training

When this command tries to load the model from the "roberta" folder, I get the error ***"ValueError: The state dictionary of the model you are training to load is corrupted. Are you sure it was properly saved?"***. When I load the saved model separately and print it, it is an OrderedDict instead of a StateDict. I tried fine-tuning the model again so that it would be saved correctly, but it was saved as an OrderedDict again. Maybe this is causing the error.

PS: Given two pieces of code in dataset.json and the test.txt in which all the indexes and actual labels are listed, I need to run inference with the saved model to predict whether the two pieces are similar or not (either 0 or 1). How do I go about this? If possible, please give a brief outline of the process for this specific problem.

opened by lokesh-ixo 11
How to use the pre-trained model of codebert?

I've run the demo.py of the Siamese-model. The log is:

Query: set a variable as hello world
Code: print('hello world')
Score: 2.4148944177682097e-08
Code: s = 'hello world'
Score: 0.999518632888794
Code: hello world
Score: 0.00048138442798517644

Query: Download an image and save the content in output_dir
Code: def f(image_url, output_dir): import requests r = requests.get(image_url) with open(output_dir, 'wb') as f: f.write(r.content)
Score: 0.9694535732269287
Code: def f(image, output_dir): with open(output_dir, 'wb') as f: f.write(image)
Score: 9.678478818386793e-05
Code: def f(image_url, output_dir): import requests r = requests.get(image_url) return r.content
Score: 0.03044973686337471

But when I run the evaluation code:

python run.py \
    --output_dir=./saved_models/python \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/python/train.jsonl \
    --eval_data_file=dataset/python/valid.jsonl \
    --test_data_file=dataset/python/test.jsonl \
    --codebase_file=dataset/python/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1| tee saved_models/python/test.log

an error occurs: saved_models/python/test.log: No such file or directory.

I have a few questions:

1. Is the code tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") in demo.py used to load the tokenizer? The path microsoft/codebert-base does not exist in the Siamese-model folder. Do I need to create it myself? What files do I need to store in it after I create it?

2. The run command for CodeBERT also has the microsoft/codebert-base parameter, for example:

python run_classifier.py \
    --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --task_name codesearch \
    --do_predict \
    --output_dir ./models/$lang \
    --data_dir ../data/codesearch/test/$lang \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --test_file batch_${idx}.txt \
    --pred_model_dir ./models/$lang/checkpoint-best/ \
    --test_result_dir ./results/$lang/${idx}_batch_result.txt

And so does the code in Quick Tour:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")

Do I need to create it myself? What files do I need to store in it after I create it?

3. I tried to evaluate the pre-trained model, but got the error saved_models/python/test.log: No such file or directory. Do I need to retrain the model myself?

opened by skye95git 11
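
The message saved_models/python/test.log: No such file or directory is what tee typically reports when the directory it is asked to write into does not exist yet, which would be unrelated to the model itself. A minimal sketch, assuming that is the cause here, is to create the output directory before launching the run. The helper name ensure_log_dir is hypothetical, and the example runs inside a temporary directory so that it has no side effects.

```python
import tempfile
from pathlib import Path


def ensure_log_dir(log_path):
    """Create the parent directory of log_path if it is missing and return it."""
    parent = Path(log_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


# In practice the argument would be "saved_models/python/test.log",
# relative to the directory the run command is started from.
with tempfile.TemporaryDirectory() as root:
    log_dir = ensure_log_dir(Path(root) / "saved_models" / "python" / "test.log")
    created = log_dir.is_dir()

print(created)
```
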

Using CodeBERT for code based semantic search / clustering

Hi,

I am interested in using CodeBERT for semantic text similarity / clustering on code but my results are rather poor. Here is my process:

Download the data:

mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo  
unzip codesearch_data.zip
rm codesearch_data.zip

Grab some examples to embed:

from pathlib import Path

max_instances = 8

valid = Path("data/codesearch/train_valid/python/valid.txt").read_text().split("\n")
code = [ex.split("<CODESPLIT>")[-1] for ex in valid][:max_instances]

Embed the examples:

import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model = model.to(device)

# Prepare the inputs
inputs = tokenizer(
    code, padding=True, truncation=True, return_tensors="pt"
)

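# Embed the inputs
for name, tensor in inputs.items():
    inputs[name] = tensor.to(model.device)
# Index the output instead of tuple-unpacking it, so this also works when
# the model returns a ModelOutput object (transformers v4 and later)
sequence_output = model(**inputs, output_hidden_states=False)[0]
# Use the first-token (<s>) vector of each sequence as its embedding
embeddings = sequence_output[:, 0, :]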

Then I compute the cosine similarity between the first input's embedding and the embeddings of the remaining inputs:

from torch.nn import CosineSimilarity

# Perform a cosine based semantic similarity search, using the first function as query 
sim = CosineSimilarity(dim=-1)
cosine = sim(embeddings[0], embeddings[1:])
scores, indices = cosine.topk(5)

print(f"Scores: {scores.tolist()}")
print()
print(f"Query:\n---\n{code[0]}")
print()
topk = '\n'.join([code[i] for i in indices])
print(f"Top K:\n---\n{topk}")

The output:

Scores: [0.9909096360206604, 0.9864522218704224, 0.9837372899055481, 0.9776582717895508, 0.9704807996749878]

Query:
---
def start_transaction ( self , sort , address , price = None , data = None , caller = None , value = 0 , gas = 2300 ) : assert self . _pending_transaction is None , "Already started tx" self . _pending_transaction = PendingTransaction ( sort , address , price , data , caller , value , gas )

Top K:
---
def remove_node ( self , id ) : if self . has_key ( id ) : n = self [ id ] self . nodes . remove ( n ) del self [ id ] # Remove all edges involving id and all links to it. for e in list ( self . edges ) : if n in ( e . node1 , e . node2 ) : if n in e . node1 . links : e . node1 . links . remove ( n ) if n in e . node2 . links : e . node2 . links . remove ( n ) self . edges . remove ( e )
def find_essential_genes ( model , threshold = None , processes = None ) : if threshold is None : threshold = model . slim_optimize ( error_value = None ) * 1E-02 deletions = single_gene_deletion ( model , method = 'fba' , processes = processes ) essential = deletions . loc [ deletions [ 'growth' ] . isna ( ) | ( deletions [ 'growth' ] < threshold ) , : ] . index return { model . genes . get_by_id ( g ) for ids in essential for g in ids }
async def play_now ( self , requester : int , track : dict ) : self . add_next ( requester , track ) await self . play ( ignore_shuffle = True )
def _handleAuth ( fn ) : @ functools . wraps ( fn ) def wrapped ( * args , * * kwargs ) : # auth, , authenticate users, internal from yotta . lib import auth # if yotta is being run noninteractively, then we never retry, but we # do call auth.authorizeUser, so that a login URL can be displayed: interactive = globalconf . get ( 'interactive' ) try : return fn ( * args , * * kwargs ) except requests . exceptions . HTTPError as e : if e . response . status_code == requests . codes . unauthorized : #pylint: disable=no-member logger . debug ( '%s unauthorised' , fn ) # any provider is sufficient for registry auth auth . authorizeUser ( provider = None , interactive = interactive ) if interactive : logger . debug ( 'retrying after authentication...' ) return fn ( * args , * * kwargs ) raise return wrapped
def write_log ( log_path , data , allow_append = True ) : append = os . path . isfile ( log_path ) islist = isinstance ( data , list ) if append and not allow_append : raise Exception ( 'Appending has been disabled' ' and file %s exists' % log_path ) if not ( islist or isinstance ( data , Args ) ) : raise Exception ( 'Can only write Args objects or dictionary' ' lists to log file.' ) specs = data if islist else data . specs if not all ( isinstance ( el , dict ) for el in specs ) : raise Exception ( 'List elements must be dictionaries.' ) log_file = open ( log_path , 'r+' ) if append else open ( log_path , 'w' ) start = int ( log_file . readlines ( ) [ - 1 ] . split ( ) [ 0 ] ) + 1 if append else 0 ascending_indices = range ( start , start + len ( data ) ) log_str = '\n' . join ( [ '%d %s' % ( tid , json . dumps ( el ) ) for ( tid , el ) in zip ( ascending_indices , specs ) ] ) log_file . write ( "\n" + log_str if append else log_str ) log_file . close ( )

Notice that the cosine is very high for the top-5 examples, which is unexpected as these examples are chosen randomly. Manually inspecting them, they don't appear to be very relevant to the query.

My questions:

Am I doing something wrong?
Is there a better way to do semantic similarity searching/clustering with CodeBERT? Here I am following the canonical pipeline for sentence embeddings.
One possible source of error is the tokenization. Am I supposed to use the CodeBERT tokenizer on code, or just text?

opened by JohnGiorgi 11
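
For reference, the pipeline in the issue above takes the first-token (<s>) vector as the sequence embedding. A common alternative for sentence-style embeddings is mask-aware mean pooling over all token vectors, followed by cosine similarity. The sketch below shows both steps in plain Python on toy vectors. It illustrates the general technique only and is not a recommendation from the CodeBERT authors; the function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy example: two real tokens and one padding position.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [0.5, 0.5]
print(cosine(pooled, [1.0, 1.0]))
```

Separately, the indices returned by topk in the issue's snippet are positions within embeddings[1:], so they are offset by one relative to the code list.
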

Issues related to experimentation results
Hi,

I have constructed a new dataset [train.txt, test.txt, valid.txt] with the following format:

1<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>[docString]<CODESPLIT>[code]
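
A line in this format can be split on the <CODESPLIT> separator into its five fields. The sketch below is a hypothetical helper (the names Example and parse_example are not from the repository) that assumes exactly five fields per line, as in the format above; the sample docstring and code are made up.

```python
from typing import NamedTuple

SEPARATOR = "<CODESPLIT>"


class Example(NamedTuple):
    label: int
    url: str
    func_name: str
    docstring: str
    code: str


def parse_example(line):
    """Split one dataset line into its five <CODESPLIT>-separated fields."""
    fields = line.rstrip("\n").split(SEPARATOR)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    label, url, func_name, docstring, code = fields
    return Example(int(label), url, func_name, docstring, code)


line = SEPARATOR.join(
    ["1", "URL", "returnType.methodName", "return maximum value", "def max(a,b): ..."]
)
example = parse_example(line)
print(example.label, example.docstring)
```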

I have placed constant values such as "1", "URL", and "returnType.methodName" for the whole dataset. When I run the following script, I get results of acc = 1.0, acc_and_f1 = 1.0, and f1 = 1.0:

python CodeBERT/codesearch/run_classifier.py \
    --model_type roberta \
    --task_name codesearch \
    --do_train \
    --do_eval \
    --eval_all_checkpoints \
    --train_file train.txt \
    --dev_file valid.txt \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --gradient_accumulation_steps 1 \
    --overwrite_output_dir \
    --data_dir CodeBERT/data/train_valid \
    --output_dir CodeBERT/models \
    --model_name_or_path CodeBERT/pretrained_models/pretrained_codebert

Following are the learning rate and loss graphs:

However, when I run the following two scripts, I get an MRR of 0.0031. I am not sure why the MRR value is so low.

python CodeBERT/codesearch/run_classifier.py \
    --model_type roberta \
    --model_name_or_path CodeBERT/models \
    --task_name codesearch \
    --do_predict \
    --output_dir CodeBERT/data/train_valid \
    --data_dir CodeBERT/data/train_valid \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --test_file test.txt \
    --pred_model_dir CodeBERT/models \
    --test_result_dir CodeBERT/results/result.txt

python CodeBERT/codesearch/mrr.py
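
For context, mean reciprocal rank (MRR) averages 1/rank of the correct code snippet over all queries, so a value of 0.0031 is what one would get if the correct snippet were consistently ranked at around position 320. The following is a generic sketch of the metric, not the repository's mrr.py; the function names and the convention that higher scores rank first are assumptions.

```python
def reciprocal_rank(scores, correct_index):
    """Return 1 / rank of the correct candidate, where higher scores rank first."""
    target = scores[correct_index]
    # Rank = 1 + number of candidates scored strictly higher than the target.
    rank = 1 + sum(1 for score in scores if score > target)
    return 1.0 / rank


def mean_reciprocal_rank(queries):
    """queries: list of (scores, correct_index) pairs, one per query."""
    values = [reciprocal_rank(scores, index) for scores, index in queries]
    return sum(values) / len(values)


queries = [
    ([0.9, 0.1, 0.3], 0),        # correct candidate ranked 1st
    ([0.2, 0.8, 0.5], 2),        # ranked 2nd
    ([0.4, 0.3, 0.2, 0.1], 3),   # ranked 4th
]
mrr = mean_reciprocal_rank(queries)
print(mrr)  # (1 + 1/2 + 1/4) / 3
```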

Secondly, does Table 2 in the paper represent MRR values generated from the above scripts?

Finally, what is the difference between the jsonl and text file formats? I guess jsonl files are used in the documentation generation experiments? For this purpose, I constructed jsonl files containing the same data, in the format shown below. Only code_tokens and docstring_tokens are filled in, with the token lists of the code snippet and the natural language description. Is this the right approach?

{"repo": "", "path": "", "func_name": "", "original_string": "", "language": "lang", "code": "", "code_tokens": [], "docstring": "", "docstring_tokens": [], "sha": "", "url": "", "partition": ""}

Kindly let me know about my concerns.
opened by hmdgit 11
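
For the JSONL layout quoted in the issue above, each line of the file is one standalone JSON object. The sketch below writes and reads such records with the standard json module. The field names are the ones listed in the issue; the sample values and helper names are made up for illustration, and whether the downstream scripts need only code_tokens and docstring_tokens is not confirmed here.

```python
import io
import json

FIELDS = ["repo", "path", "func_name", "original_string", "language", "code",
          "code_tokens", "docstring", "docstring_tokens", "sha", "url", "partition"]


def make_record(code_tokens, docstring_tokens, language="lang"):
    """Build one record with every field present and only the token lists filled in."""
    record = {name: "" for name in FIELDS}
    record["language"] = language
    record["code_tokens"] = list(code_tokens)
    record["docstring_tokens"] = list(docstring_tokens)
    return record


def write_jsonl(records, handle):
    """Write one JSON object per line."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_jsonl(handle):
    """Read back one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_jsonl([make_record(["def", "f", "(", ")", ":"], ["do", "nothing"])], buffer)
buffer.seek(0)
records = read_jsonl(buffer)
print(records[0]["docstring_tokens"])  # ['do', 'nothing']
```
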
About the construction of DFG

Thank you for your great work! The code is very clear and concise to read.

I would like to ask about the logic behind each function in DFG.py. I would like to implement CFG (control flow graph) extraction using it as a reference, because I think there are times when a CFG might be useful for understanding the code as well.

opened by wangdeze18 10
can I get embedding for java code snippet?
I am looking into the following example to extract code embedding.

# Encode maximum function
func = "def f(a,b): if a>b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,max_func_embedding = model(source_ids)

So I suppose I can get embedding for a python function from this max_func_embedding.

However, I have the following three questions:

a) Can I use CodeBERT to extract embedding for Java code?

b) Can I feed incomplete JavaScript code to extract embedding? Or the code snippet needs to be complete?

And most importantly: c) Can I feed multiple functions and get an embedding for the whole snippet?

Let's say the code snippet has two functions and is slightly incomplete:

testCreateProcessDefinitionQuery ( ) { org . foxbpm . engine . repository . ProcessDefinitionQuery processQuery = modelService . createProcessDefinitionQuery ( ) ; "<AssertPlaceHolder>" ; } createProcessDefinitionQuery ( ) { return new org . foxbpm . engine . impl . model . ProcessDefinitionQueryImpl ( commandExecutor ) ; }

Will UniXcoder even generate an embedding for this case?
opened by ramsey-coding 6
Where could I find the code for comparison in CodeSearch task?

Thanks for your great work! Could you please provide the code for the comparison mentioned in the paper? I would like to reproduce your work. Could you please send the code to my email ([email protected]) if possible?

Many thanks in advance.

opened by iamfaith 6
The mask_attention matrix in UniXcoder

How should I understand the mask_attention matrix in the forward function of the UniXcoder class in unixcoder.py?

mask = source_ids.ne(self.config.pad_token_id)
attention_mask = mask.unsqueeze(1) * mask.unsqueeze(2)

Why isn't mask_attention simply source_ids.ne(self.config.pad_token_id)? Thank you for your reply!

opened by MrBlack0220 0
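
As a small illustration of what the expression in the issue above computes: mask has shape [batch, length], and mask.unsqueeze(1) * mask.unsqueeze(2) broadcasts it into a [batch, length, length] matrix whose entry (i, j) is 1 only when both position i and position j are real (non-padding) tokens. The plain-Python sketch below reproduces that outer product for a single sequence; the function name, pad_token_id = 1 and the token ids are assumed values for illustration.

```python
def pairwise_attention_mask(source_ids, pad_token_id=1):
    """Outer product of the padding mask with itself, for one sequence."""
    mask = [int(token != pad_token_id) for token in source_ids]
    return [[row * col for col in mask] for row in mask]


source_ids = [0, 9232, 2, 1]  # start token, one token, end token, padding
attention = pairwise_attention_mask(source_ids)
for row in attention:
    print(row)
# [1, 1, 1, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
# [0, 0, 0, 0]
```

A two-dimensional mask of this shape gives every query position its own row of allowed key positions, which a one-dimensional padding mask cannot express; that is presumably why the forward function builds it this way.
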
cls_token representation

Hello,

I am not quite sure I understand how the cls token can be 1 as shown in the datasets. When I try to replicate your work, I get a 0 cls token for every function I put through the tokenizer. Could you please explain?

Thanks

opened by frede791 0
Questions about the inputs for getting embeddings
Hi,

Thanks for your work. I tried to use CodeBERT, GraphCodeBERT and UnixCoder to extract Java code embeddings. However, for inputs to the models, I only used the Java source code, something like [CLS][JavaCode][SEP].

Should I also add comments to the inputs?

For GraphCodeBERT and UnixCoder, should I also add the dataflow and the flattened AST as input? I care about the execution time of the approach, so would adding that information (comments, dataflow and AST) make getting the embeddings take much longer?

I would appreciate your kind suggestions,

Thanks.
opened by rongqipan 0
CodeBERT pre-training data
Dear authors,

I had a question regarding the pre-training of CodeBERT. How is the pre-training data structured exactly?

In section 3.2 of the paper, it is stated that the pre-training data is structured as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. In section 3.3 and in the CodeSearchNet data, the Natural language is inserted after the function definition. Which of these was used to pre-train CodeBERT?

Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?

Thanks in advance,

Ali
opened by aalkaswan 0
Rust weights for CodeBERT MLM

Hello,

I am reaching out regarding the pull request that was opened a few months ago on the Huggingface Model Hub: https://huggingface.co/microsoft/codebert-base-mlm/discussions/1.

I am maintaining a Rust implementation of transformers language models (see repository) and would be interested in making pretrained weights for CodeBERT available. Could you please advise if there are any blockers with merging the addition of the Rust weights?

Thank you!

opened by guillaume-be 0
Accuracy of the generated data flow
If I understand correctly, I can use the GraphCodeBERT/clonedetection/parser/DFG.py library to extract DFG for Java code.

The steps seem to be:

use the custom tree-sitter package to parse a tree

then construct the final DFG based on the parsed tree.

I have a few questions:

Can it handle inter-procedural data-flow analysis?

How accurate is the generated data-flow graph?

Looking forward to your response.
opened by smith-co 1
