CodeBERT-Implementation
In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
We are interested in evaluating CodeBERT specifically on natural language code search: given a natural language query as input, the objective is to find the most semantically related code from a collection of code snippets.
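For intuition, the sketch below shows how a fine-tuned CodeBERT classifier can be used to rank candidate code snippets for a query: each (query, code) pair is scored, and the candidates are sorted by their relevance probability. This is only an illustration, not part of this repo's pipeline; the query, candidate snippets, and checkpoint path are placeholders, and the raw microsoft/codebert-base checkpoint has a randomly initialized classification head, so meaningful scores require a fine-tuned checkpoint (see the fine-tuning command below).

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Placeholder checkpoint: replace with a fine-tuned one (e.g. ./models/<lang>/checkpoint-best/);
# the raw base model's classification head is randomly initialized.
model_path = "microsoft/codebert-base"
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForSequenceClassification.from_pretrained(model_path)
model.eval()

query = "reverse a linked list"  # hypothetical natural language query
candidates = [                   # hypothetical pool of code snippets
    "def reverse(head):\n    prev = None\n    ...",
    "def add(a, b):\n    return a + b",
]

scores = []
for code in candidates:
    inputs = tokenizer(query, code, truncation=True, max_length=50, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    scores.append(logits.softmax(dim=-1)[0, 1].item())  # probability the pair is relevant

ranked = sorted(zip(scores, candidates), reverse=True)  # highest score = best match
```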
This code was implemented on a 64-bit Windows system with 8 GB RAM and a GeForce GTX 1650 (4 GB) graphics card.
Due to limited computational power, we trained and evaluated the model on a much smaller dataset than the original, as shown below.
Language | Training data size (original) | Training data size (ours) | Validation data size (original) | Validation data size (ours) | Test data size for batch_0 (original) | Test data size for batch_0 (ours)
---|---|---|---|---|---|---
Ruby | 97580 | 500 | 4417 | 100 | 1000000 | 20000 |
Go | 635653 | 500 | 28483 | 100 | 1000000 | 20000 |
PHP | 1047404 | 500 | 52029 | 100 | 1000000 | 20000 |
Python | 824342 | 500 | 46213 | 100 | 1000000 | 20000 |
Java | 908886 | 500 | 30655 | 100 | 1000000 | 20000 |
JavaScript | 247773 | 500 | 16505 | 100 | 1000000 | 20000 |
Compared to the code in the original repo, the code in this repo can be run directly on a Windows system without any hindrance. We have already provided a subset of pre-processed test data for batch_0 (sizes shown in the table above under test data size) in ./data/codesearch/test/
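If you want to regenerate the smaller training and validation files yourself, one simple way is to keep only the first few examples of the full files. This is a sketch, assuming the pre-processed train.txt / valid.txt files from the original CodeBERT pre-processing are available for the chosen language; the output file names match the --train_file and --dev_file arguments used below.

```python
# Minimal sketch: truncate the full pre-processed files to the subset sizes in the table.
def take_first_n(src_path, dst_path, n):
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i >= n:
                break
            dst.write(line)

take_first_n("train.txt", "train_short.txt", 500)   # training subset
take_first_n("valid.txt", "valid_short.txt", 100)   # validation subset
```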
Fine-tuning the pre-trained CodeBERT model on individual languages
lang = 'go'
cd CodeBERT-Implementation
! python run_classifier.py --model_type roberta --task_name codesearch --do_train --do_eval --eval_all_checkpoints --train_file train_short.txt --dev_file valid_short.txt --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir CodeBERT-Implementation/data/codesearch/train_valid/$lang/ --output_dir ./models/$lang/ --model_name_or_path microsoft/codebert-base
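To fine-tune a model for every language in one go, a small driver along the following lines can replay the command above for each language. This is a convenience sketch, not part of the original repo; it assumes the per-language data directories are named as in the --data_dir path above.

```python
import subprocess

# Convenience sketch: fine-tune one model per language with the same hyper-parameters.
for lang in ["ruby", "go", "php", "python", "java", "javascript"]:
    subprocess.run([
        "python", "run_classifier.py",
        "--model_type", "roberta",
        "--task_name", "codesearch",
        "--do_train", "--do_eval", "--eval_all_checkpoints",
        "--train_file", "train_short.txt",
        "--dev_file", "valid_short.txt",
        "--max_seq_length", "50",
        "--per_gpu_train_batch_size", "8",
        "--per_gpu_eval_batch_size", "8",
        "--learning_rate", "1e-5",
        "--num_train_epochs", "1",
        "--gradient_accumulation_steps", "1",
        "--overwrite_output_dir",
        "--data_dir", f"CodeBERT-Implementation/data/codesearch/train_valid/{lang}/",
        "--output_dir", f"./models/{lang}/",
        "--model_name_or_path", "microsoft/codebert-base",
    ], check=True)
```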
Inference and Evaluation
lang = 'go'
idx = 0
! python run_classifier.py --model_type roberta --model_name_or_path microsoft/codebert-base --task_name codesearch --do_predict --output_dir CodeBERT-Implementation/data/models/$lang --data_dir CodeBERT-Implementation/data/codesearch/test/$lang/ --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --test_file batch_short_${idx}.txt --pred_model_dir ./models/$lang/checkpoint-best/ --test_result_dir ./results/$lang/${idx}_batch_result.txt
! python mrr.py
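mrr.py computes the Mean Reciprocal Rank (MRR) over the prediction files. As a rough illustration of the metric (a sketch, not the actual script), assume each query is scored against a pool of candidates and its ground-truth snippet sits at a known index:

```python
import numpy as np

def mean_reciprocal_rank(score_matrix):
    """score_matrix[i, j] = relevance score of candidate j for query i,
    with the ground-truth code for query i assumed to sit at column i."""
    reciprocal_ranks = []
    for i, scores in enumerate(score_matrix):
        rank = int((scores > scores[i]).sum()) + 1  # 1-based rank of the true snippet
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Toy example: two queries, three candidates each.
print(mean_reciprocal_rank(np.array([[0.9, 0.1, 0.2],     # true snippet ranked 1st -> 1.0
                                     [0.8, 0.3, 0.5]])))  # true snippet ranked 3rd -> 1/3
```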
The Mean Reciprocal Rank (MRR), the evaluation metric, for the subset of data is given as follows:
Language | MRR |
---|---|
Ruby | 0.0037 |
Go | 0.0034 |
PHP | 0.0044 |
Python | 0.0052 |
Java | 0.0033 |
JavaScript | 0.0054 |
The scores are much lower than those reported in the paper, which is expected given the drastically reduced training data. However, the purpose of this repo is to give the user ready-to-use CodeBERT data and code without any heavy downloads. We have also included in this repo the prediction results corresponding to the test data.