# VarCLR: Variable Representation Pre-training via Contrastive Learning
**New:** Paper accepted at ICSE 2022. Preprint available on [arXiv](https://arxiv.org/abs/2112.02650)!

This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach for learning semantic representations of variable names that effectively captures variable similarity, with state-of-the-art results on the IdBench benchmark (ICSE 2021).
## Step 0: Install
```bash
pip install -e .
```
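As a quick post-install sanity check, you can confirm the package imports (a minimal sketch; this does not download any model weights):

```python
# Verify the editable install is importable.
import varclr

print(varclr.__file__)
```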
## Step 1: Load a Pre-trained VarCLR Model
```python
from varclr.models import Encoder

model = Encoder.from_pretrained("varclr-codebert")
```
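The results tables below also report VarCLR-Avg and VarCLR-LSTM variants. Assuming their checkpoints are exposed under analogous names (an assumption; the exact identifiers are not listed here, so check the repository), loading them would look like:

```python
# Assumption: checkpoint names mirror "varclr-codebert"; these identifiers
# are not confirmed by this README, so verify them before relying on this.
model_avg = Encoder.from_pretrained("varclr-avg")
model_lstm = Encoder.from_pretrained("varclr-lstm")
```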
## Step 2: VarCLR Variable Embeddings
**Get embedding of one variable**
```python
emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])
```
**Get embeddings of a list of variables (supports batching)**
```python
emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
```
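Since `encode` returns ordinary `torch` tensors, you can also work with the embeddings directly. A minimal sketch, assuming the similarity scores in the next step correspond to cosine similarity between these vectors (a plausible reading given the contrastive objective, not something stated here):

```python
import torch.nn.functional as F

emb = model.encode(["squareslab", "strudel"])
# Cosine similarity between the two embedding rows; under the assumption
# above, this should roughly match model.score("squareslab", "strudel").
print(F.cosine_similarity(emb[0:1], emb[1:2]).item())
```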
## Step 3: Get VarCLR Similarity Scores
**Get similarity scores of N variable pairs**
print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]
**Get pairwise (N × M) similarity scores from two lists of variables**
```python
variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]

print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
```
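One natural use of `cross_score` is nearest-neighbor lookup over a list of candidate names. A small sketch (the candidate names here are illustrative, not from any benchmark):

```python
# Rank candidate variable names by similarity to a query name.
candidates = ["mean", "total", "minimum", "idx"]
scores = model.cross_score(["average"], candidates)[0]
best = max(range(len(candidates)), key=lambda i: scores[i])
print(candidates[best])  # "mean" is the expected winner, by analogy to the pair scores above
```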
## Step 4: Reproduce IdBench Benchmark Results
**Load the IdBench benchmark**
```python
from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")
```
**Compute VarCLR scores and evaluate**
```python
id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}
```
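The same pattern extends to a full sweep over splits and metrics. A sketch, assuming `"small"` is accepted as a variant alongside `"medium"` and `"large"` (the tables below report a Small split, but only the latter two appear in the examples above):

```python
# Evaluate VarCLR-CodeBERT on every IdBench split and both metrics.
for variant in ["small", "medium", "large"]:  # "small" is assumed; see note above
    for metric in ["similarity", "relatedness"]:
        b = Benchmark.build("idbench", variant=variant, metric=metric)
        print(variant, metric, b.evaluate(model.score(*b.get_inputs())))
```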
**Let's compare with the original CodeBERT**

```python
codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}
```
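For a compact side-by-side view, the encoders and benchmarks defined above can be reused directly (a sketch; each call re-scores the full benchmark):

```python
# Compare VarCLR-CodeBERT against the original CodeBERT on both benchmarks.
for name, enc in [("varclr-codebert", model), ("codebert", codebert)]:
    for b in [b1, b2]:
        print(name, b.evaluate(enc.score(*b.get_inputs())))
```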
## Results on IdBench benchmarks

**Similarity**

| Method | Small | Medium | Large |
|---|---|---|---|
| FT-SG | 0.30 | 0.29 | 0.28 |
| LV | 0.32 | 0.30 | 0.30 |
| FT-cbow | 0.35 | 0.38 | 0.38 |
| VarCLR-Avg | 0.47 | 0.45 | 0.44 |
| VarCLR-LSTM | 0.50 | 0.49 | 0.49 |
| VarCLR-CodeBERT | 0.53 | 0.53 | 0.51 |
| Combined-IdBench | 0.48 | 0.59 | 0.57 |
| Combined-VarCLR | 0.66 | 0.65 | 0.62 |

**Relatedness**

| Method | Small | Medium | Large |
|---|---|---|---|
| LV | 0.48 | 0.47 | 0.48 |
| FT-SG | 0.70 | 0.71 | 0.68 |
| FT-cbow | 0.72 | 0.74 | 0.73 |
| VarCLR-Avg | 0.67 | 0.66 | 0.66 |
| VarCLR-LSTM | 0.71 | 0.70 | 0.69 |
| VarCLR-CodeBERT | 0.79 | 0.79 | 0.80 |
| Combined-IdBench | 0.71 | 0.78 | 0.79 |
| Combined-VarCLR | 0.79 | 0.81 | 0.85 |
## Pre-train your own VarCLR models
Coming soon.
## Cite
If you find VarCLR useful in your research, please cite our ICSE 2022 paper:
```bibtex
@misc{chen2021varclr,
      title={VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning},
      author={Qibin Chen and Jeremy Lacomis and Edward J. Schwartz and Graham Neubig and Bogdan Vasilescu and Claire Le Goues},
      year={2021},
      eprint={2112.02650},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
```