Korean-Sentence-Embedding
Baseline Models
Baseline models used for Korean sentence embedding - KLUE PLMs
Model | Embedding size | Hidden size | # Layers | # Heads |
---|---|---|---|---|
KLUE-BERT-base | 768 | 768 | 12 | 12 |
KLUE-RoBERTa-base | 768 | 768 | 12 | 12 |
NOTE: All the pretrained models are uploaded to the Hugging Face Model Hub; see https://huggingface.co/klue.
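The KLUE checkpoints can be pulled straight from the Hub. A minimal sketch with transformers, using the public KLUE model IDs:

```python
from transformers import AutoModel, AutoTokenizer

# Load a KLUE baseline PLM from the Hugging Face Model Hub.
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModel.from_pretrained("klue/bert-base")  # or "klue/roberta-base"
```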
How to start
- Get the datasets for training and testing:
```bash
bash get_model_dataset.sh
```
- For quick inference, download the pre-trained models and then run one of the downstream tasks:
```bash
bash get_model_checkpoint.sh
cd KoSBERT/
python SemanticSearch.py
```
Available Models
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [SBERT]-[EMNLP 2019]
- SimCSE: Simple Contrastive Learning of Sentence Embeddings [SimCSE]-[EMNLP 2021]
KoSentenceBERT
🤗 Model Training - Dataset
- Train: snli_1.0_train.ko.tsv (first phase: NLI training), sts-train.tsv (second phase: continued training on STS); see the training sketch below
- Valid: sts-dev.tsv
- Test: sts-test.tsv
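This two-phase recipe follows the original SBERT setup: fit on NLI with a softmax classification head, then continue training on STS with a cosine-similarity regression loss. A minimal sketch with the sentence-transformers API; this is not the repo's exact training script, and read_nli_tsv / read_sts_tsv are hypothetical readers for the TSV files above:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# A plain KLUE-BERT checkpoint is wrapped with mean pooling when loaded this way.
model = SentenceTransformer('klue/bert-base')

# Phase 1: NLI classification over (premise, hypothesis) pairs.
# read_nli_tsv is a hypothetical reader yielding (premise, hypothesis, label in {0, 1, 2}).
nli_examples = [InputExample(texts=[p, h], label=y)
                for p, h, y in read_nli_tsv('snli_1.0_train.ko.tsv')]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=32)
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1, warmup_steps=100)

# Phase 2: continued training on STS, regressing cosine similarity onto gold scores in [0, 1].
# read_sts_tsv is a hypothetical reader yielding (sentence1, sentence2, score in [0, 5]).
sts_examples = [InputExample(texts=[s1, s2], label=score / 5.0)
                for s1, s2, score in read_sts_tsv('sts-train.tsv')]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=32)
model.fit(train_objectives=[(sts_loader, losses.CosineSimilarityLoss(model))],
          epochs=4, warmup_steps=100)
```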
KoSimCSE
🤗 Model Training - Dataset
- Train: snli_1.0_train.ko.tsv + multinli.train.ko.tsv; see the loss sketch below
- Valid: sts-dev.tsv
- Test: sts-test.tsv
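Supervised SimCSE builds (premise, entailment, contradiction) triples from the NLI data: each premise is pulled toward its entailment hypothesis and pushed away from its contradiction hypothesis and from all in-batch negatives. A minimal sketch of that InfoNCE objective in PyTorch; the temperature value is the paper's default, not necessarily this repo's hyperparameter:

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    """InfoNCE over (premise, entailment, contradiction) triples.

    anchor / positive / negative: [batch, dim] sentence embeddings.
    """
    # Cosine similarity of every anchor against every positive and every hard negative.
    sim_pos = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)  # [B, B]
    sim_neg = F.cosine_similarity(anchor.unsqueeze(1), negative.unsqueeze(0), dim=-1)  # [B, B]
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature                        # [B, 2B]
    # The matching positive sits on the diagonal of the first block.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```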
Performance
- Semantic Textual Similarity test set results (Pearson/Spearman correlations x 100)
Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
---|---|---|---|---|---|---|---|---|
KoSBERT†-SKT | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
KoSBERT-base | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
KoSRoBERTa-base | 80.70 | 81.03 | 80.97 | 81.06 | 80.84 | 80.97 | 79.20 | 78.93 |
KoSimCSE-BERT†-SKT | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
KoSimCSE-BERT-base | 82.73 | 83.51 | 82.32 | 82.78 | 82.43 | 82.88 | 77.86 | 76.70 |
KoSimCSE-RoBERTa-base | 83.64 | 84.05 | 83.32 | 83.84 | 83.33 | 83.79 | 80.92 | 79.84 |
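The Cosine Spearman column, for instance, is the Spearman rank correlation between the model's cosine similarities and the gold STS scores. A minimal sketch of that computation, where pairs and gold are placeholders for the sts-test.tsv sentence pairs and their 0-5 human scores:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('../Checkpoint/KoSBERT/kosbert-klue-bert-base')

# pairs: list of (sentence1, sentence2); gold: list of human scores in [0, 5].
emb1 = model.encode([s1 for s1, s2 in pairs], convert_to_tensor=True)
emb2 = model.encode([s2 for s1, s2 in pairs], convert_to_tensor=True)
cosine = util.pytorch_cos_sim(emb1, emb2).diag().cpu().numpy()

print('Cosine Spearman: %.2f' % (spearmanr(cosine, gold).correlation * 100))
```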
Downstream Tasks
- KoSBERT: Semantic Search, Clustering
```bash
python SemanticSearch.py
python Clustering.py
```
- KoSimCSE: Semantic Search
```bash
python SemanticSearch.py
```
Semantic Search (KoSBERT)
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
```
- Results are as follows:

```
Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6141)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5952)
한 남자가 말을 탄다. (Score: 0.1231)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0752)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.0486)

======================

Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6656)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2988)
한 여자가 바이올린을 연주한다. (Score: 0.1566)
한 남자가 말을 탄다. (Score: 0.1112)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0262)

======================

Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7570)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3658)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.3583)
한 남자가 말을 탄다. (Score: 0.0505)
그 여자가 아이를 돌본다. (Score: -0.0087)
```
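Newer sentence-transformers releases also ship a built-in helper for exactly this loop. The same search can be written more compactly as below (a sketch against the library's util.semantic_search API, reusing the embedder, corpus, and queries defined above):

```python
from sentence_transformers import util

query_embeddings = embedder.encode(queries, convert_to_tensor=True)

# hits is one list per query, each entry a dict with 'corpus_id' and 'score',
# already sorted by decreasing cosine similarity.
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=5)

for query, query_hits in zip(queries, hits):
    print("Query:", query)
    for hit in query_hits:
        print(corpus[hit['corpus_id']], "(Score: %.4f)" % hit['score'])
```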
Clustering (KoSBERT)
```python
from sentence_transformers import SentenceTransformer

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i + 1)
    print(cluster)
    print("")
```
- Results are as follows:

```
Cluster 1
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster 2
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']

Cluster 3
['한 남자가 말을 탄다.', '두 남자가 수레를 숲 속으로 밀었다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster 4
['치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster 5
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']
```
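One detail worth noting: KMeans clusters by Euclidean distance, while the similarity scores above are cosine-based. L2-normalizing the embeddings first makes squared Euclidean distance equal to 2 - 2 * cosine similarity, so the two notions agree. A small variation on the script above:

```python
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# L2-normalize so KMeans' Euclidean distance is monotonic in cosine similarity.
normalized_embeddings = normalize(corpus_embeddings)
cluster_assignment = KMeans(n_clusters=5, n_init=10).fit_predict(normalized_embeddings)
```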
References
```bibtex
@misc{park2021klue,
  title={KLUE: Korean Language Understanding Evaluation},
  author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
  year={2021},
  eprint={2105.09680},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{gao2021simcse,
  title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
  author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2021}
}

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}

@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={http://arxiv.org/abs/1908.10084}
}
```