Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and the kakaobrain KorNLU datasets

Overview

KoSimCSE

  • A PyTorch implementation of Korean Simple Contrastive Learning of Sentence Embeddings (SimCSE)

Installation

git clone https://github.com/BM-K/KoSimCSE.git
cd KoSimCSE
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
pip install .
cd ..
pip install -r requirements.txt
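
If pip reports dependency conflicts during installation, one workaround reported in the Comments section below is to pin the following package versions afterwards (versions taken from that report; adjust for your environment):

pip install transformers==4.8.1 folium==0.2.1 tensorboardX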

Training (supervised only)

  • Model: SKT KoBERT

  • Dataset: kakaobrain KorNLU (KorNLI for training, KorSTS for validation and test)

  • Setting

    • epochs: 3
    • dropout: 0.1
    • batch size: 256
    • temperature: 0.05
    • learning rate: 5e-5
    • warm-up ratio: 0.05
    • max sequence length: 50
    • evaluation steps during training: 250
  • Run train -> test -> semantic_search (the training objective is sketched below)

bash run_example.sh
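
For reference, here is a minimal sketch of the supervised SimCSE objective these settings feed into: NLI entailment pairs serve as positives, contradiction pairs as in-batch hard negatives, and the temperature scales the cosine similarities. This follows the standard formulation from the SimCSE paper and is not necessarily this repository's exact training code; the helper name is ours.

import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    # Illustrative helper, not from this repo.
    # anchor/positive/negative: (batch, hidden) sentence embeddings,
    # e.g. the BERT [CLS] representations this repository uses.
    pos_sim = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=2)  # (B, B)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negative.unsqueeze(0), dim=2)  # (B, B)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 2B)
    # The i-th anchor's true positive lies on the diagonal of pos_sim.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)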

Pre-Trained Models

  • Uses the BERT [CLS] token representation as the sentence embedding
  • Pre-trained model checkpoint
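
A minimal loading sketch using the helper functions from SemanticSearch.py below; the checkpoint path is an assumption and should point to wherever the checkpoint was trained or downloaded:

from data.dataloader import convert_to_tensor, example_model_setting

# Path to the trained/downloaded checkpoint (assumed; adjust as needed)
model, transform, device = example_model_setting('./output/best_checkpoint.pt')

# encode() returns the [CLS] representation for each input sentence
embeddings = model.encode(convert_to_tensor(['한 남자가 음식을 먹는다.'], transform), device)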

Performance

Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman
------|----------------|-----------------|-------------------|--------------------|-------------------|--------------------|-------------|-------------
KoSBERT_SKT* | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22
KoSimCSE_SKT | 81.55 | 82.11 | 81.70 | 81.69 | 81.65 | 81.60 | 78.19 | 77.18
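
The scores are Pearson and Spearman correlations (scaled by 100) between the model's sentence-pair similarities and the gold KorSTS labels. As a generic illustration (not this repository's evaluation code; the helper name is ours), the Cosine Spearman column can be computed as follows:

import numpy as np
from scipy import stats

def cosine_spearman(emb_a, emb_b, gold_scores):
    # Illustrative helper, not from this repo.
    # Cosine similarity for each sentence pair (row-wise)
    sims = np.sum(emb_a * emb_b, axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
    # Spearman rank correlation against the gold STS scores, scaled by 100
    return stats.spearmanr(sims, gold_scores).correlation * 100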

Example Downstream Task

Semantic Search

python SemanticSearch.py
import numpy as np
from model.utils import pytorch_cos_sim
from data.dataloader import convert_to_tensor, example_model_setting


def main():
    model_ckpt = './output/nli_checkpoint.pt'
    model, transform, device = example_model_setting(model_ckpt)

    # Corpus with example sentences
    corpus = ['한 남자가 음식을 먹는다.',
              '한 남자가 빵 한 조각을 먹는다.',
              '그 여자가 아이를 돌본다.',
              '한 남자가 말을 탄다.',
              '한 여자가 바이올린을 연주한다.',
              '두 남자가 수레를 숲 속으로 밀었다.',
              '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
              '원숭이 한 마리가 드럼을 연주한다.',
              '치타 한 마리가 먹이 뒤에서 달리고 있다.']

    inputs_corpus = convert_to_tensor(corpus, transform)

    corpus_embeddings = model.encode(inputs_corpus, device)

    # Query sentences:
    queries = ['한 남자가 파스타를 먹는다.',
               '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
               '치타가 들판을 가로 질러 먹이를 쫓는다.']

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    for query in queries:
        query_embedding = model.encode(convert_to_tensor([query], transform), device)
        cos_scores = pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
        cos_scores = cos_scores.cpu().detach().numpy()

        # argpartition with kth=range(top_k) places the top_k highest-scoring
        # indices first, already sorted in descending order of score
        top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

        print("\n\n======================\n\n")
        print("Query:", query)
        print("\nTop 5 most similar sentences in corpus:")

        for idx in top_results[0:top_k]:
            print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))


if __name__ == '__main__':
    main()

Result

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6002)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5938)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.0696)
한 남자가 말을 탄다. (Score: 0.0328)
원숭이 한 마리가 드럼을 연주한다. (Score: -0.0048)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6489)
한 여자가 바이올린을 연주한다. (Score: 0.3670)
한 남자가 말을 탄다. (Score: 0.2322)
그 여자가 아이를 돌본다. (Score: 0.1980)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1628)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7756)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1814)
한 남자가 말을 탄다. (Score: 0.1666)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.1530)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1270)

Citing

SimCSE

@article{gao2021simcse,
  title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
  author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
  journal={arXiv preprint arXiv:2104.08821},
  year={2021}
}

KorNLU Datasets

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}

Comments
  • [IndexError] tuple index out of range

    Environment: Colab Pro

    !git clone https://github.com/BM-K/KoSimCSE.git
    %cd KoSimCSE
    !git clone https://github.com/SKTBrain/KoBERT.git
    %cd KoBERT
    !pip install -r requirements.txt
    !pip install .
    %cd ..
    !pip install -r requirements.txt
    
    !pip install transformers==4.8.1
    !pip install folium==0.2.1
    !pip install tensorboardX
    

    I pinned these package versions because of compatibility issues.

    !chmod +x /content/KoSimCSE/run_example.sh
    !/content/KoSimCSE/run_example.sh
    

    This is the console output when running with the code above:

    Start Training
    argparse{
     	 opt_level : O1
    	 fp16 : True
    	 train : True
    	 test : False
    	 device : cuda
    	 patient : 10
    	 dropout : 0.1
    	 max_len : 50
    	 batch_size : 256
    	 epochs : 3
    	 eval_steps : 250
    	 seed : 1234
    	 lr : 0.0001
    	 weight_decay : 0.0
    	 warmup_ratio : 0.05
    	 temperature : 0.05
    	 train_data : train_nli_sample.tsv
    	 valid_data : valid_sts_sample.tsv
    	 test_data : test_sts.tsv
    	 task : NLU
    	 path_to_data : ./data/
    	 path_to_save : ./output/
    	 path_to_saved_model : ./output/
    	 ckpt : best_checkpoint.pt 
    }
    using cached model
    using cached model
    using cached model
    Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
    
    Defaults for this optimization level are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Processing user overrides (additional kwargs that are not None)...
    After processing overrides, optimization options are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
    [INFO] 2021-12-23 05:32:45,674 [ Model Setting Complete ] | file::main.py | line::8
    [INFO] 2021-12-23 05:32:45,674 [ Start Training ] | file::main.py | line::11
      0% 0/1 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "main.py", line 28, in <module>
        main(args, logger)
      File "main.py", line 15, in main
        processor.train(epoch+1)
      File "/content/KoSimCSE/model/simcse/processor.py", line 118, in train
        train_loss = self.run(inputs, type='train')
      File "/content/KoSimCSE/model/simcse/processor.py", line 36, in run
        anchor_embeddings, positive_embeddings, negative_embeddings = self.config['model'](inputs, type)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/content/KoSimCSE/model/simcse/bert.py", line 28, in forward
        attention_mask=positive_attention_mask)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 1001, in forward
        return_dict=return_dict,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 589, in forward
        output_attentions,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 475, in forward
        past_key_value=self_attn_past_key_value,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 408, in forward
        output_attentions,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 267, in forward
        mixed_query_layer = self.query(hidden_states)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 103, in forward
        return F.linear(input, self.weight, self.bias)
      File "/content/KoSimCSE/apex/amp/wrap.py", line 21, in wrapper
        args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
      File "/content/KoSimCSE/apex/amp/utils.py", line 97, in cached_cast
        if cached_x.grad_fn.next_functions[1][0].variable is not x:
    IndexError: tuple index out of range
    Start Testing
    argparse{
     	 opt_level : O1
    	 fp16 : True
    	 train : False
    	 test : True
    	 device : cuda
    	 patient : 10
    	 dropout : 0.1
    	 max_len : 50
    	 batch_size : 256
    	 epochs : 3
    	 eval_steps : 250
    	 seed : 1234
    	 lr : 5e-05
    	 weight_decay : 0.0
    	 warmup_ratio : 0.05
    	 temperature : 0.05
    	 train_data : train_nli.tsv
    	 valid_data : valid_sts.tsv
    	 test_data : test_sts_sample.tsv
    	 task : NLU
    	 path_to_data : ./data/
    	 path_to_save : ./output/
    	 path_to_saved_model : ./output/best_checkpoint.pt
    	 ckpt : best_checkpoint.pt 
    }
    using cached model
    using cached model
    using cached model
    Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
    
    Defaults for this optimization level are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Processing user overrides (additional kwargs that are not None)...
    After processing overrides, optimization options are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
    [INFO] 2021-12-23 05:33:01,197 [ Model Setting Complete ] | file::main.py | line::8
    [INFO] 2021-12-23 05:33:01,197 [ Start Test ] | file::main.py | line::18
    Traceback (most recent call last):
      File "main.py", line 28, in <module>
        main(args, logger)
      File "main.py", line 20, in main
        processor.test()
      File "/content/KoSimCSE/model/simcse/processor.py", line 163, in test
        self.config['model'].load_state_dict(torch.load(self.args.path_to_saved_model)['model'], strict=False)
      File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: './output/best_checkpoint.pt'
    Semantic Search
    using cached model
    using cached model
    using cached model
    /content/KoSimCSE/data/dataloader.py:178: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
      inputs = {'source': torch.LongTensor(tensor_corpus),
    
    
    ======================
    
    
    Query: 한 남자가 파스타를 먹는다.
    
    Top 5 most similar sentences in corpus:
    한 남자가 음식을 먹는다. (Score: 0.6002)
    한 남자가 빵 한 조각을 먹는다. (Score: 0.5940)
    치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.0694)
    한 남자가 말을 탄다. (Score: 0.0327)
    원숭이 한 마리가 드럼을 연주한다. (Score: -0.0050)
    
    
    ======================
    
    
    Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.
    
    Top 5 most similar sentences in corpus:
    원숭이 한 마리가 드럼을 연주한다. (Score: 0.6490)
    한 여자가 바이올린을 연주한다. (Score: 0.3669)
    한 남자가 말을 탄다. (Score: 0.2322)
    그 여자가 아이를 돌본다. (Score: 0.1980)
    한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1627)
    
    
    ======================
    
    
    Query: 치타가 들판을 가로 질러 먹이를 쫓는다.
    
    Top 5 most similar sentences in corpus:
    치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7756)
    두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1812)
    한 남자가 말을 탄다. (Score: 0.1667)
    원숭이 한 마리가 드럼을 연주한다. (Score: 0.1530)
    한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1269)
    
    

    I confirmed that the error occurs in the Processor's train function at the line anchor_embeddings, positive_embeddings, negative_embeddings = self.config['model'](inputs, type), but I am wondering whether the problem actually originates when the config is created.

    opened by yiy829 1
  • SemanticSearch.py error

    Hello! When running SemanticSearch.py, the URL for downloading the model and vocab inside get_pytorch_kobert_model() is blocked, and the following error occurs:

    Traceback (most recent call last):
      File "/home/motive/PycharmProjects/KoSimCSE_SKT/KoBERT/kobert/utils.py", line 46, in download
        response = requests.get(url, stream=True)
      File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/api.py", line 75, in get
        return request('get', url, params=params, **kwargs)
      File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kobert.blob.core.windows.net', port=443): Max retries exceeded with url: /models/kobert/pytorch/kobert_v1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2ca9d28a60>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    

    If you have pre-downloaded model and vocab files, could you share them? Or is there another workaround?

    I found a similar error reported in the repositories below and tried loading pre-packaged model and vocab files, but that led to further package dependency problems and errors, so I am asking here. (Referenced repos: https://github.com/SKTBrain/KoBERT, https://github.com/SKTBrain/KoBERT/tree/master/kobert_hf)

    opened by ai-motive 1