KoBERT - Korean BERT pre-trained cased (KoBERT)

SK T-Brain

Last update: Jan 2, 2023

Related tags

Text Data & NLP nlp transformers pytorch language-model bert korean-nlp

Overview

KoBERT

Korean BERT pre-trained cased (KoBERT)

Why?

Limited Korean performance of Google's BERT base multilingual cased model

Training Environment

Architecture

predefined_args = {
    'attention_cell': 'multi_head',
    'num_layers': 12,
    'units': 768,
    'hidden_size': 3072,
    'max_length': 512,
    'num_heads': 12,
    'scaled': True,
    'dropout': 0.1,
    'use_residual': True,
    'embed_size': 768,
    'embed_dropout': 0.1,
    'token_type_vocab_size': 2,
    'word_embed': None,
}

Training set

Data	Sentences	Words
Korean Wikipedia	5M	54M

Training environment
- V100 GPU x 32, Horovod (with InfiniBand)

Vocabulary
- Size: 8,002
- SentencePiece tokenizer trained on Korean Wikipedia
- Fewer parameters (92M < 110M)
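
The 92M figure can be sanity-checked from the architecture settings listed above. The sketch below counts weights for a standard BERT encoder (embeddings, 12 Transformer layers, pooler). The function name and the assumption that biases and LayerNorm parameters are included are illustrative and are not taken from the KoBERT code.

```python
def bert_param_count(vocab_size, units=768, hidden_size=3072, num_layers=12,
                     max_length=512, token_type_vocab_size=2):
    """Rough parameter count for a BERT encoder with a pooler (assumed layout)."""
    layer_norm = 2 * units  # gamma and beta
    # Word, position and token-type embedding tables plus one LayerNorm
    embeddings = (vocab_size + max_length + token_type_vocab_size) * units + layer_norm
    # Query, key, value and output projections, each with a bias
    attention = 4 * (units * units + units)
    # Two dense layers of the position-wise feed-forward block
    feed_forward = (units * hidden_size + hidden_size) + (hidden_size * units + units)
    layer = attention + layer_norm + feed_forward + layer_norm
    pooler = units * units + units
    return embeddings + num_layers * layer + pooler


kobert = bert_param_count(vocab_size=8002)
print(f"{kobert / 1e6:.1f}M")  # roughly 92M with the 8,002-entry vocabulary
```

With a 30,522-entry vocabulary (the size of the original English BERT base vocabulary, given here only for comparison) the same function returns a little under 110M, so the saving comes from the smaller embedding table.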

Requirements

See requirements.txt

How to install

Install KoBERT as a Python package

pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

If you want to modify the source code, clone this repository instead

git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt

How to use

Using with PyTorch

If you prefer the Hugging Face transformers API, see kobert_hf in this repository.

>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab  = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)

The model is returned in eval() mode by default, so call model.train() to switch it to training mode before using it for training.
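
The input_ids and input_mask tensors in the example above follow the usual BERT convention: shorter sequences are padded to the batch length, and the mask is 1 for real tokens and 0 for padding. A minimal sketch of that step in plain Python follows. The helper name build_inputs is hypothetical, and pad_id=0 only mirrors the example above; check the vocabulary for the actual padding index.

```python
def build_inputs(batch_token_ids, pad_id=0):
    """Pad variable-length ID lists and build the matching attention mask.

    pad_id=0 mirrors the README example and is an assumption, not a value
    read from the KoBERT vocabulary.
    """
    max_len = max(len(ids) for ids in batch_token_ids)
    input_ids, input_mask = [], []
    for ids in batch_token_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        input_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, input_mask


input_ids, input_mask = build_inputs([[31, 51, 99], [15, 5]])
print(input_ids)   # [[31, 51, 99], [15, 5, 0]]
print(input_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Wrapping both lists in torch.LongTensor gives tensors of the same shape as those passed to the model above.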

Naver Sentiment Analysis Fine-Tuning with PyTorch
- On Colab, we recommend enabling a hardware accelerator (GPU) via [Runtime] - [Change runtime type].

Using with ONNX

>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids':np.array(input_ids),
...                             'token_type_ids':np.array(token_type_ids),
...                             'input_mask':np.array(input_mask),
...                             'position_ids':np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452,  0.24282141,  0.25895312, ..., -0.48613444,
        -0.07305173,  0.07560554],
       [-0.24783179,  0.24200465,  0.25520486, ..., -0.4877185 ,
        -0.0727044 ,  0.07536091],
       [-0.24721591,  0.24196623,  0.2560626 , ..., -0.48743123,
        -0.07326943,  0.07650235]], dtype=float32)

soeque1 helped with the ONNX conversion.

Using with MXNet-Gluon

>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372  0.24282135  0.2589539  ... -0.48613444 -0.07305248
   0.07560539]
 [-0.24783105  0.242005    0.25520545 ... -0.48771808 -0.07270523
   0.07536077]
 [-0.24721491  0.241966    0.25606337 ... -0.48743105 -0.07327032
   0.07650219]]
<NDArray 3x768 @cpu(0)>

Naver Sentiment Analysis Fine-Tuning with MXNet

Tokenizer

Pretrained Sentencepiece tokenizer

>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer
>>> tok_path = get_tokenizer()
>>> sp  = SentencepieceTokenizer(tok_path)
>>> sp('한국어 모델을 공유합니다.')
['▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.']
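
In the output above, the '▁' character (U+2581) marks a piece that starts a new word. As a small sketch of how the pieces map back to text, the hypothetical helper below joins them and turns each marker into a space. It is plain Python and does not call the tokenizer itself.

```python
def detokenize(pieces):
    """Join SentencePiece pieces back into text; U+2581 marks a preceding space."""
    return "".join(pieces).replace("\u2581", " ").strip()


pieces = ["\u2581한국", "어", "\u2581모델", "을", "\u2581공유", "합니다", "."]
print(detokenize(pieces))  # 한국어 모델을 공유합니다.
```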

Subtasks

Naver Sentiment Analysis

Dataset : https://github.com/e9t/nsmc

Model	Accuracy
BERT base multilingual cased	0.875
KoBERT	0.901
KoGPT2	0.899

Korean named entity recognizer built with KoBERT and CRF

https://github.com/eagle705/pytorch-bert-crf-ner

문장을 입력하세요:  SKTBrain에서 KoBERT 모델을 공개해준 덕분에 BERT-CRF 기반 객체명인식기를 쉽게 개발할 수 있었다.
len: 40, input_token:['[CLS]', '▁SK', 'T', 'B', 'ra', 'in', '에서', '▁K', 'o', 'B', 'ER', 'T', '▁모델', '을', '▁공개', '해', '준', '▁덕분에', '▁B', 'ER', 'T', '-', 'C', 'R', 'F', '▁기반', '▁', '객', '체', '명', '인', '식', '기를', '▁쉽게', '▁개발', '할', '▁수', '▁있었다', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] <SKTBrain:ORG>에서 <KoBERT:POH> 모델을 공개해준 덕분에 <BERT-CRF:POH> 기반 객체명인식기를 쉽게 개발할 수 있었다.[SEP]
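
The step from pred_ner_tag to decoding_ner_sentence amounts to grouping B-/I- tagged pieces into entity spans. The sketch below shows one way to do that grouping on the first tokens of the output above. The function extract_entities is hypothetical and is not the implementation used in pytorch-bert-crf-ner.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged SentencePiece tokens into (text, label) pairs."""
    entities = []
    pieces, label = [], None
    # A trailing ("", "O") sentinel flushes a span that runs to the end
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        if tag.startswith("I-") and pieces and tag[2:] == label:
            pieces.append(token)  # continue the current span
            continue
        if pieces:  # any other tag closes the current span
            text = "".join(pieces).replace("\u2581", " ").strip()
            entities.append((text, label))
            pieces, label = [], None
        if tag.startswith("B-"):
            pieces, label = [token], tag[2:]  # open a new span
    return entities


tokens = ["[CLS]", "\u2581SK", "T", "B", "ra", "in", "에서",
          "\u2581K", "o", "B", "ER", "T", "\u2581모델", "을"]
tags = ["[CLS]", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O",
        "B-POH", "I-POH", "I-POH", "I-POH", "I-POH", "O", "O"]
print(extract_entities(tokens, tags))
# [('SKTBrain', 'ORG'), ('KoBERT', 'POH')]
```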

Release

v0.2.1
- guide default 'import statements'
v0.2
- download large files from aws s3
- rename functions
v0.1.2
- Guaranteed compatibility with higher versions of transformers
- fix pad token index id
v0.1.1
- Integrated the vocabulary and tokenizer
v0.1
- Initial model release

Contacts

Please file KoBERT-related issues here.

License

KoBERT is released under the Apache-2.0 license. Please comply with the license terms when using the model and code. The full license text is available in the LICENSE file.

Comments

Error loading get_pytorch_kobert_model

The following call, which loads the BERT model and vocabulary, keeps failing with the error below. How can I resolve it?

bertmodel, vocab = get_pytorch_kobert_model()

ConnectionError: HTTPSConnectionPool(host='kobert.blob.core.windows.net', port=443): Max retries exceeded with url: /models/kobert/pytorch/kobert_v1.zip (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f2a668bf850>: Failed to establish a new connection: [Errno -2] Name or service not known'))
help wanted

opened by GohunPark 23
Question about installing onnxruntime==0.3.0
🐛 Bug

I cannot resolve the onnxruntime==0.3.0 installation error.

I need advice!!!🙏

To Reproduce

pip install git+https://git@github.com/SKTBrain/KoBERT.git@master failed

git clone failed

I downloaded the whl file to install onnxruntime 0.3.0 manually

The installation still failed.

Please describe the steps to reproduce the bug.

pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Expected behavior

The kobert requirements are installed

Environment

GCP VM instance

Ubuntu 20.04 LTS

python 3.7.12

Additional context

How did others who hit the same error install onnxruntime? I badly need help 🥲
bug
opened by EZYOON 15
[BUG] nbest_size and other errors when using the tokenizer with kobert on Colab
🐛 Bug

When running kobert on Colab, errors such as the nbest_size error occur when the tokenizer is used.

To Reproduce

!pip install ipywidgets  # for vscode
!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm.notebook import tqdm

from kobert import get_tokenizer
from kobert import get_pytorch_kobert_model

from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup

# CPU
device = torch.device("cpu")

# GPU
device = torch.device("cuda:0")

bertmodel, vocab = get_pytorch_kobert_model(cachedir=".cache")

!wget -O .cache/ratings_train.txt http://skt-lsl-nlp-model.s3.amazonaws.com/KoBERT/datasets/nsmc/ratings_train.txt
!wget -O .cache/ratings_test.txt http://skt-lsl-nlp-model.s3.amazonaws.com/KoBERT/datasets/nsmc/ratings_test.txt

dataset_train = nlp.data.TSVDataset(".cache/ratings_train.txt", field_indices=[1,2], num_discard_samples=1)
dataset_test = nlp.data.TSVDataset(".cache/ratings_test.txt", field_indices=[1,2], num_discard_samples=1)

tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len, pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

# Setting parameters
max_len = 64
batch_size = 64
warmup_ratio = 0.1
num_epochs = 5
max_grad_norm = 1
log_interval = 200
learning_rate = 5e-5

data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

RuntimeError Traceback (most recent call last) in ()
----> 1 data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
      2 data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

8 frames
/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py in Encode(self, input, out_type, add_bos, add_eos, reverse, emit_unk_piece, enable_sampling, nbest_size, alpha, num_threads)
    502 nbest_size == 1 or alpha is None):
    503 raise RuntimeError(
--> 504 'When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", '
    505 'and "alpha". "nbest_size" is enabled only on unigram mode ignored in BPE-dropout. '
    506 'when "nbest_size = -1" , this method samples from all candidates on the lattice '

RuntimeError: When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", and "alpha". "nbest_size" is enabled only on unigram mode ignored in BPE-dropout. when "nbest_size = -1" , this method samples from all candidates on the lattice instead of nbest segmentations.

Steps to reproduce: the code is identical to the code published in the tutorial on NSMC classification with KoBERT.

Environment

Colab
bug
opened by KangYunPark 11
Question about switching from get_pytorch_kobert_model to kobert_hf
Because of the get_pytorch_kobert_model loading error, I am hurriedly switching my code to the model at https://github.com/SKTBrain/KoBERT/tree/master/kobert_hf, but I have run into trouble. I changed

tokenizer = get_tokenizer()
bertmodel, vocab = get_pytorch_kobert_model()

to

tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel, vocab = get_kobert_model('skt/kobert-base-v1',tokenizer.vocab_file)

but the following code now raises TypeError: not a string. How can I fix this?

tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)
data_train = BERTDataset(dataset_train, 0, 1,tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

I am using the following code for BERTDataset.

class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len, pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))
opened by jiwon199 8
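[BUG] Works fine on Colab, but installing kobert on Windows keeps failing.
🐛 Bug

kobert works fine on Colab, but installing it on Windows keeps failing. I tried removing and reinstalling gluonnlp, onnxruntime, and mxnet, and I also tried installing all the required versions before installing kobert, but it still fails. Removing and reinstalling numpy did not help either. On Colab, the single line

!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

was enough, so I do not understand what the difference is.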

To Reproduce

ERROR: Cannot install kobert because these package versions have conflicting dependencies.

The conflict is caused by:
    onnxruntime 1.8.0 depends on numpy>=1.16.6
    gluonnlp 0.6.0 depends on numpy
    mxnet 1.4.0.post0 depends on numpy<1.15.0 and >=1.8.2
    onnxruntime 1.8.0 depends on numpy>=1.16.6
    gluonnlp 0.6.0 depends on numpy
    mxnet 1.4.0 depends on numpy<1.15.0 and >=1.8.2

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
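
The resolver gives up because the numpy ranges quoted in the error do not overlap: onnxruntime 1.8.0 wants numpy>=1.16.6 while mxnet 1.4.0 wants numpy<1.15.0. The sketch below checks this on a few sample numpy versions. It is a simplified comparison for illustration only, not how pip resolves dependencies, and the helper names are hypothetical.

```python
def parse(version):
    """Turn '1.16.6' into (1, 16, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, lower=None, upper=None):
    """True if lower <= version < upper (either bound may be None)."""
    v = parse(version)
    if lower is not None and v < parse(lower):
        return False
    if upper is not None and v >= parse(upper):
        return False
    return True


candidates = ["1.14.6", "1.15.0", "1.16.6", "1.19.5"]  # sample numpy releases
onnxruntime_ok = [v for v in candidates if satisfies(v, lower="1.16.6")]
mxnet_ok = [v for v in candidates if satisfies(v, lower="1.8.2", upper="1.15.0")]
both = sorted(set(onnxruntime_ok) & set(mxnet_ok))
print(both)  # [] because no version satisfies both ranges
```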

Please describe the steps to reproduce the bug.

Run with anaconda on Windows

Attempt to install kobert with pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Expected behavior

Because the download itself fails, I cannot run any code at all.

Environment

windows10
conda 4.10.3
python 3.7.12
numpy 1.16.6

Additional context
help wanted question
opened by jellypower 7

Error when getting model because of Transformers version

Symptom
- After installing as described in the install section of README.md, running the code below raises an error

import torch
from kobert.pytorch_kobert import get_pytorch_kobert_model
model, vocab  = get_pytorch_kobert_model()

Traceback (most recent call last):
  File "pp.py", line 3, in <module>
    model, vocab  = get_pytorch_kobert_model()
  File "/home/jjlee/KoBERT/kobert/pytorch_kobert.py", line 64, in get_pytorch_kobert_model
    return get_kobert_model(model_path, vocab_path, ctx)
  File "/home/jjlee/KoBERT/kobert/pytorch_kobert.py", line 69, in get_kobert_model
    bertmodel.load_state_dict(torch.load(model_file))
  File "/home/jjlee/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BertModel:
        Missing key(s) in state_dict: "embeddings.position_ids".

Cause
- Installing from requirements.txt pulls in the latest version of each package, and the problem appears to be caused by a version difference in the transformers package

bug

opened by bytecell 6

The Naver movie review classification Colab code raises an error.

Hello, I ran into an error while running the example code. While running naver_review_classifications_gluon_bert.ipynb, the line

bert_base, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False, ctx=ctx)

raises the following error.

TypeError Traceback (most recent call last) in ()
----> 1 bert_base, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False, ctx=ctx)

1 frames
/usr/local/lib/python3.6/dist-packages/kobert/mxnet_kobert.py in get_kobert_model(model_file, vocab_file, use_pooler, use_decoder, use_classifier, ctx)
     90 output_attention=False,
     91 output_all_encodings=False,
---> 92 use_residual=predefined_args['use_residual'])
     93
     94 # BERT

TypeError: __init__() got an unexpected keyword argument 'attention_cell'

Could you tell me how to fix this?
bug

opened by sociengineer 6

Loading the pre-trained model fails with Transformers==3.2.0.

Hello. I am filing this issue because model loading fails after a change of transformers version.

(transformers==3.0.2)

Confirmed to work correctly.

import torch
from kobert.pytorch_kobert import get_pytorch_kobert_model

input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
model, vocab  = get_pytorch_kobert_model()

sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
pooled_output.shape  # torch.Size([2, 768])

(transformers==3.2.0)

The same code fails to load the model.

~/pyenv/versions/3.6.9/envs/envs/lib/python3.6/site-packages/kobert/pytorch_kobert.py in get_kobert_model(model_file, vocab_file, ctx)
     67 def get_kobert_model(model_file, vocab_file, ctx="cpu"):
     68     bertmodel = BertModel(config=BertConfig.from_dict(bert_config))
---> 69     bertmodel.load_state_dict(torch.load(model_file))
     70     device = torch.device(ctx)
     71     bertmodel.to(device)

~/pyenv/versions/3.6.9/envs/envs/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1043         if len(error_msgs) > 0:
   1044             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1045                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1046         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1047 

RuntimeError: Error(s) in loading state_dict for BertModel:
        Missing key(s) in state_dict: "embeddings.position_ids".

enhancement

opened by lovit 5

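Similarity analysis with KoBERT
Hello.

I would like to analyze the similarity between sentences using the embeddings already built into KoBERT. I cannot use the internet, so I load the model and vocab as follows.

model, vocab = get_kobert_model(u'...\KoBERT\pytorch_kobert_2439f391a6.params', u'...\KoBERT\kobert_news_wiki_ko_cased-1087f8699e.spiece', 'cpu')

Even after looking at the classification example, I could not find the part that turns a given sentence into a vector, so I am posting here. Is there a way to vectorize sentences and compare their similarity?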
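
One common approach, offered here as an assumption rather than guidance from the KoBERT maintainers, is to reduce each sentence to a single vector (for example pooled_output, or the mean of sequence_output over non-padded tokens) and compare the vectors with cosine similarity. The sketch below shows only the pooling and comparison arithmetic on toy vectors. The helper names are hypothetical and the numbers are not model outputs.

```python
import math


def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (ignore padding)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "token vectors"; real KoBERT vectors have 768 dimensions
sent_a = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], mask=[1, 1, 0])
sent_b = mean_pool([[2.0, 3.0], [2.0, 3.0]], mask=[1, 1])
print(sent_a, sent_b, cosine_similarity(sent_a, sent_b))
```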
question
opened by HyeyeonKoo 5
[BUG] Error when installing kobert in a Windows PyCharm environment
🐛 Bug

I tried to install kobert on Windows 10 with PyCharm, but an error occurs.

To Reproduce

Please describe the steps to reproduce the bug.

pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Expected behavior

kobert installation completes

Environment

windows10 python 3.7

These are the other packages installed in PyCharm.

Additional context

Even after installing mxnet and onnxruntime and then running pip install git+https://git@github.com/SKTBrain/KoBERT.git@master, the same error occurs. It looks like the same error reported in an earlier issue, but I have not been able to resolve it.
bug help wanted
opened by Blue-Kite 4
Issue with token_type_ids
Hello. First of all, thank you very much for sharing this great source code.

I am writing because token_type_ids do not seem to be generated correctly. The code I used is below (I used the transformers library).

import torch
from transformers import (
    BertModel,
    BertForMaskedLM,
    DataCollatorForLanguageModeling
)
from kobert_tokenizer import KoBERTTokenizer
from torch.utils.data import DataLoader

tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
model = BertForMaskedLM.from_pretrained('skt/kobert-base-v1')
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)

When I inspected a batch produced by the DataLoader, I found that token_type_ids at the padded positions were set to 3. Because of this, passing token_type_ids to the model as input raised an error.
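
The architecture section of the README lists token_type_vocab_size as 2, so only segment IDs 0 and 1 have an embedding, which is consistent with an ID of 3 causing an error. One possible workaround, sketched below as an assumption and not as a confirmed fix from the maintainers, is to reset the segment IDs at padded positions before calling the model. The helper name is hypothetical and the code uses plain lists rather than tensors.

```python
def fix_token_type_ids(token_type_ids, attention_mask, num_types=2):
    """Zero out segment IDs on padded positions and check the rest are in range."""
    fixed = []
    for type_row, mask_row in zip(token_type_ids, attention_mask):
        # Keep the ID where the mask is 1, use 0 where the position is padding
        row = [t if m == 1 else 0 for t, m in zip(type_row, mask_row)]
        if any(t < 0 or t >= num_types for t in row):
            raise ValueError("token_type_ids outside the embedding range")
        fixed.append(row)
    return fixed


batch_types = [[0, 0, 0, 3, 3], [0, 0, 0, 0, 0]]
batch_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
print(fix_token_type_ids(batch_types, batch_mask))
# [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
```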

If I have done or understood something wrong, please let me know. Thank you.
bug
opened by Yebin46 4
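[BUG] A TypeError that did not occur on Colab appears after moving to Windows
🐛 Bug

I am working on a project that uses the kobert model. On Colab the values are printed without any problem, but after moving to Windows I get this error. How should I resolve it?

To Reproduce

Please describe the steps to reproduce the bug.

Expected behavior

The result is produced without getting stuck.

Environment

I ran this in a conda virtual environment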

Additional context
bug
opened by GoGwang 0
Error extracting kobert hidden states vectors

I trained a classifier with kobert and saved it as a PyTorch model. After fine-tuning, I want to feed new data to the model and use the hidden_states values from its output. I fine-tuned with transformers BertForSequenceClassification.from_pretrained() and set the option output_hidden_states = True, then reloaded the model and passed it the dataset. The printed output has shape 1 12 512 512 instead of 1 12 512 768, so I am not getting the 768-dimensional embedding vectors. I want to use the hidden_states of the fine-tuned kobert model to obtain the CLS token vector of each sentence. I would be very grateful if you could explain how to configure and retrieve the output.
bug

opened by shinebin0501 0

Owner

SK T-Brain

Artificial Intelligence

GitHub

KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

797 Dec 26, 2022

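Champion solution for the Tianchi Chinese medicine instruction entity recognition challenge; Chinese named entity recognition; NER; BERT-CRF & BERT-SPAN & BERT-MRC; PyTorch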

751 Dec 30, 2022

A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : modeling / data collection / project design / back-end 김종윤 : data collection / project design / front-end Quick Start C

94 Dec 8, 2022

A BERT-based reverse dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : modeling / data collection / project design / back-end 김종윤 : data collection / project design / front-end / back-end 임용

94 Dec 8, 2022

Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers This project was created to make the KoBERT model easier to use with sentence-transformers. In the Ko-Sentence-BERT-SKTBERT project, the KoBERT model sentence-trans

40 Dec 20, 2022

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

72 Dec 9, 2022

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

1.1k Jan 3, 2023

MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

MILES Multilingual Lexical Simplifier Explore the docs » Read LSBert Paper · Report Bug · Request Feature About The Project MILES is a multilingual te

45 Oct 19, 2022

LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

14 Aug 24, 2022

VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

44 Nov 1, 2022

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

59 Dec 1, 2022

A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far).

Simplified Chinese | English Parallel speech synthesis [TOC] News 2021/04/20 Merged the wavegan branch into the main branch and deleted the wavegan branch! 2021/04/13 Created the encoder branch for developing the voice style transfer module! 2021/04/13 The softdtw branch supports using Sof

161 Dec 19, 2022

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

Baseline code for Korean open domain question answering(ODQA)

Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

Korean stereotype detector with TUNiB-Electra and K-StereoSet

Transformer Based Korean Sentence Spacing Corrector

🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)