Pretrained Japanese BERT models

Inui Laboratory

Last update: Dec 30, 2022

Overview

This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face.

Model hub: https://huggingface.co/cl-tohoku

For information on the previous versions of our pretrained models, see the v1.0 tag of this repository.

Model Architecture

The architecture of our models is the same as that of the original BERT models proposed by Google.

BERT-base models consist of 12 layers, 768 dimensions of hidden states, and 12 attention heads.
BERT-large models consist of 24 layers, 1024 dimensions of hidden states, and 16 attention heads.
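
As a rough cross-check of these sizes, the sketch below counts the encoder parameters implied by each configuration. It relies on details that are not stated here and are assumed from the original BERT design: a feed-forward size of four times the hidden size, 512 position embeddings, two token-type embeddings, and a pooler layer. The pretraining heads are excluded, and the vocabulary size of 32768 is the one used by the WordPiece models.

```python
def bert_param_count(layers, hidden, vocab_size, max_positions=512, type_vocab=2):
    """Count encoder parameters of a BERT-style model (pretraining heads excluded)."""
    # Assumption: feed-forward size is 4x the hidden size, as in the original BERT.
    ffn = 4 * hidden
    # Token, position, and token-type embeddings, plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections (weights and biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two dense layers (hidden -> ffn -> hidden) with biases, plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768, vocab_size=32768)
large = bert_param_count(layers=24, hidden=1024, vocab_size=32768)
print(f"BERT-base:  {base:,} parameters, {768 // 12} dimensions per attention head")
print(f"BERT-large: {large:,} parameters, {1024 // 16} dimensions per attention head")
```

Under these assumptions the counts come to about 111M and 337M parameters, and both configurations use 64 dimensions per attention head.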

Training Data

The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020.

The generated corpus files are 4.0GB in total and contain approximately 30M sentences. We used the MeCab morphological parser with the mecab-ipadic-NEologd dictionary to split texts into sentences.

$ WORK_DIR="$HOME/work/bert-japanese"

$ python make_corpus_wiki.py \
--input_file jawiki-20200831-cirrussearch-content.json.gz \
--output_file $WORK_DIR/corpus/jawiki-20200831/corpus.txt \
--min_text_length 10 \
--max_text_length 200 \
--mecab_option "-r $HOME/local/etc/mecabrc -d $HOME/local/lib/mecab/dic/mecab-ipadic-neologd-v0.0.7"

# Split corpus files for parallel preprocessing of the files
$ python merge_split_corpora.py \
--input_files $WORK_DIR/corpus/jawiki-20200831/corpus.txt \
--output_dir $WORK_DIR/corpus/jawiki-20200831 \
--num_files 8

# Sample some lines for training tokenizers
$ cat $WORK_DIR/corpus/jawiki-20200831/corpus.txt|grep -v '^$'|shuf|head -n 1000000 \
> $WORK_DIR/corpus/jawiki-20200831/corpus_sampled.txt
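
For readers who prefer to do the sampling step in Python, the following is a minimal sketch of what the `grep -v '^$' | shuf | head -n 1000000` pipeline does: drop empty lines, shuffle, and keep the first N lines. The function name `sample_lines`, the `seed` parameter, and the toy input are illustrative and not part of this repository.

```python
import random


def sample_lines(lines, n, seed=None):
    """Drop empty lines, shuffle the rest, and keep at most n of them."""
    # Like `grep -v '^$'`: only lines with no characters at all are dropped.
    non_empty = [line for line in lines if line.rstrip("\n") != ""]
    # Like `shuf`: the whole input is held in memory and shuffled.
    random.Random(seed).shuffle(non_empty)
    # Like `head -n N`.
    return non_empty[:n]


# Toy input standing in for corpus.txt (blank lines separate documents).
corpus = ["吾輩は猫である。\n", "\n", "名前はまだ無い。\n", "\n", "どこで生れたかとんと見当がつかぬ。\n"]
sampled = sample_lines(corpus, n=2, seed=0)
print("".join(sampled), end="")
```

Like `shuf`, this reads the whole input into memory, which should be acceptable for a 4.0GB corpus on a machine sized for the later steps.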

Tokenization

For each of BERT-base and BERT-large, we provide two models with different tokenization methods.

For wordpiece models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768.
For character models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 6144.

We used fugashi and unidic-lite packages for the tokenization.
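
The two splitting strategies can be illustrated with a small, dependency-free sketch. It assumes the text has already been segmented into words (a hard-coded list stands in for the MeCab output) and applies a vocabulary by greedy longest-match-first, which is how the original BERT tokenizer applies a WordPiece vocabulary. The toy vocabulary and the function names are hypothetical; the released tokenizers should be used for real work.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one pre-tokenized word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def split_characters(word):
    """Character models: every character of the word becomes one token."""
    return list(word)


# Stand-in for MeCab output; the toy vocabulary is purely illustrative.
words = ["青葉", "山", "で", "研究", "を", "し", "て", "い", "ます", "。"]
toy_vocab = {"青", "##葉", "山", "で", "研究", "を", "し", "て", "い", "ます", "。"}

print([piece for word in words for piece in wordpiece(word, toy_vocab)])
print([char for word in words for char in split_characters(word)])
```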

$ WORK_DIR="$HOME/work/bert-japanese"

# WordPiece (unidic_lite)
$ TOKENIZERS_PARALLELISM=false python train_tokenizer.py \
--input_files $WORK_DIR/corpus/jawiki-20200831/corpus_sampled.txt \
--output_dir $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite \
--tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--vocab_size 32768 \
--limit_alphabet 6129 \
--num_unused_tokens 10

# Character
$ mkdir -p $WORK_DIR/tokenizers/jawiki-20200831/character
$ head -n 6144 $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite/vocab.txt \
> $WORK_DIR/tokenizers/jawiki-20200831/character/vocab.txt
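
One plausible reading of why the first 6144 lines form the character vocabulary is that 6144 is the sum of the alphabet limit, the unused tokens, and the special tokens. The check below assumes the five standard BERT special tokens and assumes that vocab.txt lists special and unused tokens first, followed by single characters; neither assumption is stated in this README.

```python
# Assumption: the five standard BERT special tokens lead the vocabulary file.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
NUM_UNUSED_TOKENS = 10  # value passed as --num_unused_tokens
LIMIT_ALPHABET = 6129   # value passed as --limit_alphabet

character_vocab_size = len(SPECIAL_TOKENS) + NUM_UNUSED_TOKENS + LIMIT_ALPHABET
print(character_vocab_size)
```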

Training

The models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps. For the masked language modeling (MLM) objective, we introduced whole word masking, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
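
The idea of whole word masking can be sketched as follows. This is a simplified illustration and not the code used for pretraining: it assumes word boundaries can be recovered from the "##" prefix of continuation pieces, it always substitutes [MASK] (the original BERT recipe sometimes keeps the token or uses a random one), and the 0.15 masking rate is the original BERT default rather than a value stated here.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}


def whole_word_mask(tokens, mask_prob=0.15, max_predictions=80, seed=None):
    """Mask whole words: every subword of a selected word is masked together."""
    # Group token positions into words; a "##" piece joins the preceding word.
    words = []
    for i, token in enumerate(tokens):
        if token in SPECIAL:
            continue
        if token.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    random.Random(seed).shuffle(words)
    budget = min(max_predictions, max(1, round(len(tokens) * mask_prob)))

    masked, num_masked = list(tokens), 0
    for word in words:
        if num_masked >= budget:
            break
        if num_masked + len(word) > budget:
            continue  # masking this word would exceed the budget
        for i in word:
            masked[i] = "[MASK]"
        num_masked += len(word)
    return masked


# Hypothetical segmentation in which 研究 is split into two pieces.
tokens = ["[CLS]", "青葉", "山", "で", "研", "##究", "を", "し", "て", "い", "ます", "。", "[SEP]"]
masked = whole_word_mask(tokens, seed=0)
print(masked)
```

Whatever the random choice, 研 and ##究 are either both masked or both left intact, which is the property that distinguishes whole word masking from masking subword tokens independently.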

To train each model, we used a v3-8 instance of Cloud TPUs provided by the TensorFlow Research Cloud program. Training took about 5 days for the BERT-base models and about 14 days for the BERT-large models.

Creation of the pretraining data

$ WORK_DIR="$HOME/work/bert-japanese"

# WordPiece (unidic_lite)
$ mkdir -p $WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data
# It takes 3h and 420GB RAM, producing 43M instances
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 8 python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/jawiki-20200831/corpus_{}.txt \
--output_file $WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite/vocab.txt \
--tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--max_seq_length 512 \
--max_predictions_per_seq 80 \
--dupe_factor 10

# Character
$ mkdir -p $WORK_DIR/bert/jawiki-20200831/character/pretraining_data
# It takes 4h10m and 615GB RAM, producing 55M instances
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 8 python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/jawiki-20200831/corpus_{}.txt \
--output_file $WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/jawiki-20200831/character/vocab.txt \
--tokenizer_type character \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--max_seq_length 512 \
--max_predictions_per_seq 80 \
--dupe_factor 10
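
To make the `seq | xargs` fan-out above easier to follow, this sketch lists the input and output file pair that each of the 8 parallel jobs works on. `seq -f %02g 1 8` prints 01 through 08, and each value replaces `{}` in the command. The helper name `shard_jobs` and the example work directory are illustrative only.

```python
from pathlib import PurePosixPath


def shard_jobs(work_dir, variant, num_files=8):
    """Return (input_file, output_file) pairs, one per parallel job."""
    corpus_dir = PurePosixPath(work_dir) / "corpus" / "jawiki-20200831"
    out_dir = PurePosixPath(work_dir) / "bert" / "jawiki-20200831" / variant / "pretraining_data"
    jobs = []
    for i in range(1, num_files + 1):
        shard = f"{i:02d}"  # same zero padding as `seq -f %02g`
        jobs.append((
            str(corpus_dir / f"corpus_{shard}.txt"),
            str(out_dir / f"pretraining_data_{shard}.tfrecord.gz"),
        ))
    return jobs


# "/work/bert-japanese" is a placeholder for $WORK_DIR.
for input_file, output_file in shard_jobs("/work/bert-japanese", "character"):
    print(input_file, "->", output_file)
```

Because `-P 8` runs all eight jobs at once, the RAM figures quoted in the comments above presumably refer to the jobs running together; lowering `-P` should trade run time for memory.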

Training of the models

Note: all the necessary files need to be stored in a Google Cloud Storage (GCS) bucket.

# BERT-base, WordPiece (unidic_lite)
$ ctpu up -name tpu01 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your GCS bucket name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-base" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-base/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu01

# BERT-base, Character
$ ctpu up -name tpu02 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your GCS bucket name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/character/bert-base" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/character/bert-base/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu02

# BERT-large, WordPiece (unidic_lite)
$ ctpu up -name tpu03 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your GCS bucket name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-large" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-large/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=5e-5 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu03

# BERT-large, Character
$ ctpu up -name tpu04 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://<your GCS bucket name>/bert-japanese"
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/character/bert-large" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/character/bert-large/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=5e-5 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu04
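
The flags above can be tied back to the stated training configuration with a little arithmetic: 100 epochs of 10000 steps give the 1M training steps, and each step processes 256 x 512 token positions. The learning-rate function below is only an assumed shape (linear warmup followed by linear decay to zero); the schedule actually applied by run_pretraining.py is not described in this README.

```python
def total_steps(num_train_epochs=100, num_steps_per_epoch=10000):
    """Total optimizer steps implied by the two epoch flags."""
    return num_train_epochs * num_steps_per_epoch


def learning_rate(step, peak_lr=1e-4, warmup_steps=10000, total=1_000_000):
    """Assumed schedule: linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total - step)
    return peak_lr * remaining / (total - warmup_steps)


tokens_per_step = 256 * 512  # train_batch_size x max_seq_length
print("total steps:", total_steps())
print("token positions per step:", tokens_per_step)
for step in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(step, learning_rate(step))
```

For the BERT-large runs, `peak_lr=5e-5` would match the `--learning_rate` flag used above.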

Licenses

The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0.

The code in this repository is distributed under the Apache License 2.0.

Related Work

Original BERT model by Google Research Team
- https://github.com/google-research/bert
- https://github.com/tensorflow/models/tree/master/official/nlp/bert (for TensorFlow 2.0)
Juman-tokenized Japanese BERT model
- Author: Kurohashi-Kawahara Laboratory, Kyoto University
- http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT日本語Pretrainedモデル
MeCab-Jumandic-tokenized Japanese BERT model (trained with a large mini-batch size)
- Author: National Institute of Information and Communications Technology (NICT)
- https://alaginrc.nict.go.jp/nict-bert/index.html
Sentencepiece Japanese BERT model
- Author: Yohei Kikuta
- https://github.com/yoheikikuta/bert-japanese
Sentencepiece Japanese BERT model, trained on SNS corpus
- Author: Hottolink, Inc.
- https://github.com/hottolink/hottoSNS-bert

Acknowledgments

The models are trained with Cloud TPUs provided by the TensorFlow Research Cloud program.

Comments

Fine tune

Do you have a guide for fine-tuning bert-japanese?

I tried to fine-tune, and the result is not good. It seems like I did something wrong. Since GPU training is a bit expensive, I would like to have your opinion before fine-tuning again.

Do I need to separate words using mecab-neologd? Do I need to do something to the tokenizer before fine-tuning?

opened by nuwanq 9
About Pre-Training times

Nice to meet you. I will be using this gitlab code to pre-train with CloudTPU (v3-8). I have only done 1000 steps and it was going to take me 4 days to implement 1000000 steps. How many hours (days) did this gitlab pre-training take using CloudTPU(v3-8)?

opened by sezai-rdc 5
Will tokenizer remove stopwords?

I'm using Hugging Face's Japanese tokenizer. The name is 'cl-tohoku/bert-base-japanese-whole-word-masking'. Will the tokenizer or the model remove stopwords automatically?

opened by HeroadZ 4
Get the last output of the model 'cl-tohoku/bert-base-japanese-char-whole-word-masking'

The tokenizer is good for Japanese, but I want to get the last output layer of the model above. I am following the instructions on Hugging Face:

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
model = AutoModelWithLMHead.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]

Then I got len(outputs) = 1. The expected last_hidden_states shape is (batch, seq len, dmodel), but I got (batch, seq len, vocab size).

How can I get the shape (batch, seq len, dmodel) from your model?

opened by demdecuong 4
AutoTokenizer.from_pretrained doesn't work on newer models
Thank you @singletongue for releasing new BERT models at Hugging Face, but their config.json does not include

"tokenizer_class": "BertJapaneseTokenizer",

thus Transformers' AutoTokenizer will use BertTokenizerFast. Please compare new config.json with old one, and please check the blog here written in Japanese.
opened by KoichiYasuoka 3
BertJapaneseTokenizer can find 'cl-tohoku/bert-base-japanese-whole-word-masking' but BertModel cannot ('cl-tohoku/bert-base-japanese-whole-word-masking')
During preprocessing, the following line has no problem.

self.tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

However, during training, I get the following error

Model name 'cl-tohoku/bert-base-japanese-whole-word-masking' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc).

from

BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

Any idea?

In both cases, I installed pytorch-transformers with pip. Thanks in advance for your help.
opened by wailoktam 3
Help on using the model for finetuning

The part about tokenizing with Mecab is clear but what about the sub-word tokenization? And what if there are words found in the data used for finetuning but not found in the data used for pretraining? Some guide on using your pretrained model would be great.

opened by wailoktam 3
Does tokenization.py need to be uploaded to GCP?

Hi, I'm following your scripts to train a BERT model on my own datasets. I trained the tokenizer and created the pretraining data locally, and I am preparing to upload the tfrecord files to a Google Cloud Storage (GCS) bucket for training. Do I need to upload your tokenization.py to replace the one provided by the git-cloned google-bert when training the model? Thanks for your help.

opened by lightercs 2
[Question] About the Char model

Hi, thank you for sharing this project. I want to ask the reason for the MeCab tokenization in the Char model. Is there any difference between "directly split into characters" and "first MeCab tokenization and then split into characters"?

opened by AprilSongRits 2
Please tell us how to cite your model in a paper

Thank you for your great work. I am writing an international workshop paper, and I would like to cite your pretrained Japanese model (which I used in it). If you have not yet decided how it should be cited, please check whether the sample below is OK.

Suzuki Masatoshi (2019) Pretrained Japanese BERT models, GitHub, GitHub repository, https://github.com/cl-tohoku/bert-japanese

Thank you for your help.

opened by nakamolinto 2
Swap Mecab tokenizer with Sentencepiece : possible ?

Hi @cl-tohoku, I wanted to get my hands dirty with your model to finetune a pos model. When going on your model card I wanted to test out your model using the recently released Hosted inference API from hugging face when I got this error: ⚠️ This model could not be loaded by the inference API. ⚠️ Error loading tokenizer No module named 'MeCab' ModuleNotFoundError("No module named 'MeCab'"). Correct me if I'm wrong but wouldn't be possible to swap out the Mecab based tokenizer with sentencepiece using this pretrained weights?

opened by sachaarbonel 2

strange tokenizer results with self-pretrained model

Hi, I trained a new vocab and BERT model with my own datasets following your scripts, with the MeCab dictionary changed. But when I examine it, quite strange results are returned every time. Would you please help me check this and give me some advice?

Details as below: My code:

import torch  # needed for torch.where below
from transformers import BertJapaneseTokenizer, BertForMaskedLM

model_name_or_path = "/content/drive/MyDrive/bert/new_bert/" 
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path, mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"})

model = BertForMaskedLM.from_pretrained(model_name_or_path)
input_ids = tokenizer.encode(f"青葉山で{tokenizer.mask_token}の研究をしています。", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))

masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0].tolist()
print(masked_index)

result = model(input_ids)
pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
for pred_id in pred_ids:
    output_ids = input_ids.tolist()[0]
    output_ids[masked_index] = pred_id
    print(tokenizer.decode(output_ids))

the result:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
Some weights of the model checkpoint at /content/drive/MyDrive/bert/new_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
4
[CLS] 青葉山で ヒダ の研究をしています 。 [SEP]
[CLS] 青葉山で 宿つ の研究をしています 。 [SEP]
[CLS] 青葉山で 法外 の研究をしています 。 [SEP]
[CLS] 青葉山で 頑丈 の研究をしています 。 [SEP]
[CLS] 青葉山で弱 の研究をしています 。 [SEP]

The tokenization result is quite odd in the first place, as shown below, and so are the predictions.

['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']

But when I change to your pretrained tokenizer bert-base-v2 (still using my model), the result changes a lot.

Some weights of the model checkpoint at /content/drive/MyDrive/kindai_bert/kindai_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 旧来 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 生野 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 でד の 研究 を し て い ます 。 [SEP]

My local bert folder is like:

Thank you in advance.

opened by lightercs 12

'Can't convert ['test.txt'] to Trainer' when training a BertWordPieceTokenizer
Hi Team,

I'm trying to train a Japanese BERT with my own data based on yours, and I didn't modify the structure. But when I pass the training data path to train a tokenizer, it goes wrong every time. The error is "Can't convert ['test.txt'] to Trainer".

Here is what I tried:

Passing a single filename or the content of a single file (within the same folder as the train_tokenizers.py file): the error appears.

Passing a list of filenames like ['data_file/0.txt', 'data_file/1.txt', 'data_file/2.txt', 'data_file/3.txt', 'data_file/4.txt'] or a list of single sentences: the same error occurs.

Can you give any advice on this situation? Thanks a lot.
opened by suchunxie 2
SSL error

Max retries exceeded with url: //home/my_username/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking//resolve/main/vocab.txt (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))

opened by leoxu1007 0
The results seem different from Hugging Face...
Thank you for the great model. I tried this model on our lab experiment machine, but the result seems different from the one I get when running it on Hugging Face.

I used this model: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking?text=%E3%83%AA%E3%83%B3%E3%82%B4%5BMASK%5D%E9%A3%9F%E3%81%B9%E3%82%8B%E3%80%82

And I wrote: リンゴ[MASK]食べる。

The model on the web gives:

リンゴを食べる。 0.870
リンゴも食べる。 0.108
リンゴは食べる。 0.009
リンゴのみ食べる。 0.005
リンゴとともに食べる。 0.001

Then I downloaded the model and ran it locally. The output is:

['リンゴ', '[MASK]', '食べる', '。']
Some weights of the model checkpoint at /home/Xu_Zhenyu/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
0 を
1 、
2 も
3 野菜
4 で

The results [を　も　は　のみ　とともに] and [を　、　も　野菜　で] are different. Why?

And I have another question, there are 0.870, 0.108, 0.009 etc on the web. How can I get those numbers locally?

Thank you for your time.
opened by leoxu1007 0

Error when initializing from the transformers pipeline

Hello,

I get an error when trying to initialize models that rely on your tokenizer from the transformers package's pipeline. Here is code that yields the error as well as the traceback.

from transformers import pipeline 

sentiment_analyzer = pipeline(
    "sentiment-analysis", model="cl-tohoku/bert-base-japanese", tokenizer="cl-tohoku/bert-base-japanese")

Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\pipelines\__init__.py", line 377, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(tokenizer, revision=revision, use_fast=use_fast)
  File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 391, in from_pretrained
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
  File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 294, in tokenizer_class_from_name
    if c.__name__ == class_name:
AttributeError: 'NoneType' object has no attribute '__name__'

opened by EtienneGagnon1 7

How to add new vocabulary to vocab.txt

Hi Team,

I want to add new domain-specific words to the tokenizer vocabulary so that I can get better word separation (wakachi-gaki) for words that are not in the default vocab.txt.

Is this the correct way?
1: Manually add words at the bottom of vocab.txt (after the last line)
2: Initialize the tokenizer as below
tokenizer = BertJapaneseTokenizer.from_pretrained("{Directory Path to vocab.txt and config.json etc...}")

Thanks,

opened by kaoruoshita 2

Owner

Inui Laboratory

Inui Laboratory, Tohoku University

