Pretrained Japanese BERT models

Overview

This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face.

For information on the previous versions of our pretrained models, see the v1.0 tag of this repository.
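
As a minimal usage sketch (assuming the transformers, fugashi, and unidic-lite packages are installed, and that "cl-tohoku/bert-base-japanese-v2" is one of the model names published on the Hugging Face Hub; substitute the model you actually want), the models can be loaded as follows:

    import torch
    from transformers import AutoModel, BertJapaneseTokenizer

    model_name = "cl-tohoku/bert-base-japanese-v2"  # assumed Hub identifier

    tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)  # requires fugashi and unidic-lite
    model = AutoModel.from_pretrained(model_name)

    inputs = tokenizer("青葉山で研究をしています。", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])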

Model Architecture

The architecture of our models is the same as that of the original BERT models proposed by Google.

  • BERT-base models consist of 12 layers, 768 dimensions of hidden states, and 12 attention heads.
  • BERT-large models consist of 24 layers, 1024 dimensions of hidden states, and 16 attention heads.
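
In terms of the Transformers library, these hyperparameters correspond roughly to the following configurations (a sketch for illustration only; the config.json files distributed with the models are authoritative, and the intermediate sizes shown are the standard BERT values):

    from transformers import BertConfig

    # BERT-base: 12 layers, hidden size 768, 12 attention heads (vocab size of the wordpiece models).
    base_config = BertConfig(
        vocab_size=32768,
        num_hidden_layers=12,
        hidden_size=768,
        num_attention_heads=12,
        intermediate_size=3072,
    )

    # BERT-large: 24 layers, hidden size 1024, 16 attention heads.
    large_config = BertConfig(
        vocab_size=32768,
        num_hidden_layers=24,
        hidden_size=1024,
        num_attention_heads=16,
        intermediate_size=4096,
    )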

Training Data

The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020.

The generated corpus files are 4.0 GB in total, consisting of approximately 30M sentences. We used the MeCab morphological parser with the mecab-ipadic-NEologd dictionary to split texts into sentences.

$ WORK_DIR="$HOME/work/bert-japanese"

$ python make_corpus_wiki.py \
--input_file jawiki-20200831-cirrussearch-content.json.gz \
--output_file $WORK_DIR/corpus/jawiki-20200831/corpus.txt \
--min_text_length 10 \
--max_text_length 200 \
--mecab_option "-r $HOME/local/etc/mecabrc -d $HOME/local/lib/mecab/dic/mecab-ipadic-neologd-v0.0.7"

# Split corpus files for parallel preprocessing of the files
$ python merge_split_corpora.py \
--input_files $WORK_DIR/corpus/jawiki-20200831/corpus.txt \
--output_dir $WORK_DIR/corpus/jawiki-20200831 \
--num_files 8

# Sample some lines for training tokenizers
$ cat $WORK_DIR/corpus/jawiki-20200831/corpus.txt | grep -v '^$' | shuf | head -n 1000000 \
> $WORK_DIR/corpus/jawiki-20200831/corpus_sampled.txt

Tokenization

For each of BERT-base and BERT-large, we provide two models with different tokenization methods.

  • For wordpiece models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768.
  • For character models, the texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 6144.

We used the fugashi and unidic-lite packages for the tokenization; an example of the resulting tokenization is sketched after the commands below.

$ WORK_DIR="$HOME/work/bert-japanese"

# WordPiece (unidic_lite)
$ TOKENIZERS_PARALLELISM=false python train_tokenizer.py \
--input_files $WORK_DIR/corpus/jawiki-20200831/corpus_sampled.txt \
--output_dir $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite \
--tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--vocab_size 32768 \
--limit_alphabet 6129 \
--num_unused_tokens 10

# Character
$ head -n 6144 $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite/vocab.txt \
> $WORK_DIR/tokenizers/jawiki-20200831/character/vocab.txt
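
To illustrate the difference between the two tokenization methods, here is a sketch (assuming the wordpiece and character models are published on the Hugging Face Hub as "cl-tohoku/bert-base-japanese-v2" and "cl-tohoku/bert-base-japanese-char-v2"; both require fugashi and unidic-lite for the MeCab step):

    from transformers import BertJapaneseTokenizer

    text = "青葉山で研究をしています。"

    # WordPiece model: MeCab word segmentation, then WordPiece subwords.
    wordpiece_tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")
    print(wordpiece_tokenizer.tokenize(text))

    # Character model: MeCab word segmentation, then each word is split into single characters.
    character_tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v2")
    print(character_tokenizer.tokenize(text))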

Training

The models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps. For the MLM (masked language modeling) objective, we introduced whole word masking, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
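
As an illustrative sketch (not the actual pretraining code), whole word masking can be pictured as follows: WordPiece pieces are grouped back into the words they came from, and each selected word has all of its pieces masked together.

    import random

    def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        # Group WordPiece continuation pieces (prefixed with "##") with the piece that starts the word.
        word_spans, span = [], []
        for i, token in enumerate(tokens):
            if token.startswith("##") and span:
                span.append(i)
            else:
                if span:
                    word_spans.append(span)
                span = [i]
        if span:
            word_spans.append(span)

        # Mask whole words: every piece of a selected word is replaced at once.
        masked = list(tokens)
        for word_span in word_spans:
            if random.random() < mask_prob:
                for i in word_span:
                    masked[i] = mask_token
        return masked

    print(whole_word_mask(["青葉", "##山", "で", "研究", "を", "し", "て", "い", "ます", "。"]))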

For the training of each model, we used a v3-8 instance of Cloud TPUs provided by the TensorFlow Research Cloud program. The training took about 5 days for the BERT-base models and about 14 days for the BERT-large models.

Creation of the pretraining data

$ WORK_DIR="$HOME/work/bert-japanese"

# WordPiece (unidic_lite)
$ mkdir -p $WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data
# It takes 3h and 420GB RAM, producing 43M instances
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 8 python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/jawiki-20200831/corpus_{}.txt \
--output_file $WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/jawiki-20200831/wordpiece_unidic_lite/vocab.txt \
--tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--max_seq_length 512 \
--max_predictions_per_seq 80 \
--dupe_factor 10

# Character
$ mkdir $WORK_DIR/bert/jawiki-20200831/character/pretraining_data
# It takes 4h10m and 615GB RAM, producing 55M instances
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 8 python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/jawiki-20200831/corpus_{}.txt \
--output_file $WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/jawiki-20200831/character/vocab.txt \
--tokenizer_type character \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--max_seq_length 512 \
--max_predictions_per_seq 80 \
--dupe_factor 10
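
As a quick sanity check on the generated shards, something like the following sketch can be used (assuming TensorFlow 2 is installed; the feature names listed in the comment are the standard ones written by BERT's create_pretraining_data format):

    import tensorflow as tf

    # Path to one of the generated gzip-compressed shards; adjust as needed.
    path = "pretraining_data_01.tfrecord.gz"
    dataset = tf.data.TFRecordDataset(path, compression_type="GZIP")

    # Parse and inspect the first serialized training instance.
    for raw_record in dataset.take(1):
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())
        print(sorted(example.features.feature.keys()))
        # Expected keys include input_ids, input_mask, segment_ids,
        # masked_lm_positions, masked_lm_ids, masked_lm_weights, next_sentence_labels.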

Training of the models

Note: all the necessary files need to be stored in a Google Cloud Storage (GCS) bucket.

# BERT-base, WordPiece (unidic_lite)
$ ctpu up -name tpu01 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://
   
    /bert-japanese
    "
   
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-base" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-base/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu01

# BERT-base, Character
$ ctpu up -name tpu02 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://
   
    /bert-japanese
    "
   
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/character/bert-base" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/character/bert-base/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu02

# BERT-large, WordPiece (unidic_lite)
$ ctpu up -name tpu03 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://
   
    /bert-japanese
    "
   
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-large" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/wordpiece_unidic_lite/bert-large/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=5e-5 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu03

# BERT-large, Character
$ ctpu up -name tpu04 -tpu-size v3-8 -tf-version 2.3
$ cd /usr/share/models
$ sudo pip3 install -r official/requirements.txt
$ tmux
$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
$ WORK_DIR="gs://
   
    /bert-japanese
    "
   
$ python3 official/nlp/bert/run_pretraining.py \
--input_files="$WORK_DIR/bert/jawiki-20200831/character/pretraining_data/pretraining_data_*.tfrecord" \
--model_dir="$WORK_DIR/bert/jawiki-20200831/character/bert-large" \
--bert_config_file="$WORK_DIR/bert/jawiki-20200831/character/bert-large/config.json" \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--train_batch_size=256 \
--learning_rate=5e-5 \
--num_train_epochs=100 \
--num_steps_per_epoch=10000 \
--optimizer_type=adamw \
--warmup_steps=10000 \
--distribution_strategy=tpu \
--tpu=tpu04

Licenses

The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license (CC BY-SA 3.0).

The code in this repository is distributed under the Apache License 2.0.

Related Work

Acknowledgments

The models are trained with Cloud TPUs provided by the TensorFlow Research Cloud program.

Comments
  • Fine tune

    Do you have a guide for fine-tuning bert-japanese?

    I tried to fine-tune it, and the result is not good. It seems like I did something wrong. Since GPU training is a bit expensive, I would like to get your opinion before fine-tuning again.

    Do I need to separate words using mecab-neologd? Do I need to do something to the tokenizer before fine-tuning?

    opened by nuwanq 9
  • About Pre-Training times

    Nice to meet you. I will be using the code in this repository to pre-train with a Cloud TPU (v3-8). I have only done 1,000 steps, and it was going to take me 4 days to run 1,000,000 steps. How many hours (days) did this pre-training take using a Cloud TPU (v3-8)?

    opened by sezai-rdc 5
  • Will tokenizer remove stopwords?

    I'm using the Hugging Face Japanese tokenizer 'cl-tohoku/bert-base-japanese-whole-word-masking'. Will it remove stopwords automatically in the tokenizer and model?

    opened by HeroadZ 4
  • Get the last output of the model 'cl-tohoku/bert-base-japanese-char-whole-word-masking'

    The tokenizer is good for Japanese, but I want to get the last output layer of the model above. I am following the instructions from Hugging Face:

        tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
        model = AutoModelWithLMHead.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
        input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids)
        last_hidden_states = outputs[0]

    Then I got len(outputs) = 1. The expected last_hidden_states shape is (batch, seq_len, d_model), but I got (batch, seq_len, vocab_size).

    How can I get the shape (batch, seq_len, d_model) from your model?

    opened by demdecuong 4
  • AutoTokenizer.from_pretrained doesn't work on newer models

    Thank you @singletongue for releasing new BERT models at Hugging Face, but their config.json does not include

      "tokenizer_class": "BertJapaneseTokenizer",
    

    thus Transformers' AutoTokenizer will use BertTokenizerFast. Please compare the new config.json with the old one, and please check the blog post here, written in Japanese.

    opened by KoichiYasuoka 3
  • BertJapaneseTokenizer can find 'cl-tohoku/bert-base-japanese-whole-word-masking' but BertModel cannot ('cl-tohoku/bert-base-japanese-whole-word-masking')

    During preprocessing, the following line has no problem.

        self.tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
    

    However, during training, I get the following error

    Model name 'cl-tohoku/bert-base-japanese-whole-word-masking' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc).

    from

    BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

    Any idea?

    In both cases, I installed pytorch-transformers with pip. Thanks in advance for your help.

    opened by wailoktam 3
  • Help on using the model for finetuning

    The part about tokenizing with MeCab is clear, but what about the sub-word tokenization? And what if there are words in the fine-tuning data that are not found in the pretraining data? Some guidance on using your pretrained model would be great.

    opened by wailoktam 3
  • Does tokenization.py need to be uploaded to GCP?

    Hi, I'm following your scripts to train a BERT model on my own datasets. I trained the tokenizer and created the pretraining data locally, and I am preparing to upload the tfrecord files to a Google Cloud Storage (GCS) bucket for training. Do I need to upload your tokenization.py to replace the one provided by the cloned google-bert repository when training the model? Thanks for your help.

    opened by lightercs 2
  • [Question] About the Char model

    Hi, thank you for sharing this project. I want to ask the reason for the MeCab tokenization in the Char model. Is there any difference between "directly split into characters" and "first MeCab tokenization and then split into characters"?

    opened by AprilSongRits 2
  • Please tell us how to quote your model for paper

    Thank you for your great work. I am writing an international workshop paper, and I would like to cite your pretrained Japanese model (which I used in it). If you have not decided how it should be cited yet, please check whether the sample below is OK: Suzuki Masatoshi (2019) Pretrained Japanese BERT models, GitHub repository, https://github.com/cl-tohoku/bert-japanese. Thank you for your help.

    opened by nakamolinto 2
  • Swap Mecab tokenizer with Sentencepiece : possible ?

    Hi @cl-tohoku, I wanted to get my hands dirty with your model to fine-tune a POS model. On your model card I wanted to test your model using the recently released Hosted Inference API from Hugging Face, when I got this error: ⚠️ This model could not be loaded by the inference API. ⚠️ Error loading tokenizer: No module named 'MeCab' ModuleNotFoundError("No module named 'MeCab'"). Correct me if I'm wrong, but wouldn't it be possible to swap out the MeCab-based tokenizer with SentencePiece using these pretrained weights?

    opened by sachaarbonel 2
  • strange tokenizer results with self-pretrained model

    Hi, I trained a new vocab and BERT model on my own datasets following your scripts, with the MeCab dictionary changed. But when I test it, quite strange results are returned every time. Would you please help me check on this and give me some advice?

    Details are below. My code:

    import torch
    from transformers import BertJapaneseTokenizer, BertForMaskedLM
    
    model_name_or_path = "/content/drive/MyDrive/bert/new_bert/" 
    tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path, mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"})
    
    model = BertForMaskedLM.from_pretrained(model_name_or_path)
    input_ids = tokenizer.encode(f"青葉山で{tokenizer.mask_token}の研究をしています。", return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
    
    masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0].tolist()
    print(masked_index)
    
    result = model(input_ids)
    pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
    for pred_id in pred_ids:
        output_ids = input_ids.tolist()[0]
        output_ids[masked_index] = pred_id
        print(tokenizer.decode(output_ids))
    

    the result:

    The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
    The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
    The class this function is called from is 'BertJapaneseTokenizer'.
    Some weights of the model checkpoint at /content/drive/MyDrive/bert/new_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
    - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    ['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
    4
    [CLS] 青葉山で ヒダ の研究をしています 。 [SEP]
    [CLS] 青葉山で 宿つ の研究をしています 。 [SEP]
    [CLS] 青葉山で 法外 の研究をしています 。 [SEP]
    [CLS] 青葉山で 頑丈 の研究をしています 。 [SEP]
    [CLS] 青葉山で弱 の研究をしています 。 [SEP]
    

    The tokenization result below is quite odd in the first place, and so are the prediction results.

    ['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
    

    But when I switch to your pretrained tokenizer bert-base-v2 (still using my model), the results change a lot.

    Some weights of the model checkpoint at /content/drive/MyDrive/kindai_bert/kindai_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
    - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    ['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
    4
    [CLS] 青葉 山 で 宮司 の 研究 を し て い ます 。 [SEP]
    [CLS] 青葉 山 で 飛翔 の 研究 を し て い ます 。 [SEP]
    [CLS] 青葉 山 で 旧来 の 研究 を し て い ます 。 [SEP]
    [CLS] 青葉 山 で 生野 の 研究 を し て い ます 。 [SEP]
    [CLS] 青葉 山 でד の 研究 を し て い ます 。 [SEP]
    

    My local bert folder looks like this: [image]

    Thank you in advance.

    opened by lightercs 12
  • 'Can't convert ['test.txt'] to Trainer' when training a BertWordPieceTokenizer

    Hi Team,

    I'm trying to train a Japanese BERT with my own data based on yours, and didn't modify the structure. But when I pass the training data path to train a tokenizer, it goes wrong every time; the error is "Can't convert ['test.txt'] to Trainer".

    Here's something I tried:

    1. Passing a single filename or the content of a single file (within the same folder as the train_tokenizer.py file), the error appears.
    2. Passing a list of filenames like ['data_file/0.txt', 'data_file/1.txt', 'data_file/2.txt', 'data_file/3.txt', 'data_file/4.txt'] or a list of single sentences, the same error also occurs.

    Can you give any advice on this situation? Thanks a lot.

    opened by suchunxie 2
  • SSL error

    Max retries exceeded with url: //home/my_username/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking//resolve/main/vocab.txt (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))

    opened by leoxu1007 0
  • The results seems different from hugging face...

    Thank you for the great model. I tried this model on our lab's experiment machine, but the results seem different from those on Hugging Face.

    I used this model: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking?text=%E3%83%AA%E3%83%B3%E3%82%B4%5BMASK%5D%E9%A3%9F%E3%81%B9%E3%82%8B%E3%80%82

    And I wrote: リンゴ[MASK]食べる。

    The model on the web gives:

        リンゴ を 食べる 。 0.870
        リンゴ も 食べる 。 0.108
        リンゴ は 食べる 。 0.009
        リンゴ のみ 食べる 。 0.005
        リンゴ とともに 食べる 。 0.001

    And I downloaded the model and ran it locally. The output is:

        ['リンゴ', '[MASK]', '食べる', '。']
        Some weights of the model checkpoint at /home/Xu_Zhenyu/JapaneseBERTModel/cl-tohoku/bert-base-japanese-whole-word-masking/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
        - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
        - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
        0 を 1 、 2 も 3 野菜 4 で

    The results [を も は のみ とともに] and [を 、 も 野菜 で] are different. Why?

    And I have another question: the web page shows scores like 0.870, 0.108, 0.009, etc. How can I get those numbers locally?

    Thank you for your time.

    opened by leoxu1007 0
  • Error when initializing from the transformers pipeline

    Hello,

    I get an error when trying to initialize models that rely on your tokenizer from the transformers package's pipeline. Here is code that yields the error as well as the traceback.

    from transformers import pipeline 
    
    sentiment_analyzer = pipeline(
        "sentiment-analysis", model="cl-tohoku/bert-base-japanese", tokenizer="cl-tohoku/bert-base-japanese")
    
    Traceback (most recent call last):
      File "<input>", line 3, in <module>
      File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\pipelines\__init__.py", line 377, in pipeline
        tokenizer = AutoTokenizer.from_pretrained(tokenizer, revision=revision, use_fast=use_fast)
      File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 391, in from_pretrained
        tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
      File "C:\Users\gagno\Anaconda3\envs\japanese_admin_scrape\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 294, in tokenizer_class_from_name
        if c.__name__ == class_name:
    AttributeError: 'NoneType' object has no attribute '__name__'
    
    opened by EtienneGagnon1 7
  • How to add new vocabulary to vocab.txt

    Hi Team,

    I want to add new domain-specific words to the tokenizer vocabulary so that I can do better word separation (wakachi-gaki) for words that are not in the default vocab.txt.

    Is this the correct way?

    1: Manually add words at the bottom of vocab.txt (from the last line).
    2: Initialize the tokenizer as below:

        tokenizer = BertJapaneseTokenizer.from_pretrained("{Directory Path to vocab.txt and config.json etc...}")

    Thanks,

    opened by kaoruoshita 2
Owner
Inui Laboratory, Tohoku University