# japanese-gpt2

This repository provides the code for training Japanese GPT-2 models. The code was used to produce [japanese-gpt2-medium](https://huggingface.co/rinna/japanese-gpt2-medium), released by rinna on the Hugging Face model hub.

Please open an issue (in English or Japanese) if you encounter any problem using the code or using our models via Hugging Face.
## Train a Japanese GPT-2 from scratch on your own machine

- Download the training corpus Japanese CC-100 and extract the `ja.txt` file.
- Move the `ja.txt` file, or modify `src/corpus/jp_cc100/config.py` so that `self.raw_data_dir` in the config file matches the filepath of `ja.txt` (see the sketch after this list).
- Split `ja.txt` into smaller files by running:

```
cd src/
python -m corpus.jp_cc100.split_to_small_files
```

- Train a medium-sized GPT-2 on 4 GPUs by running:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
```
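If you decide to edit the config rather than move the file, the change amounts to pointing `self.raw_data_dir` at your copy of `ja.txt`. The snippet below is only a sketch: apart from `self.raw_data_dir`, the class name and layout are assumptions, and the path is a placeholder.

```python
# Sketch of the relevant part of src/corpus/jp_cc100/config.py.
# Only the raw_data_dir attribute is taken from this README; the class name and
# anything else here is an assumption, and the path is a placeholder.
class Config:
    def __init__(self):
        # Point this at the extracted ja.txt (check the actual config to see
        # whether it expects the file path itself or its parent directory).
        self.raw_data_dir = "/path/to/cc100/ja.txt"
```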
## Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to `data/model/gpt2-medium-xxx.checkpoint`. Run the following command to use it for text completion on one GPU via nucleus sampling with `p=0.95` and `k=40`:

```
CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
```
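For reference, `--gen_type top` with `--top_k 40` and `--top_p 0.95` corresponds to the usual combined top-k/nucleus filtering: the logits are restricted to the 40 most likely tokens, then to the smallest prefix whose cumulative probability exceeds 0.95, and the next token is sampled from what remains. The repository's `task.pretrain.interact` may differ in details; the following PyTorch snippet is only an illustrative sketch of that filtering step.

```python
import torch
import torch.nn.functional as F

def top_k_top_p_filtering(logits, top_k=40, top_p=0.95):
    """Illustrative top-k / nucleus (top-p) filtering of a 1-D logits tensor."""
    if top_k > 0:
        # Mask everything below the k-th highest logit.
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")
    if top_p < 1.0:
        # Mask tokens outside the smallest set whose cumulative probability > top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        to_remove = cum_probs > top_p
        to_remove[1:] = to_remove[:-1].clone()  # always keep the most likely token
        to_remove[0] = False
        logits[sorted_idx[to_remove]] = float("-inf")
    return logits

# Example: sample one next-token id from filtered logits (vocab size is a placeholder).
logits = torch.randn(50257)
probs = F.softmax(top_k_top_p_filtering(logits), dim=-1)
next_token_id = torch.multinomial(probs, num_samples=1).item()
```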
## Prepare files for uploading to Hugging Face

- Create a Hugging Face account, create a model repo, and clone it to your local machine.
- Create model and config files from a checkpoint by running:

```
python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}
```

- Validate the created files by running:

```
python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}
```

- Add the files, commit, and push them to your Hugging Face repo.
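After the push, the model can be loaded straight from the Hub with the `transformers` library, in the same way as rinna's released japanese-gpt2-medium. The snippet below is only a sketch: `your-username/your-japanese-gpt2` is a placeholder repo name, and depending on the tokenizer files written by `checkpoint2huggingface` you may need a specific tokenizer class instead of `AutoTokenizer`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-japanese-gpt2"  # placeholder; use your own repo name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Complete a prompt with the same sampling settings as the interact script above.
inputs = tokenizer("こんにちは、", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```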
## Customize your training script

Check the available arguments by running:

```
python -m task.pretrain.train --help
```
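For example, the flags shown earlier can be adjusted per run; to train on a single GPU with logging disabled, something like the command below should work (only `--n_gpus`, `--save_model`, and `--enable_log` are taken from the commands above, so check the `--help` output for the full list and defaults):

```
CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.train --n_gpus 1 --save_model True --enable_log False
```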