# japanese-gpt2

This repository provides the code for training Japanese GPT-2 models. The code was used to produce [japanese-gpt2-medium](https://huggingface.co/rinna/japanese-gpt2-medium), released by rinna on the Hugging Face model hub.

Please open an issue (in English or Japanese) if you encounter any problem using the code or using our models via Hugging Face.
## Train a Japanese GPT-2 from scratch on your own machine

- Download the training corpus Japanese CC-100 and extract the `ja.txt` file.
- Move the `ja.txt` file, or modify `src/corpus/jp_cc100/config.py` so that `self.raw_data_dir` in the config file matches the filepath of `ja.txt` (see the sketch after this list).
- Split `ja.txt` into smaller files by running:

```
cd src/
python -m corpus.jp_cc100.split_to_small_files
```

- Train a medium-sized GPT-2 on 4 GPUs by running:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
```
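If you decide to edit the config rather than move the file, the change amounts to pointing `self.raw_data_dir` at your copy of `ja.txt`. The snippet below is only a sketch: apart from `self.raw_data_dir`, the class name and layout are assumptions, and the path is a placeholder.

```python
# Sketch of the relevant part of src/corpus/jp_cc100/config.py.
# Only the raw_data_dir attribute is taken from this README; the class name and
# anything else here is an assumption, and the path is a placeholder.
class Config:
    def __init__(self):
        # Point this at the extracted ja.txt (check the actual config to see
        # whether it expects the file path itself or its parent directory).
        self.raw_data_dir = "/path/to/cc100/ja.txt"
```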
## Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to `data/model/gpt2-medium-xxx.checkpoint`. Run the following command to use it for text completion on one GPU via nucleus sampling with `p=0.95` and `k=40`:

```
CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
```
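For reference, `--gen_type top` with `--top_k 40` and `--top_p 0.95` corresponds to the usual combined top-k/nucleus filtering: the logits are restricted to the 40 most likely tokens, then to the smallest prefix whose cumulative probability exceeds 0.95, and the next token is sampled from what remains. The repository's `task.pretrain.interact` may differ in details; the following PyTorch snippet is only an illustrative sketch of that filtering step.

```python
import torch
import torch.nn.functional as F

def top_k_top_p_filtering(logits, top_k=40, top_p=0.95):
    """Illustrative top-k / nucleus (top-p) filtering of a 1-D logits tensor."""
    if top_k > 0:
        # Mask everything below the k-th highest logit.
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")
    if top_p < 1.0:
        # Mask tokens outside the smallest set whose cumulative probability > top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        to_remove = cum_probs > top_p
        to_remove[1:] = to_remove[:-1].clone()  # always keep the most likely token
        to_remove[0] = False
        logits[sorted_idx[to_remove]] = float("-inf")
    return logits

# Example: sample one next-token id from filtered logits (vocab size is a placeholder).
logits = torch.randn(50257)
probs = F.softmax(top_k_top_p_filtering(logits), dim=-1)
next_token_id = torch.multinomial(probs, num_samples=1).item()
```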
## Prepare files for uploading to Hugging Face

- Create a Hugging Face account, create a model repo, and clone it to your local machine.
- Create model and config files from a checkpoint by running:

```
python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}
```

- Validate the created files by running:

```
python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}
```

- Add the files, commit, and push them to your Hugging Face repo.
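After the push, the model can be loaded straight from the Hub with the `transformers` library, in the same way as rinna's released japanese-gpt2-medium. The snippet below is only a sketch: `your-username/your-japanese-gpt2` is a placeholder repo name, and depending on the tokenizer files written by `checkpoint2huggingface` you may need a specific tokenizer class instead of `AutoTokenizer`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-japanese-gpt2"  # placeholder; use your own repo name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Complete a prompt with the same sampling settings as the interact script above.
inputs = tokenizer("こんにちは、", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```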
## Customize your training script

Check the available arguments by running:

```
python -m task.pretrain.train --help
```
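For example, the flags shown earlier can be adjusted per run; to train on a single GPU with logging disabled, something like the command below should work (only `--n_gpus`, `--save_model`, and `--enable_log` are taken from the commands above, so check the `--help` output for the full list and defaults):

```
CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.train --n_gpus 1 --save_model True --enable_log False
```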