Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

Overview

japanese-gpt2


This repository provides the code for training Japanese GPT-2 models. The code has been used to produce japanese-gpt2-medium, released by rinna on the HuggingFace model hub.


Please open an issue (in English or Japanese) if you encounter any problems using the code or using our models via Huggingface.


Train a Japanese GPT-2 from scratch on your own machine

  1. Download the Japanese CC-100 training corpus and extract the ja.txt file.

  2. Either move ja.txt to the directory pointed to by self.raw_data_dir in src/corpus/jp_cc100/config.py, or edit self.raw_data_dir in that config file to match the location of ja.txt (see the sketch after this list).

  3. Split ja.txt into smaller files by running:

cd src/
python -m corpus.jp_cc100.split_to_small_files

  4. Train a medium-sized GPT-2 on 4 GPUs by running:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
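
For step 2, a minimal sketch of what the relevant part of src/corpus/jp_cc100/config.py might look like is shown below. Only self.raw_data_dir is mentioned in this README; the other attribute names are assumptions modeled on the jp_wiki config quoted in the comments further down, so check them against the actual file.

# Hypothetical sketch of src/corpus/jp_cc100/config.py for step 2.
# Only raw_data_dir is referenced by this README; the remaining attributes
# are assumptions mirroring the jp_wiki config and may differ in practice.
class Config(object):
    def __init__(self):
        self.corpus_name = "jp_cc100"

        # Point this at the directory containing the extracted ja.txt,
        # or move ja.txt into this default location.
        self.raw_data_dir = "../data/jp_cc100/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/ja.txt"
        self.doc_data_dir = "../data/jp_cc100/doc_data"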

Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to data/model/gpt2-medium-xxx.checkpoint. Run the following command to use it to complete text on one GPU, using nucleus sampling with top_p=0.95 and top_k=40:

CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
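
If you would rather try the checkpoint rinna has already released instead of your own, the sketch below generates text with the same sampling settings via transformers. It assumes the rinna/japanese-gpt2-medium repo on the Huggingface hub and its T5Tokenizer-based setup (see the comments below); the prompt is arbitrary.

# Hedged example: sample from the released rinna/japanese-gpt2-medium model
# with the same settings as the interact command above (top_p=0.95, top_k=40).
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")

input_ids = tokenizer.encode("こんにちは、", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        max_length=100,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))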

Prepare files for uploading to Huggingface

  1. Make a Huggingface account, create a model repo, and clone it to your local machine.

  2. Create the model and config files from a checkpoint by running:

python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}

  3. Validate the created files by running:

python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}

  4. Add the files, commit, and push them to your Huggingface repo.
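
Independent of the check_huggingface script, a quick manual sanity check before pushing is to load the prepared directory with transformers. This is only a sketch; it assumes the repo follows the released rinna models' T5Tokenizer setup, and the directory path is a placeholder.

# Hedged sanity check: the prepared model repo directory should load cleanly.
from transformers import AutoModelForCausalLM, T5Tokenizer

repo_dir = "path/to/your/huggingface-model-repo"  # placeholder path
tokenizer = T5Tokenizer.from_pretrained(repo_dir)
model = AutoModelForCausalLM.from_pretrained(repo_dir)
print(model.config.n_positions, model.config.n_embd)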

Customize your training script

Check available arguments by running:

python -m task.pretrain.train --help

License

The MIT license

Comments
  • License Issues

    Hi, I'm @singletongue, a maintainer of the cl-tohoku/bert-japanese.

    Thank you for sharing your great work.

    However, I'm a little concerned that some parts of your code in src/corpus/build_pretrain_dataset.py may have been taken from our make_corpus_wiki.py. Since we release our code under the Apache License 2.0, it might be better if you adopted the same license rather than the MIT license.

    Thank you.

    opened by singletongue 6
  • rinna RoBERTa's max_length is 510 not 512?

    Hi, I have been using rinna RoBERTa for a while now and have a question. The max_length of rinna RoBERTa is 510 (not 512), right? Is this intended? If so, why did you use 510 instead of 512 for max_length?

    rinna RoBERTa's padding_idx is 3 (not 1), so I think the starting position of position_embeddings is padding_idx + 1 = 4, as in the issue linked below. However, the size of position_embeddings in rinna RoBERTa is (514, 768), and if I actually input text of length 512, I get an index error.

    • https://github.com/pytorch/fairseq/issues/1187
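
    The arithmetic behind these numbers can be illustrated with the short sketch below. It only restates the figures reported in this issue and is not an official answer from the maintainers.

    # Illustrative sketch: fairseq/RoBERTa-style learned position embeddings
    # reserve indices 0..padding_idx and start real positions at padding_idx + 1,
    # so a (514, 768) table with padding_idx = 3 leaves 514 - 4 = 510 usable
    # positions, matching the reported max_length of 510.
    num_position_embeddings = 514
    padding_idx = 3

    max_usable_length = num_position_embeddings - (padding_idx + 1)
    print(max_usable_length)  # 510 -> a 512-token input indexes past the table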
    opened by masayakondo 4
  • Tensor size does not match

    Description

    GPT-2 training fails with the error "RuntimeError: The size of tensor a (768) must match the size of tensor b (1024) at non-singleton dimension 3".

    I followed the steps of "Train japanese-gpt2-xsmall from scratch", except that n_gpus was set to 1 and mecab_dict_path was changed to the path of unidic-csj-3.0.1.1.

    What's wrong?

    Full output of python -m task.pretrain_gpt2.train:

    local rank: [0], global_rank: [0]
    Number of training files: 502
    Number of dev files: 1
    ----- Loading dev data -----
    {'n_docs': 10000, 'n_sents': 131762, 'n_tokens': 4241376}
    ----- Hyper-parameters -----
    balanced_corpora: None
    batch_size: 20
    check_loss_after_n_step: 100.0
    checkpoint_path: None
    corpora: ['jp_cc100', 'jp_wiki']
    enable_log: True
    eval_batch_size: 40
    filename_note: None
    init_lr: 0.0007
    l2_penalty: 0.01
    master_port: 12321
    max_grad_norm: 1.0
    max_seq_len: 1024
    model_config_filepath: model/gpt2-ja-xsmall-config.json
    model_size: xsmall
    n_accum_steps: 3
    n_epochs: 10
    n_gpus: 1
    n_nodes: 1
    n_train_files_per_group: 10
    n_training_steps: 1600000
    n_warmup_steps: 2000.0
    node_rank: 0
    resume_training: False
    save_model: True
    seed: 42
    small_data: False
    use_amp: True
    validate_after_n_step: 5000.0
    world_size: 1
    {'n_docs': 1367409, 'n_sents': 8632681, 'n_tokens': 288213354}
    Traceback (most recent call last):
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/var/tmp/hiroki/japanese-pretrained-models/src/task/pretrain_gpt2/train.py", line 580, in <module>
        train(0, config)
      File "/var/tmp/hiroki/japanese-pretrained-models/src/task/pretrain_gpt2/train.py", line 409, in train
        loss, ppl = forward_step(model, tokenizer, batch_data)
      File "/var/tmp/hiroki/japanese-pretrained-models/src/task/pretrain_gpt2/train.py", line 85, in forward_step
        gpt2_outputs = model(input_ids=input_ids, return_dict=True)
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 904, in forward
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 752, in forward
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 290, in forward
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 241, in forward
      File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 176, in _attn
    RuntimeError: The size of tensor a (768) must match the size of tensor b (1024) at non-singleton dimension 3
    

    Environment

    • python == 3.8.13
    • PyTorch == 1.12.1
    • transformers == 4.4.2
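
    A hedged reading of the traceback (an assumption, not a confirmed diagnosis): in transformers' GPT-2, the causal attention mask is sized by the n_ctx/n_positions value of the model config, so a 768-vs-1024 mismatch inside _attn would be consistent with max_seq_len (1024 in the hyper-parameters above) exceeding the context size declared in gpt2-ja-xsmall-config.json. A quick check along those lines:

    # Hypothetical check: compare max_seq_len against the context size
    # declared in the model config used for this run.
    import json

    with open("model/gpt2-ja-xsmall-config.json", encoding="utf-8") as f:
        model_config = json.load(f)

    max_seq_len = 1024  # value from the hyper-parameter dump above
    for key in ("n_ctx", "n_positions"):
        if key in model_config and model_config[key] < max_seq_len:
            print(f"{key}={model_config[key]} is smaller than max_seq_len={max_seq_len}")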

    opened by hirokisince1998 3
  • The load_docs_from_filepath method in src/task/pretrain_roberta/train.py just returns an empty list.

    The load_docs_from_filepath method in src/task/pretrain_roberta/train.py only returns an empty list. Is this intended behavior? Thank you.

    def load_docs_from_filepath(filepath, tokenizer):
        docs = []
        with open(filepath, encoding="utf-8") as f:
            doc = []
            for line in f:
                line = line.strip()
                if line == "":
                    if len(doc) > 0:
                        docs.append(doc)
                    doc = []
                else:
                    sent = line
                    tokens = tokenizer.tokenize(sent)
                    token_ids = tokenizer.convert_tokens_to_ids(tokens)
                    if len(token_ids) > 0:
                        doc.append(token_ids)
        return docs
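
    One possible explanation (a hedged sketch, not a confirmed diagnosis): as quoted, a doc is only flushed into docs when a blank separator line is seen, so an input file with no blank lines, or without a blank line after its final document, loses that content. A variant that also flushes the last document is sketched below.

    def load_docs_from_filepath(filepath, tokenizer):
        docs = []
        with open(filepath, encoding="utf-8") as f:
            doc = []
            for line in f:
                line = line.strip()
                if line == "":
                    if len(doc) > 0:
                        docs.append(doc)
                    doc = []
                else:
                    tokens = tokenizer.tokenize(line)
                    token_ids = tokenizer.convert_tokens_to_ids(tokens)
                    if len(token_ids) > 0:
                        doc.append(token_ids)
            # Flush the final document even without a trailing blank line.
            if len(doc) > 0:
                docs.append(doc)
        return docs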
    
    opened by HiroshigeAoki 2
  • Train japanese-gpt2-xsmall from scratch

    After the following command,

    python -m corpus.jp_wiki.build_pretrain_dataset

    the following command also seems to be necessary for training japanese-gpt2-xsmall from scratch.

    python -m corpus.jp_wiki.split_to_small_files

    If so, please update the usage.

    opened by jurader 1
  • Please update the data URL.

    I noticed that the Wikipedia dumps have been updated for all languages.

    As is (src/corpus/jp_wiki/config.py)

    class Config(object):
        def __init__(self):
            self.corpus_name = "jp_wiki"
    
            # Management
            self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20211025-cirrussearch-content.json.gz"
            self.raw_data_dir = "../data/jp_wiki/raw_data"
            self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
            self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
            self.doc_data_dir = "../data/jp_wiki/doc_data"
    
    

    To be (src/corpus/jp_wiki/config.py)

    class Config(object):
        def __init__(self):
            self.corpus_name = "jp_wiki"
    
            # Management
            self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20220228-cirrussearch-content.json.gz"
            self.raw_data_dir = "../data/jp_wiki/raw_data"
            self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
            self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
            self.doc_data_dir = "../data/jp_wiki/doc_data"
    
    
    opened by spider-man-tm 1
  • Please add "tokenizer_class" in "config.json"

    Please add tokenizer_class in config.json like

      "tokenizer_class": "T5Tokenizer",
    

    This enables the use of AutoTokenizer, as in

    tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-1b")
    

    instead of

    tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt-1b")
    

    (Other models can be changed in the same way.)

    Related to: https://github.com/cl-tohoku/bert-japanese/issues/24

    opened by shirayu 1
  • Japanese Wikipedia dump link has changed

    First of all, thanks for the great project!

    Currently, the Wikipedia link is fixed to https://dumps.wikimedia.org/other/cirrussearch/20210329/jawiki-20210329-cirrussearch-content.json.gz. However, it looks like the maintainers remove dumps as they become older. The latest version is https://dumps.wikimedia.org/other/cirrussearch/20211025/jawiki-20211025-cirrussearch-content.json.gz. It would be great if you could note this in the README.

    (I also found that the CC-100 link is broken now, but that is not your fault.)

    opened by kaisugi 1
  • Can I use `rinna/japanese-roberta-base` through `AutoTokenizer` ?

    Hi, thank you very much for publishing such a wonderful Japanese pre-trained model! I am very happy to use this model.

    I would like to load the pre-trained tokenizer via AutoTokenizer.from_pretrained, but I encountered the following error. Do you support loading the tokenizer this way?

    $ python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('rinna/japanese-roberta-base')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 423, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1709, in from_pretrained
        return cls._from_pretrained(
      File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1722, in _from_pretrained
        slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
      File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1781, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/models/roberta/tokenization_roberta.py", line 159, in __init__
        super().__init__(
      File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 179, in __init__
        with open(vocab_file, encoding="utf-8") as vocab_handle:
    TypeError: expected str, bytes or os.PathLike object, not NoneType
    

    Fortunately, AutoModel.from_pretrained can be run successfully (the warning message can be ignored this time).

    $ python -c "from transformers import AutoModel; AutoModel.from_pretrained('rinna/japanese-roberta-base')"
    Some weights of RobertaModel were not initialized from the model checkpoint at rinna/japanese-roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    

    The following is my system environment:

    • python 3.8.8
    • transformers 4.5.1

    I would appreciate any advice on how to load it this way. Thanks.
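
    A hedged workaround, in line with the tokenizer_class request elsewhere in these comments: load the slow sentencepiece-based tokenizer class directly instead of going through AutoTokenizer. This assumes the model card's T5Tokenizer setup for rinna/japanese-roberta-base.

    # Workaround sketch: bypass AutoTokenizer and use T5Tokenizer directly.
    from transformers import T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
    print(tokenizer.tokenize("こんにちは、世界"))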

    opened by shunk031 1
  • japanese-roberta-base/README.md typo

    Hello. I am enjoying researching with this model😊

    I found a typo.

    https://huggingface.co/rinna/japanese-roberta-base/blob/main/README.md

    typo?

    masked_idx = 6 # This is 5
    

    reason

    The index of [MASK] in tokens is 5.

    print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']
    

    Please check it out.
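
    A hypothetical way to sidestep hard-coded indices altogether (an illustration, not part of the original report):

    # Hypothetical illustration: derive the [MASK] position from the token
    # list instead of hard-coding it; for the tokens above this yields 5.
    tokens = ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']
    masked_idx = tokens.index("[MASK]")
    print(masked_idx)  # 5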

    opened by umi-tyaahan 1
  • docs: demo, experiments and live inference API on Tiyaro

    Hello Maintainer of Github repo rinnakk/japanese-pretrained-models!

    Thank you for your work on rinnakk/japanese-pretrained-models. This GitHub project is interesting, and we think that it would be a great addition to make this work instantly discoverable & available as an API for all your users, to quickly try and use it in their applications.

    The list of model cards covered by this PR is:

    • https://console.tiyaro.ai/explore/rinna-japanese-gpt2-medium
    • https://console.tiyaro.ai/explore/rinna-japanese-roberta-base

    On Tiyaro, every model in rinnakk/japanese-pretrained-models will get its own:

    • Dedicated model card (e.g. https://console.tiyaro.ai/explore/rinna-japanese-gpt2-medium)
    • Model demo (e.g. https://console.tiyaro.ai/explore/rinna-japanese-gpt2-medium/demo)
    • Unique Inference API (e.g. https://api.tiyaro.ai/explore/huggingface/1//rinna/japanese-gpt2-medium)
    • Sample code snippets and swagger spec for the API

    Users will also be able to compare your model with other models of similar types on various parameters using Tiyaro Experiments (https://tiyaro.ai/blog/ocr/)

    I am from Tiyaro.ai (https://tiyaro.ai/). We are working on enabling developers to instantly evaluate, use, and customize the world's best AI. We are constantly adding new features to Tiyaro EasyTrain, EasyServe & Experiments to make the best use of your ML model and to make AI more accessible to everyone.

    Sincerely, I-Jong Lin

    size/XS 
    opened by ijonglin 0
  • docs: demo, experiments and live inference API on Tiyaro

    Hello Maintainer of Github repo rinnakk/japanese-gpt2!

    Thank you for your work on rinnakk/japanese-gpt2. This GitHub project is interesting, and we think that it would be a great addition to make this work instantly discoverable & available as an API for all your users, to quickly try and use it in their applications.

    The list of model cards covered by this PR is:

    • https://console.tiyaro.ai/explore/rinna-japanese-gpt2-medium
    • https://console.tiyaro.ai/explore/rinna-japanese-roberta-base

    On Tiyaro, every model in rinnakk/japanese-gpt2 will get its own:

    • Dedicated model card (e.g. https://console.tiyaro.ai/explore/rinna-japanese-gpt2-medium)
    • Model demo (e.g. https://console.tiyaro.ai/explore/rinna-japanese-gpt2-medium/demo)
    • Unique Inference API (e.g. https://api.tiyaro.ai/explore/huggingface/1//rinna/japanese-gpt2-medium)
    • Sample code snippets and swagger spec for the API

    Users will also be able to compare your model with other models of similar types on various parameters using Tiyaro Experiments (https://tiyaro.ai/blog/ocr/)

    I am from Tiyaro.ai (https://tiyaro.ai/). We are working on enabling developers to instantly evaluate, use, and customize the world's best AI. We are constantly adding new features to Tiyaro EasyTrain, EasyServe & Experiments to make the best use of your ML model and to make AI more accessible to everyone.

    Sincerely, I-Jong Lin

    size/XS 
    opened by ijonglin 0
  • add tokenizer & model

    This PR adds tokenizers and modeling files so that your models work without special usage tips, except for the [MASK] problem.

    For rinna/japanese-roberta-base:

    Adding tokenization_roberta_japanese.py, modeling_roberta_japanese.py, and modeling_tf_roberta_japanese.py.

    The difference between T5Tokenizer and RobertaJapaneseTokenizer

    1. RobertaJapaneseTokenizer will add [CLS] automatically.
    2. ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with --add_dummy_prefix set to false. With --add_dummy_prefix set to true, extra whitespace tokens will appear. This is why A) Directly typing [MASK] in an input string and B) replacing a token with [MASK] after tokenization will yield different token sequences. Therefore, RobertaJapaneseTokenizer has a workaround for this problem.~~
    3. ~~Removed do_lower_case option. It is because do_lower_case option was not working in your pretraining code.~~
    4. Enabled do_lower_case option.

    The difference between RobertaModel and RobertaJapaneseModel

    1. position_ids start from 0, so it is no longer necessary to explicitly provide position_ids.

    For rinna/japanese-gpt2-*:

    Adding tokenization_gpt2_japanese.py.

    The difference between T5Tokenizer and GPT2JapaneseTokenizer

    1. ~~In languages without inter-word whitespaces, such as Japanese and Chinese, you should have trained SentencePiece with --add_dummy_prefix set to false. With --add_dummy_prefix set to true, extra whitespace tokens will appear. Therefore, GPT2JapaneseTokenizer has a workaround for this problem.~~
    2. ~~Removed do_lower_case option. It is because do_lower_case option was not working in your pretraining code.~~
    3. Enabled do_lower_case option.
    size/XXL 
    opened by azonti 7
Owner
rinna Co.,Ltd.
rinna株式会社
jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

izuna385 10 Jan 6, 2023
Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

japanese-ebook-analysis This aim of this project is to make analysing the contents of a japanese ebook easy and streamline the process for non-technic

Christoffer Aakre 14 Jul 23, 2022
Yomichad - a Japanese pop-up dictionary that can display readings and English definitions of Japanese words

Yomichad is a Japanese pop-up dictionary that can display readings and English definitions of Japanese words, kanji, and optionally named entities. It is similar to yomichan, 10ten, and rikaikun in spirit, but targets qutebrowser.

Jonas Belouadi 7 Nov 7, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 1, 2023
Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2. It is trained (finetuned) on a curated list of approximately 45K Python (~470MB) files gathered from the Github. Currently, it just works properly on Python but not bad at other languages (thanks to GPT-2's power).

Galois Autocompleter 91 Sep 23, 2022
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Megagon Labs 160 Dec 23, 2022
A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型,适用于英语、普通话/中文、日语、韩语、俄语和藏语(当前已测试)。

简体中文 | English 并行语音合成 [TOC] 新进展 2021/04/20 合并 wavegan 分支到 main 主分支,删除 wavegan 分支! 2021/04/13 创建 encoder 分支用于开发语音风格迁移模块! 2021/04/13 softdtw 分支 支持使用 Sof

Atomicoo 161 Dec 19, 2022
Japanese synonym library

chikkarpy chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar. chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

Works Applications 48 Dec 14, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

Matheus Alves 2 Jan 6, 2022
An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Koniwa project 32 Dec 14, 2022
Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage >>> from transformers import RemBertToken

Koichi Yasuoka 3 Dec 22, 2021
PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

null 5 Oct 29, 2022
aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

tanreinama 13 Aug 11, 2022
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

null 325 Jan 5, 2023
This repository has a implementations of data augmentation for NLP for Japanese.

daaja This repository has a implementations of data augmentation for NLP for Japanese: EDA: Easy Data Augmentation Techniques for Boosting Performance

Koga Kobayashi 60 Nov 11, 2022
A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex)

CodeJ A python project made to generate code using either OpenAI's codex or GPT-J (Although not as good as codex) Install requirements pip install -r

TheProtagonist 1 Dec 6, 2021