Code to pre-train Japanese T5 models

You might also like...
Auto translate textbox from Japanese to English or Indonesian

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesian How to use Install Python first, Anaconda is recommended Install

Script to download some free Japanese lessons in Portuguese from NHK

Nihongo_nhk This is a script to download some free Japanese lessons in Portuguese from NHK. It can be executed by installing the packages with: pip in

An open collection of annotated voices in the Japanese language

Koniwa (声庭): An open collection of annotated voices in the Japanese language Overview Koniwa (声庭) is an open collection of freely usable, modifiable, and redistributable audio and annota

Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace 🚤 provides useful tools for data analysis and auto-betting for boatrace.

aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLP is a Transformer model proposed by Liu, Dai et al. Roughly speaking, it can be used in place of BERT and performs better. For a detailed explanation, see articles such as this one. This

A Japanese tokenizer based on recurrent neural networks

Nagisa is a Python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

This repository has implementations of data augmentation for Japanese NLP.

daaja This repository has implementations of data augmentation for Japanese NLP: EDA: Easy Data Augmentation Techniques for Boosting Performance

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Comments
  • Restart of a preemptible TPU sometimes does not work

    Sometimes the training process (t5.models.mesh_transformer_main) running on a preemptible TPU freezes instead of finishing with an error exit code. This is an example of the log.

    I0902 03:33:36.516501 140070334211904 basic_session_run_hooks.py:260] loss = 1.109375, step = 488600 (45.410 sec)
    INFO:tensorflow:global_step/sec: 2.20221
    I0902 03:33:36.518121 140070334211904 tpu_estimator.py:2402] global_step/sec: 2.20221
    INFO:tensorflow:examples/sec: 140.942
    I0902 03:33:36.518576 140070334211904 tpu_estimator.py:2403] examples/sec: 140.942
    INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
    I0902 03:33:36.520152 140070334211904 tpu_estimator.py:616] Enqueue next (100) batch(es) of data to infeed.
    INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
    I0902 03:33:36.520488 140070334211904 tpu_estimator.py:620] Dequeue next (100) batch(es) of data from outfeed.
    INFO:tensorflow:Outfeed finished for iteration (1862, 53)
    I0902 03:34:01.018308 140066416998144 tpu_estimator.py:289] Outfeed finished for iteration (1862, 53)
    INFO:tensorflow:ShutdownHook: lame workers found: HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0)
    I0902 03:34:21.925864 140070334211904 session_support.py:391] ShutdownHook: lame workers found: HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0)
    INFO:tensorflow:ShutdownHook: saving checkpoint to gs://somewhere/model.ckpt
    I0902 03:34:21.941661 140070334211904 session_support.py:394] ShutdownHook: saving checkpoint to gs://somewhere/model.ckpt
    INFO:tensorflow:No save on shutdown when there are user-defined CheckpointSaverHooks
    I0902 03:34:21.942317 140070334211904 tpu_estimator.py:2370] No save on shutdown when there are user-defined CheckpointSaverHooks
    INFO:tensorflow:Shutting down HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0).
    I0902 03:34:21.942646 140070334211904 session_support.py:150] Shutting down HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0).
    INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: SHUTDOWN_AFTER_TIMEOUT
    watchdog_config {
      timeout_ms: 60000
    }
    exit_code {
      exit_code: 42
    }
    
    I0902 03:34:21.943512 140070334211904 session_support.py:104] Configuring worker heartbeat: shutdown_mode: SHUTDOWN_AFTER_TIMEOUT
    watchdog_config {
      timeout_ms: 60000
    }
    exit_code {
      exit_code: 42
    }
    
    INFO:tensorflow:Waiting 70.00 seconds for worker shutdown.
    I0902 03:34:21.945668 140070334211904 session_support.py:159] Waiting 70.00 seconds for worker shutdown.
    INFO:tensorflow:Resetting coordinator.
    I0902 03:35:32.017142 140070334211904 session_support.py:423] Resetting coordinator.
    INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Resetting session loop due to worker shutdown.
    I0902 03:35:32.020745 140070334211904 monitored_session.py:1286] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Resetting session loop due to worker shutdown.
    
    opened by shirayu 1
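A common way to cope with a training job that hangs after preemption, rather than exiting with an error code, is to run it under a small supervisor that restarts it when it exits non-zero or stops producing output. The sketch below is not from this repository; `run_with_stall_watchdog` and its parameters are hypothetical names, and the stall timeout is an assumed value you would tune to your logging interval.

```python
import subprocess
import sys
import threading
import time

def run_with_stall_watchdog(cmd, stall_timeout=600.0, max_restarts=5):
    """Run `cmd`; restart it when it exits non-zero (e.g. after a TPU
    preemption) or produces no output for `stall_timeout` seconds."""
    exit_code = 1
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        last_output = time.monotonic()

        def pump():
            # Forward each log line and record when it arrived.
            nonlocal last_output
            for line in proc.stdout:
                sys.stdout.write(line)
                last_output = time.monotonic()

        reader = threading.Thread(target=pump, daemon=True)
        reader.start()
        while proc.poll() is None:
            time.sleep(1.0)
            if time.monotonic() - last_output > stall_timeout:
                proc.kill()  # process froze: kill it so the loop can restart
                break
        reader.join(timeout=5.0)
        exit_code = proc.wait()
        if exit_code == 0:
            return 0
    return exit_code
```

For example, `run_with_stall_watchdog(["python", "-m", "t5.models.mesh_transformer_main", ...])` would relaunch the trainer after a freeze like the one in the log above, resuming from the latest checkpoint on the next start.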
Analyse Japanese ebooks using MeCab to determine the difficulty level for Japanese learners

japanese-ebook-analysis The aim of this project is to make analysing the contents of a Japanese ebook easy and to streamline the process for non-technic

Christoffer Aakre 14 Jul 23, 2022
Yomichad - a Japanese pop-up dictionary that can display readings and English definitions of Japanese words

Yomichad is a Japanese pop-up dictionary that can display readings and English definitions of Japanese words, kanji, and optionally named entities. It is similar to yomichan, 10ten, and rikaikun in spirit, but targets qutebrowser.

Jonas Belouadi 7 Nov 7, 2022
Code for evaluating Japanese pretrained models provided by NTT Ltd.

japanese-dialog-transformers (Japanese description available here) This repository provides the information necessary to evaluate the Japanese Transformer Encoder-decoder dialo

NTT Communication Science Laboratories 216 Dec 22, 2022
An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

GPT-NeoX An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hun

EleutherAI 3.1k Jan 8, 2023
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pre-train for IR. Pre-train on a raw textual corpus, fine-tune on MS MARCO Document Ranking. For use in the NLPIR lab, Pre-training

ZYMa 12 Apr 7, 2022
Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

japanese-gpt2 This repository provides the code for training Japanese GPT-2 models. This code has been used for producing japanese-gpt2-medium release

rinna Co.,Ltd. 491 Jan 7, 2023
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Megagon Labs 160 Dec 23, 2022
A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far).

简体中文 | English Parallel speech synthesis [TOC] News 2021/04/20 Merged the wavegan branch into main and deleted the wavegan branch! 2021/04/13 Created the encoder branch for developing the voice style transfer module! 2021/04/13 The softdtw branch supports using Sof

Atomicoo 161 Dec 19, 2022
Japanese synonym library

chikkarpy chikkarpy is a Python version of chikkar. It is a library that uses the Sudachi synonym dictionary and was developed to add synonym expansion to the output of SudachiPy.

Works Applications 48 Dec 14, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiba-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022