Japanese NLP Library


1   Requirements

1.1   Links

  • All code at jProcessing Repo GitHub
  • PyPI Python Package

git clone [email protected]:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha (ISO-8859-1 configured) in Python
    • Longest Common String between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha with its original EUC-JP or ISO-8859-1 configured encoding, from UTF-8 Python strings

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私</tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">は</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼</tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">を</tok>
 </chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5</tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日</tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前</tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、</tok>
 </chunk>
 ...
</sentence>

2.3   Kanji / Katakana / Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart to build the kana-to-romaji table.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

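The chart-based lookup can be sketched as follows. The table below is a hypothetical toy subset standing in for the full syllabary parsed from data/katakanaChart.txt (readings follow the kunrei-style output above, e.g. 'si'); the real converter also handles small-kana combinations, which this sketch ignores.

```python
# -*- coding: utf-8 -*-
# Toy subset of a katakana-to-romaji chart; the real table is parsed
# from data/katakanaChart.txt and covers the whole syllabary.
KATAKANA_CHART = {
    u'ワ': 'wa', u'タ': 'ta', u'シ': 'si', u'カ': 'ka', u'レ': 're',
    u'ガ': 'ga', u'ゴ': 'go', u'ニ': 'ni', u'チ': 'ti', u'マ': 'ma',
}

def katakana_to_romaji(katakana, chart=KATAKANA_CHART):
    """Transliterate a katakana string character by character,
    leaving characters that are not in the chart untouched."""
    return ''.join(chart.get(ch, ch) for ch in katakana)

print(katakana_to_romaji(u'ワタシ'))  # watasi
```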

2.4   Longest Common String Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'There was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず
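A minimal dynamic-programming sketch of such a longest-common-substring function (an illustration of the technique, not jProcessing's exact code):

```python
def long_substr_sketch(a, b):
    """Longest common contiguous substring of a and b,
    via the classic O(len(a) * len(b)) dynamic program."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1          # extend the match diagonally
                if cur[j] > best:
                    best, best_end = cur[j], i    # remember the longest so far
        prev = cur
    return a[best_end - best:best_end]

print(long_substr_sketch(u'これでアナタも冷え知らず', u'これでア冷え知らずナタも'))  # 冷え知らず
```

Because the comparison is per character, the same function works unchanged on unicode Japanese strings.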

2.5   Similarity between two sentences jProcessing.py

Uses MinHash to estimate the overlap between the two token sets (see http://en.wikipedia.org/wiki/MinHash).

English Strings:

>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a, b)
...0.444444444444

Japanese Strings:

>>> from jNlp.jProcessing import *
>>> s = Similarities()
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789
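The idea can be sketched as follows: hash every token under k different seeds, and count the fraction of seeds for which the minimum hash of the two sets agrees; that fraction estimates the Jaccard overlap. This is an illustrative sketch, not jProcessing's exact implementation:

```python
import hashlib

def minhash_similarity(a, b, num_hashes=128):
    """Estimate the Jaccard similarity of two whitespace-tokenized
    strings with MinHash (illustrative sketch)."""
    set_a, set_b = set(a.split()), set(b.split())
    matches = 0
    for seed in range(num_hashes):
        # seeded hash: the minimum over a set is equal for both sets
        # exactly when the overall minimum lies in the intersection
        key = lambda tok, s=seed: hashlib.md5(
            (u'%d:%s' % (s, tok)).encode('utf-8')).hexdigest()
        if min(set_a, key=key) == min(set_b, key=key):
            matches += 1
    return matches / float(num_hashes)
```

With more hash functions the estimate concentrates around the true Jaccard index; identical inputs score 1.0 and disjoint inputs score 0.0.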

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Output Demo

3.2   Edict dictionary and example sentences parser

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

Edict parser by Paul Goins; see edict_search.py. Edict example-sentence search by query, by Pulkit Kathuria; see edict_examples.py. Pickled edict example files are provided, but the latest example files can be downloaded from the links below.

3.3   Charset

Two files

  • utf8 Charset example file if not using src/jNlp/data/edict_examples

    To convert EUCJP/ISO-8859-1 to utf8

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
    
  • ISO-8859-1 edict_dictionary file
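If iconv is unavailable, the same re-encoding can be done with Python's codecs module. The paths below are placeholders, mirroring the iconv command above:

```python
import codecs

def convert_to_utf8(src_path, dst_path, src_encoding='euc_jp'):
    """Re-encode a dictionary file (e.g. EUC-JP) to UTF-8,
    equivalent to: iconv -f EUCJP -t UTF-8 src > dst."""
    with codecs.open(src_path, 'r', src_encoding) as src:
        data = src.read()
    with codecs.open(dst_path, 'w', 'utf-8') as dst:
        dst.write(data)
```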

Outputs example sentences for a query in Japanese only for ambiguous words.

3.4   Links

Latest Dictionary files can be downloaded here

3.5   edict_search.py

Author: Paul Goins (license included).

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')

3.6   edict_examples.py

Note: example sentences are output only for ambiguous words (words with more than one sense).
Author: Pulkit Kathuria
>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers Sentiment Analysis on Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese WordNet, file wnjpn-all.tab), and SentiWordNet (English SentiWordNet, file SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is a baseline: it performs a simple mapping from English to Japanese using WordNet and classifies on the polarity scores from SentiWordNet.

  • All parts of speech (adnouns, nouns, verbs, ...) are included
  • No WSD module is applied to the Japanese sentence
  • Each word's most common sense is used for its polarity score
>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0
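The two lookups above can be combined into a tiny baseline scorer. The dictionaries below are hypothetical toy stand-ins for the mappings that train() builds from SentiWordNet_3.*.txt and wnjpn-all.tab, and the averaging is a sketch of the approach rather than the library's exact scoring:

```python
# -*- coding: utf-8 -*-
# Hypothetical toy lexicons standing in for the trained mappings.
sentiwordnet = {'all': (0.625, 0.0), 'best': (0.75, 0.0)}  # synset -> (pos, neg)
jpwordnet = {u'全部': 'all', u'最高': 'best'}               # JP word -> synset

def polarity(tokens):
    """Average the pos/neg scores of every token found in the lexicon;
    tokens with no entry are skipped (no WSD, common sense only)."""
    pos = neg = hits = 0.0
    for tok in tokens:
        synset = jpwordnet.get(tok)
        if synset in sentiwordnet:
            p, n = sentiwordnet[synset]
            pos, neg, hits = pos + p, neg + n, hits + 1
    return (pos / hits, neg / hits) if hits else (0.0, 0.0)

print(polarity([u'全部', u'最高']))  # (0.6875, 0.0)
```

A sentence is then called positive or negative by comparing the two averages, as in the baseline output above.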

5   Contacts

Author: pulkit[at]jaist.ac.jp [change at with @]
Comments
  • Cannot run sentiment analysis example (classifier)

    Hi everyone,

    I am trying to run your sentiment analysis demo and I am facing a cElementTree.ParseError. I am running on OSX 10.11 with Python 2.7. I downloaded the wordnet files (as of today: SentiWordNet_3.0.0_20130122.txt, with the current wn: 2010-10-22). I ran your example as you presented:

    >>> from jNlp.jSentiments import *
    >>> jp_wn = 'path_to/wnjpn-all.tab'
    >>> en_swn = 'path_to/SentiWordNet_3.0.0_20130122.txt'
    >>> classifier = Sentiment()
    >>> classifier.train(en_swn, jp_wn)
    >>> text = u'監督、俳優、ストーリー、演出、全部最高!'
    >>> print classifier.baseline(text)
    

    and obtain the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jSentiments.py", line 55, in baseline
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jSentiments.py", line 48, in polarScores_text
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
      File "<string>", line 124, in XML
    cElementTree.ParseError: not well-formed (invalid token): line 1, column 2
    

    However, the polarity score example works fine, and I obtain the right scores! If you have any idea, I'd be grateful for your help!

    Best,

    opened by renoust 6
  • jReads does not exist

    Hey there,

    I compiled and installed all dependencies and now wanna run some of the examples presented here.

    >>> from jNlp.jConvert import *
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jConvert.py", line 4, in <module>
        from jNlp.jTokenize import jTokenize, jReads
    ImportError: cannot import name jReads
    

    I tried replacing the jReads with the jTokenize method but I didn't expect that to work :)

    I found and old implementation that I took and changed to using cabocha().

    def jReads(target_sent):
        sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
        jReadsToks = []
        for chunk in sentence:
            for tok in chunk.findall('tok'):
                if tok.get("read"): jReadsToks.append(tok.get("read"))
        return jReadsToks
    

    However, I don't seem to be getting a valid XML:

    >>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
    >>> tokenizedRomaji(input_sentence)
    iconv_open is not supported
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jConvert.py", line 42, in tokenizedRomaji
        for kataChunk in jReads(jSent):
      File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jTokenize.py", line 29, in jReads
        sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
      File "<string>", line 124, in XML
    ParseError: not well-formed (invalid token): line 1, column 4

    I compiled and installed iconv but is this related to the problem?

    Also, I verified my installation of mecab and cabocha and both seem to work fine.

    But jReads really does not exist xP

    opened by npx 6
  • OSError: [Errno 2]  for subprocess

    When trying the samples from the help file with Python 2.7 on a MacBook Air:

    from jNlp.jConvert import *
    input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
    print ' '.join(tokenizedRomaji(input_sentence))
    print tokenizedRomaji(input_sentence)

    The result was:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jConvert.py", line 42, in tokenizedRomaji
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 42, in jReads
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jCabocha.py", line 24, in cabocha
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
        errread, errwrite)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
        raise child_exception
    OSError: [Errno 2] No such file or directory

    opened by deter3 4
  • Question about Classical Japanese

    Hello,

    Thank you for your wonderful library.

    I run an open source project, the Classical Language Toolkit, which helps researchers do NLP in ancient and classical languages.

    One of our contributors found your software and is interested in porting some of it for our users.

    But because I do not know Japanese, I am interested to learn whether jProcessing is suitable for old Japanese texts (say, up until the year AD 1600).

    Thanks again for sharing your software with the world. Feel free to be in touch with me directly at [email protected] if you prefer!

    opened by kylepjohnson 2
  • Installing Cabocha

    Is there an easy way to install Cabocha ?

    Their page doesn't include MacOs instructions, also they have some CRF++, MeCab etc which makes the process discouraging ....

    http://taku910.github.io/cabocha/

    opened by c0ze 1
  • Execution Error for classifier.baseline function

    Dear Kevincobain,

    First I am very much thankful to you for posting the step by step process to classify the Japan Sentiments. I tried to replicate the same as you have done.

    I used below code of yours

    from jNlp.jSentiments import *
    jp_wn = '../../../../data/wnjpn-all.tab'
    en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
    classifier = Sentiment()
    classifier.train(en_swn, jp_wn)
    text = u'監督、俳優、ストーリー、演出、全部最高!'

    Until above statement everything worked fine. But when I tried to use below statement

    print classifier.baseline(text)

    I got the error below:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.linux-i686/egg/jNlp/jSentiments.py", line 55, in baseline
      File "build/bdist.linux-i686/egg/jNlp/jSentiments.py", line 48, in polarScores_text
      File "build/bdist.linux-i686/egg/jNlp/jTokenize.py", line 30, in jTokenize
      File "<string>", line 124, in XML
    cElementTree.ParseError: not well-formed (invalid token): line 1, column 9

    Please help me in clearing the issue. Please tell what am I doing wrong.

    But when I classify the word sentiments I am able to do it properly.

    Please help me in clearing this issue

    opened by NaveenSrikanth 1
  • Python rtcclient 0.6.0 issue: ERROR client.RTCClient: not well-formed (invalid token): line 17, column 77

    Hello Everyone,

    I'm trying to use Python rtcclient 0.6.0 and following the example code to get work items from RTC 6.0.3. The url ends with ccm and the RTC server doesn't have proxies. The url is like https://rtc-ccm-1.int.xxxx.com:9443/ccm/web/projects/xxxx (note: some characters are substituted with xxxx for confidential reason). Here are part of sample code I use:

    from rtcclient.utils import setup_basic_logging
    from rtcclient import RTCClient

    setup_basic_logging()
    url = "https://rtc-ccm-1.int.xxxx.com:9443/ccm/web/projects/xxxx"
    username = "myusername"
    password = "mypassword"

    myclient = RTCClient(url, username, password, ends_with_jazz=False)
    print myclient
    wk = myclient.getWorkitem("1631898")

    Here are execution results (note: some characters are substituted with xxxx for confidential reason): results.txt

    The script seems to connect the server without issue (no error in debug log). The print command prints out "RTC Server at https://rtc-ccm-1.int.xxxx.com:9443/ccm/web/projects/xxxx" and the work item "1631898" does exist. Don't know why it still throws out "Not found <Workitem 1631898>" error.

    If you have any idea, I would be grateful for your help!

    opened by kevinhe2017 0
  • Unsafe sentence tokenizer in sentiment analysis

    Hi!

    In jSentiments.py, in polarScores_text(), you are processing each sentence by:

    for sent in text.split(u'。'):
       etc.
    

    This part actually crashes when you have an empty sentence coming in, that we can protect using:

    for sent in text.split(u'。'):
        if len(sent.strip()) == 0:
             continue
        etc.
    
    opened by renoust 0
  •  FileNotFoundError: [Errno 2] No such file or directory: 'cabocha': 'cabocha' while using jTokenize

    Hi all, I tried installing the package in a Python 3 environment. I got errors pointing to 'print' statements and other syntax that differs in Python 3. I made those changes and installed successfully. Now when I try to use jTokenize as shown in the example, I run into the following error.

    FileNotFoundError: [Errno 2] No such file or directory: 'cabocha': 'cabocha'

    The code snippet I tried:

    input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
    list_of_tokens = jTokenize(input_sentence)

    It would be helpful if anyone can help me fix it.

    Thank You in advance

    opened by jiterika 4
  • UnicodeDecodeError with classifier.baseline()

    This is a similar but different issue to another one posted here.

    $ python jnlp-test-sentencePolarityScore.py
    Traceback (most recent call last):
      File "jnlp-test-sentencePolarityScore.py", line 9, in <module>
        print classifier.baseline(text)
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jSentiments.py", line 56, in baseline
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jSentiments.py", line 49, in polarScores_text
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jCabocha.py", line 27, in cabocha
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 105: invalid start byte
    

    At first, I also had this same error with classifier.train(), but once I ran - ./configure --with-charset=utf8 for the mecab dictionary and for cabocha, the error disappeared.

    However, with classifier.baseline() the error remains. Is there another part of the toolchain that I need to configure for utf-8? Am I missing something really basic?

    Thanks!

    opened by jcneshi 2
  • Dealing with some issues and found a broken link

    1.2.2 Cabocha jCabocha.py

    Run Cabocha with original EUCJP or IS0-8859-1 configured encoding, with utf8 python

    If cabocha is configured as utf8 then see this http://nltk.googlecode.com/svn/trunk/doc/book-jp/ch12.html#cabocha

    I'm using Win7, Python 2.7, 32 bits. In fact, I want to convert some Japanese strings to romaji (for legacy purposes). I've installed the newest MeCab (chose EUC-JP at install) and CaboCha (chose UTF-8 at install; it only offers Shift-JIS and UTF-8). Of course the encodings don't match and it cannot work (because your tools need "euc_jp"). The link above seems to address the issue, but it is dead.

    Furthermore, I replaced UTF-8 with EUC-JP in CaboCha's config file and it seems to recompile successfully. Then I tried to run your examples; something does work (the external cabocha.exe is called without error) but nothing is output, only: [ ]

    opened by MXS2514 0