Japanese NLP Library


1   Requirements

1.1   Links

  • All code at jProcessing Repo GitHub
  • PyPI Python Package

git clone [email protected]:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha (ISO-8859-1 configured) in Python
    • Longest Common String between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha with its original EUC-JP or ISO-8859-1 configured encoding, from UTF-8 Python strings

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私</tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">は</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼</tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">を</tok>
 </chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5</tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日</tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前</tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、</tok>
 </chunk>
 ...
</sentence>

2.3   Kanji / Katakana / Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart to build the kana-to-romaji table.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

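The chart-based lookup can be sketched as follows. The table below is a hypothetical toy subset standing in for the full syllabary parsed from data/katakanaChart.txt (readings follow the kunrei-style output above, e.g. 'si'); the real converter also handles small-kana combinations, which this sketch ignores.

```python
# -*- coding: utf-8 -*-
# Toy subset of a katakana-to-romaji chart; the real table is parsed
# from data/katakanaChart.txt and covers the whole syllabary.
KATAKANA_CHART = {
    u'ワ': 'wa', u'タ': 'ta', u'シ': 'si', u'カ': 'ka', u'レ': 're',
    u'ガ': 'ga', u'ゴ': 'go', u'ニ': 'ni', u'チ': 'ti', u'マ': 'ma',
}

def katakana_to_romaji(katakana, chart=KATAKANA_CHART):
    """Transliterate a katakana string character by character,
    leaving characters that are not in the chart untouched."""
    return ''.join(chart.get(ch, ch) for ch in katakana)

print(katakana_to_romaji(u'ワタシ'))  # watasi
```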

2.4   Longest Common String Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'There was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず
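A minimal dynamic-programming sketch of such a longest-common-substring function (an illustration of the technique, not jProcessing's exact code):

```python
def long_substr_sketch(a, b):
    """Longest common contiguous substring of a and b,
    via the classic O(len(a) * len(b)) dynamic program."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1          # extend the match diagonally
                if cur[j] > best:
                    best, best_end = cur[j], i    # remember the longest so far
        prev = cur
    return a[best_end - best:best_end]

print(long_substr_sketch(u'これでアナタも冷え知らず', u'これでア冷え知らずナタも'))  # 冷え知らず
```

Because the comparison is per character, the same function works unchanged on unicode Japanese strings.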

2.5   Similarity between two sentences jProcessing.py

Uses MinHash to estimate the overlap between the two token sets (see http://en.wikipedia.org/wiki/MinHash).

English Strings:

>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a, b)
...0.444444444444

Japanese Strings:

>>> from jNlp.jProcessing import *
>>> s = Similarities()
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789
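The idea can be sketched as follows: hash every token under k different seeds, and count the fraction of seeds for which the minimum hash of the two sets agrees; that fraction estimates the Jaccard overlap. This is an illustrative sketch, not jProcessing's exact implementation:

```python
import hashlib

def minhash_similarity(a, b, num_hashes=128):
    """Estimate the Jaccard similarity of two whitespace-tokenized
    strings with MinHash (illustrative sketch)."""
    set_a, set_b = set(a.split()), set(b.split())
    matches = 0
    for seed in range(num_hashes):
        # seeded hash: the minimum over a set is equal for both sets
        # exactly when the overall minimum lies in the intersection
        key = lambda tok, s=seed: hashlib.md5(
            (u'%d:%s' % (s, tok)).encode('utf-8')).hexdigest()
        if min(set_a, key=key) == min(set_b, key=key):
            matches += 1
    return matches / float(num_hashes)
```

With more hash functions the estimate concentrates around the true Jaccard index; identical inputs score 1.0 and disjoint inputs score 0.0.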

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Output Demo

3.2   Edict dictionary and example sentences parser

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

Edict parser by Paul Goins; see edict_search.py. Edict example-sentence search by query, by Pulkit Kathuria; see edict_examples.py. Pickled edict example files are provided, but the latest example files can be downloaded from the links below.

3.3   Charset

Two files

  • utf8 Charset example file if not using src/jNlp/data/edict_examples

    To convert EUCJP/ISO-8859-1 to utf8

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
    
  • ISO-8859-1 edict_dictionary file
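If iconv is unavailable, the same re-encoding can be done with Python's codecs module. The paths below are placeholders, mirroring the iconv command above:

```python
import codecs

def convert_to_utf8(src_path, dst_path, src_encoding='euc_jp'):
    """Re-encode a dictionary file (e.g. EUC-JP) to UTF-8,
    equivalent to: iconv -f EUCJP -t UTF-8 src > dst."""
    with codecs.open(src_path, 'r', src_encoding) as src:
        data = src.read()
    with codecs.open(dst_path, 'w', 'utf-8') as dst:
        dst.write(data)
```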

Outputs example sentences for a query in Japanese only for ambiguous words.

3.4   Links

Latest Dictionary files can be downloaded here

3.5   edict_search.py

Author: Paul Goins (license included).

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')

3.6   edict_examples.py

Note: example sentences are output only for ambiguous words (words with more than one sense).
Author: Pulkit Kathuria
>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers Sentiment Analysis on Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese WordNet, file wnjpn-all.tab), and SentiWordNet (English SentiWordNet, file SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is a baseline: it performs a simple mapping from English to Japanese using WordNet and classifies on the polarity scores from SentiWordNet.

  • All parts of speech (adnouns, nouns, verbs, ...) are included
  • No WSD module is applied to the Japanese sentence
  • Each word's most common sense is used for its polarity score
>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0
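The two lookups above can be combined into a tiny baseline scorer. The dictionaries below are hypothetical toy stand-ins for the mappings that train() builds from SentiWordNet_3.*.txt and wnjpn-all.tab, and the averaging is a sketch of the approach rather than the library's exact scoring:

```python
# -*- coding: utf-8 -*-
# Hypothetical toy lexicons standing in for the trained mappings.
sentiwordnet = {'all': (0.625, 0.0), 'best': (0.75, 0.0)}  # synset -> (pos, neg)
jpwordnet = {u'全部': 'all', u'最高': 'best'}               # JP word -> synset

def polarity(tokens):
    """Average the pos/neg scores of every token found in the lexicon;
    tokens with no entry are skipped (no WSD, common sense only)."""
    pos = neg = hits = 0.0
    for tok in tokens:
        synset = jpwordnet.get(tok)
        if synset in sentiwordnet:
            p, n = sentiwordnet[synset]
            pos, neg, hits = pos + p, neg + n, hits + 1
    return (pos / hits, neg / hits) if hits else (0.0, 0.0)

print(polarity([u'全部', u'最高']))  # (0.6875, 0.0)
```

A sentence is then called positive or negative by comparing the two averages, as in the baseline output above.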

5   Contacts

Author: pulkit[at]jaist.ac.jp [change at with @]
Comments
  • Cannot run sentiment analysis example (classifier)

    Hi everyone,

    I am trying to run your sentiment analysis demo and I am facing a cElementTree.ParseError. I am running on OSX 10.11 with Python 2.7. I downloaded the wordnet files (as of today: SentiWordNet_3.0.0_20130122.txt, with the current wn: 2010-10-22). I ran your example as you presented:

    >>> from jNlp.jSentiments import *
    >>> jp_wn = 'path_to/wnjpn-all.tab'
    >>> en_swn = 'path_to/SentiWordNet_3.0.0_20130122.txt'
    >>> classifier = Sentiment()
    >>> classifier.train(en_swn, jp_wn)
    >>> text = u'監督、俳優、ストーリー、演出、全部最高!'
    >>> print classifier.baseline(text)
    

    and obtain the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jSentiments.py", line 55, in baseline
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jSentiments.py", line 48, in polarScores_text
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
      File "<string>", line 124, in XML
    cElementTree.ParseError: not well-formed (invalid token): line 1, column 2
    

    However, the polarity score example works fine, and I obtain the right scores! If you have any idea, I'd be grateful for your help!

    Best,

    opened by renoust 6
  • jReads does not exist

    Hey there,

    I compiled and installed all dependencies and now wanna run some of the examples presented here.

    >>> from jNlp.jConvert import *
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jConvert.py", line 4, in <module>
        from jNlp.jTokenize import jTokenize, jReads
    ImportError: cannot import name jReads
    

    I tried replacing the jReads with the jTokenize method but I didn't expect that to work :)

    I found and old implementation that I took and changed to using cabocha().

    def jReads(target_sent):
        sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
        jReadsToks = []
        for chunk in sentence:
            for tok in chunk.findall('tok'):
                if tok.get("read"): jReadsToks.append(tok.get("read"))
        return jReadsToks
    

    However, I don't seem to be getting a valid XML:

    >>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
    >>> tokenizedRomaji(input_sentence)
    iconv_open is not supported
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jConvert.py", line 42, in tokenizedRomaji
        for kataChunk in jReads(jSent):
      File "/Users/xxx/VirtualEnvs/venv-python2.7-django/lib/python2.7/site-packages/jProcessing-0.1-py2.7.egg/jNlp/jTokenize.py", line 29, in jReads
        sentence = etree.fromstring(cabocha(target_sent).encode('utf-8'))
      File "<string>", line 124, in XML
    ParseError: not well-formed (invalid token): line 1, column 4

    I compiled and installed iconv but is this related to the problem?

    Also, I verified my installation of mecab and cabocha and both seem to work fine.

    But jReads really does not exist xP

    opened by npx 6
  • OSError: [Errno 2]  for subprocess

    When trying the samples from the help file with Python 2.7 on a MacBook Air:

    from jNlp.jConvert import *
    input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
    print ' '.join(tokenizedRomaji(input_sentence))
    print tokenizedRomaji(input_sentence)

    The result was:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jConvert.py", line 42, in tokenizedRomaji
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jTokenize.py", line 42, in jReads
      File "build/bdist.macosx-10.11-intel/egg/jNlp/jCabocha.py", line 24, in cabocha
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
        errread, errwrite)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
        raise child_exception
    OSError: [Errno 2] No such file or directory

    opened by deter3 4
  • Question about Classical Japanese

    Hello,

    Thank you for your wonderful library.

    I run an open source project, the Classical Language Toolkit, which helps researchers do NLP in ancient and classical languages.

    One of our contributors found your software and is interested in porting some of it for our users.

    But because I do not know Japanese, I am interested to learn whether jProcessing is suitable for old Japanese texts (say, up until the year AD 1600).

    Thanks again for sharing your software with the world. Feel free to be in touch with me directly at [email protected] if you prefer!

    opened by kylepjohnson 2
  • Installing Cabocha

    Is there an easy way to install Cabocha ?

    Their page doesn't include MacOs instructions, also they have some CRF++, MeCab etc which makes the process discouraging ....

    http://taku910.github.io/cabocha/

    opened by c0ze 1
  • Execution Error for classifier.baseline function

    Dear Kevincobain,

    First I am very much thankful to you for posting the step by step process to classify the Japan Sentiments. I tried to replicate the same as you have done.

    I used below code of yours

    from jNlp.jSentiments import *
    jp_wn = '../../../../data/wnjpn-all.tab'
    en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
    classifier = Sentiment()
    classifier.train(en_swn, jp_wn)
    text = u'監督、俳優、ストーリー、演出、全部最高!'

    Until above statement everything worked fine. But when I tried to use below statement

    print classifier.baseline(text)

    I got the error below:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.linux-i686/egg/jNlp/jSentiments.py", line 55, in baseline
      File "build/bdist.linux-i686/egg/jNlp/jSentiments.py", line 48, in polarScores_text
      File "build/bdist.linux-i686/egg/jNlp/jTokenize.py", line 30, in jTokenize
      File "<string>", line 124, in XML
    cElementTree.ParseError: not well-formed (invalid token): line 1, column 9

    Please help me in clearing the issue. Please tell what am I doing wrong.

    But when I classify the word sentiments I am able to do it properly.

    Please help me in clearing this issue

    opened by NaveenSrikanth 1
  • Python rtcclient 0.6.0 issue: ERROR client.RTCClient: not well-formed (invalid token): line 17, column 77

    Hello Everyone,

    I'm trying to use Python rtcclient 0.6.0 and following the example code to get work items from RTC 6.0.3. The url ends with ccm and the RTC server doesn't have proxies. The url is like https://rtc-ccm-1.int.xxxx.com:9443/ccm/web/projects/xxxx (note: some characters are substituted with xxxx for confidential reason). Here are part of sample code I use:

    from rtcclient.utils import setup_basic_logging
    from rtcclient import RTCClient

    setup_basic_logging()
    url = "https://rtc-ccm-1.int.xxxx.com:9443/ccm/web/projects/xxxx"
    username = "myusername"
    password = "mypassword"

    myclient = RTCClient(url, username, password, ends_with_jazz=False)
    print myclient
    wk = myclient.getWorkitem("1631898")

    Here are execution results (note: some characters are substituted with xxxx for confidential reason): results.txt

    The script seems to connect the server without issue (no error in debug log). The print command prints out "RTC Server at https://rtc-ccm-1.int.xxxx.com:9443/ccm/web/projects/xxxx" and the work item "1631898" does exist. Don't know why it still throws out "Not found <Workitem 1631898>" error.

    If you have any idea, I would be grateful for your help!

    opened by kevinhe2017 0
  • Unsafe sentence tokenizer in sentiment analysis

    Hi!

    In jSentiments.py, in polarScores_text(), you are processing each sentence by:

    for sent in text.split(u'。'):
       etc.
    

    This part actually crashes when you have an empty sentence coming in, that we can protect using:

    for sent in text.split(u'。'):
        if len(sent.strip()) == 0:
             continue
        etc.
    
    opened by renoust 0
  •  FileNotFoundError: [Errno 2] No such file or directory: 'cabocha': 'cabocha' while using jTokenize

    Hi all, I tried installing the package in a Python 3 environment. I got errors pointing to 'print' statements and other syntax that differs in Python 3. I made those changes and installed successfully. Now when I try to use jTokenize as shown in the example, I run into the following error.

    FileNotFoundError: [Errno 2] No such file or directory: 'cabocha': 'cabocha'

    The code snippet I tried:

    input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
    list_of_tokens = jTokenize(input_sentence)

    It would be helpful if anyone can help me fix it.

    Thank You in advance

    opened by jiterika 4
  • UnicodeDecodeError with classifier.baseline()

    This is a similar but different issue to another one posted here.

    $ python jnlp-test-sentencePolarityScore.py
    Traceback (most recent call last):
      File "jnlp-test-sentencePolarityScore.py", line 9, in <module>
        print classifier.baseline(text)
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jSentiments.py", line 56, in baseline
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jSentiments.py", line 49, in polarScores_text
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jTokenize.py", line 30, in jTokenize
      File "build/bdist.macosx-10.13-intel/egg/jNlp/jCabocha.py", line 27, in cabocha
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 105: invalid start byte
    

    At first, I also had this same error with classifier.train(), but once I ran - ./configure --with-charset=utf8 for the mecab dictionary and for cabocha, the error disappeared.

    However, with classifier.baseline() the error remains. Is there another part of the toolchain that I need to configure for utf-8? Am I missing something really basic?

    Thanks!

    opened by jcneshi 2
  • Dealing with some issues and found a broken link

    1.2.2 Cabocha jCabocha.py

    Run Cabocha with original EUCJP or IS0-8859-1 configured encoding, with utf8 python

    If cabocha is configured as utf8 then see this http://nltk.googlecode.com/svn/trunk/doc/book-jp/ch12.html#cabocha

    I'm using Win7, Python 2.7, 32 bits. In fact, I want to convert some Japanese strings to romaji (for legacy purposes). I've installed the newest MeCab (chose EUC-JP at install) and CaboCha (chose UTF-8 at install; it only offers Shift-JIS and UTF-8). Of course the encodings don't match and it cannot work (because your tools need "euc_jp"). The link above seems to address the issue, but it is dead.

    Furthermore, I replaced UTF-8 with EUC-JP in CaboCha's config file and it seems to recompile successfully. Then I tried to run your examples; something does work (the external cabocha.exe is called without error) but nothing is output, only: [ ]

    opened by MXS2514 0