Analyse Japanese ebooks using MeCab to determine the difficulty level for Japanese learners

Christoffer Aakre

Last update: Jul 23, 2022

Overview

japanese-ebook-analysis

The aim of this project is to make analysing the contents of a Japanese ebook easy and to streamline the process for non-technical users. You can analyse an ebook and see the following information:

The length of the book in words
The length of the book in characters
The number of unique words used in the book
The number of unique words that are only used once in the book
The percentage of unique words that are only used once
The number of unique characters used
The number of unique characters that are only used once
The percentage of unique characters that are only used once
A list of all the words used in the book as well as how often they are used
A list of all the characters used in the book as well as how often they are used
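
As a rough illustration of how the figures above relate to each other, here is a minimal sketch that computes them from a list of already-tokenised words. The function name summarise and the sample tokens are hypothetical and not taken from the repository; in the real project the words come from MeCab.

```python
from collections import Counter

def summarise(items):
    """Compute length, unique count, used-once count and frequencies.

    `items` is a list of words or a list of characters (hypothetical helper,
    not the project's own function).
    """
    counts = Counter(items)
    unique = len(counts)
    used_once = sum(1 for n in counts.values() if n == 1)
    return {
        "length": len(items),
        "unique": unique,
        "used_once": used_once,
        # Percentage of unique items that occur exactly once
        "used_once_percent": 100 * used_once / unique if unique else 0.0,
        # (item, count) pairs, most frequent first
        "frequencies": counts.most_common(),
    }

# Sample tokens standing in for MeCab output
words = ["猫", "が", "好き", "。", "犬", "が", "好き", "。"]
word_stats = summarise(words)
char_stats = summarise(list("".join(words)))
```

The same helper is applied to words and to characters, since both sets of statistics are just counts over a sequence.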

For text processing, we use MeCab.

Usage

Currently, the project is not deployed anywhere, so to use the service, you will need to follow the steps below in the development section to get the server running.

Upload a .epub file containing Japanese text to the server
The server will redirect you to a page showing you information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of all the words used together with the number of occurrences of each word, and the same for the characters.

Development

Clone repository: git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
Make sure you have MeCab set up on your system. See http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up.
(This is only required if you will actually upload ebooks or run the analyse_epub.py script, which you will not need to do to contribute to other parts of the app.)
Install python dependencies: pip install -r requirements.txt
Install other dependencies (these all need to be in your system path):
- pandoc
Run ./app.py to start the Flask dev server

Contributing

I'm very happy to receive any contributions! Before contributing, please have a look at CONTRIBUTING.md.

To see what needs work, have a look at the repo's Issues and its Pull requests.

Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.

Comments

Show frequency distribution histogram and frequency metrics

As of https://github.com/christofferaakre/japanese-ebook-analysis/pull/10, we now have access to frequency information from several different frequency lists, as well as an overall frequency that takes into account all the frequency lists. We can use this information to show histograms of the frequency distribution, and then we can also show some metrics regarding the frequency. However, I am not sure which metrics best sum up the overall frequency distribution.
enhancement help wanted good first issue

opened by christofferaakre 10
Rewrite of mecab parsing

I created a class to store all of the information we get from a MeCab parse and changed the parsing method to parse() to avoid an error caused by random bytes at the beginning of "surface".

opened by vdrummer 1
Show frequency rating for words

Use frequency lists to display the frequency rating of a word (1 being most common, 10000 being 10000th most common) in addition to the number of occurrences in the book. Another good idea is to use frequency lists for several different domains (e.g. Slice of Life, Shounen anime, novels, etc.).

Credit to the mods at r/learnjapanese for the suggestion
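
A minimal sketch of the idea, assuming a frequency list is simply a sequence of words ordered from most to least common. The actual list format used by the project may differ, and build_rank_lookup is a hypothetical name.

```python
def build_rank_lookup(ordered_words):
    """Map each word to its 1-based rank (1 = most common).

    Assumes `ordered_words` is sorted from most to least common;
    if a word appears twice, the first (best) rank is kept.
    """
    ranks = {}
    for position, word in enumerate(ordered_words, start=1):
        ranks.setdefault(word, position)
    return ranks

# Made-up frequency list for one domain, e.g. novels
novel_list = ["の", "に", "は", "猫", "犬"]
ranks = build_rank_lookup(novel_list)

# Occurrence counts from the analysed book (made-up values)
book_counts = {"猫": 12, "竜": 3}

# Pair each word's count with its rank; None means "not in the list"
report = {word: (count, ranks.get(word)) for word, count in book_counts.items()}
```

Supporting several domains would then amount to keeping one such lookup per frequency list.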
enhancement help wanted good first issue

opened by christofferaakre 1
Support for known words analysis using wordlists

We can now calculate how many of the words in the book the user knows given a word-list. Currently, the word-list path is hardcoded to be 'word-list.txt' in the root directory; see the analyse_known_words function defined in analysis.py. This information has also been added to the books.html page to display it to the user. An example word-list can be found in data/jlpt-word-list.txt.
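
The calculation can be sketched roughly as follows. This is not the repository's analyse_known_words function; known_word_coverage and load_word_list are hypothetical helpers, and the sketch assumes the word-list holds one word per line.

```python
from collections import Counter

def load_word_list(text):
    """Parse a word-list, assuming one word per line (an assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def known_word_coverage(book_words, known_words):
    """Return the share of the book's words that appear in the word-list."""
    counts = Counter(book_words)
    known = set(known_words)
    known_unique = sum(1 for word in counts if word in known)
    known_total = sum(n for word, n in counts.items() if word in known)
    total = sum(counts.values())
    return {
        # Share of distinct words the user knows
        "known_unique_percent": 100 * known_unique / len(counts) if counts else 0.0,
        # Share of running text covered by known words
        "known_total_percent": 100 * known_total / total if total else 0.0,
    }

known = load_word_list("猫\nが\n")
coverage = known_word_coverage(["猫", "が", "好き", "猫"], known)
```

Reporting both percentages may be useful, because a reader can know most of the running text while still missing many of the distinct words.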

opened by christofferaakre 0
Add support for frequency lists

This pull request adds support for frequency lists. Frequency lists are put in the frequency-lists folder, and must have the same format as the ones that are currently there. Then, when the user uploads an ebook, we find the frequency of every word in the book according to each frequency list, and we also compute an overall frequency that takes into account all of them. Details about this can be found in the get_overall_frequency function defined in frequency_lists.js
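
One possible way to combine several lists is sketched below. It is only an illustration: it assumes each list has already been turned into a word-to-rank lookup and takes the mean of the ranks from the lists that contain the word. The pull request does not say how get_overall_frequency combines the lists, so it may well work differently.

```python
from statistics import mean

def overall_rank(word, rank_lists):
    """Average a word's rank over every list that contains it.

    `rank_lists` maps a list name to a {word: rank} dict (assumed shape).
    Returns None when no list contains the word.
    """
    ranks = [lookup[word] for lookup in rank_lists.values() if word in lookup]
    return mean(ranks) if ranks else None

# Made-up ranks for two hypothetical frequency lists
rank_lists = {
    "novels": {"猫": 400, "竜": 900},
    "anime": {"猫": 600},
}

cat_rank = overall_rank("猫", rank_lists)
dragon_rank = overall_rank("竜", rank_lists)
bird_rank = overall_rank("鳥", rank_lists)
```

Ignoring lists that lack the word keeps a rare word from being penalised by a list that never saw it, though other choices (for example assigning a large default rank) are equally defensible.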

opened by christofferaakre 0
Categorize words by JLPT level

Using the IDs from jmdict entries found in tagainijisho (CSV files) and jmdict, it is possible to create a mapping from words to JLPT levels. This allows us to show the distribution of words by their JLPT level.

I've already done such a mapping for German words in the jmdict, so I could provide a file with mappings or a script to create the mappings with.

opened by vdrummer 0
Analyse sentences as well as words
Currently, we are only analysing individual words. If we also break the book up into sentences, we get access to some useful metrics like average sentence length etc. Two options seem feasible to me:

1) Reconstruct the sentences by stringing together individual words until we hit a sentence-ending character like 。

2) Maybe MeCab has a feature that lets you break text up into sentences rather than words

I think option 1) should be sufficient, as I can't really think of too many edge cases.
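
A minimal sketch of the first option, assuming the tokens arrive as a flat list of surface strings. The set of sentence-ending characters beyond 。 is an assumption, and group_into_sentences is a hypothetical name.

```python
# Characters treated as sentence endings; only 。 comes from the issue text,
# the others are an assumption.
SENTENCE_ENDINGS = {"。", "！", "？"}

def group_into_sentences(tokens):
    """Group a flat token list into sentences (lists of tokens)."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_ENDINGS:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that never reached an ending character
        sentences.append(current)
    return sentences

tokens = ["猫", "が", "好き", "。", "犬", "も", "好き", "です", "。"]
sentences = group_into_sentences(tokens)
average_length = sum(len(s) for s in sentences) / len(sentences)
```

Here the average length is counted in tokens, including the ending punctuation. Quoted dialogue inside 「」 is one edge case this sketch does not handle specially.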
enhancement help wanted good first issue
opened by christofferaakre 0
Fix furigana removal for .txt files

Currently, we use furigana4epub to remove furigana from .epub files, but we don't remove furigana from .txt files. I have been unable to find a suitable library/tool to do this, so I tried to implement something myself: https://github.com/christofferaakre/japanese-ebook-analysis/commit/90707b1da313a5b95a3caf3bc9e2c0402c8399d1 Unfortunately, it doesn't seem to quite work.
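
One possible approach is sketched below, under the assumption that the .txt files mark furigana in Aozora Bunko style, with the reading in 《》 after the base text and an optional ｜ marking where the base text starts. This is not the code from the linked commit, and strip_aozora_furigana is a hypothetical name. Files that use other conventions would need different handling.

```python
import re

# Reading enclosed in 《...》 directly after the base text (assumed notation)
RUBY_READING = re.compile(r"《[^》]*》")
# Fullwidth vertical bar that marks where the base text begins
RUBY_START = "｜"

def strip_aozora_furigana(text):
    """Remove 《reading》 annotations and the ｜ start markers."""
    without_readings = RUBY_READING.sub("", text)
    return without_readings.replace(RUBY_START, "")

cleaned = strip_aozora_furigana("｜東京《とうきょう》に行《い》く")
```

A known limitation is that any 《》 used as ordinary brackets in the text would also be removed.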
bug enhancement help wanted good first issue

opened by christofferaakre 0
Deploy the server somewhere on the web

Deploy the server somewhere on the web so that the user doesn't need to clone the repository, install dependencies, and then start the server themselves.
enhancement help wanted good first issue

opened by christofferaakre 0
Make the app look nice with CSS/JS/whatever
Currently, the service looks quite bad, and could look a lot better with some polish on the css/js/etc. The relevant files to look at are:

templates/books.html

templates/header.html

templates/upload_file.html

static/css/style.css

static/css/books.css

static/css/upload_file.css

enhancement help wanted good first issue
opened by christofferaakre 0

