Analyse Japanese ebooks using MeCab to determine their difficulty level for Japanese learners

Overview

japanese-ebook-analysis

The aim of this project is to make analysing the contents of a Japanese ebook easy and to streamline the process for non-technical users. You can analyse an ebook and see the following information:

  • The length of the book in words
  • The length of the book in characters
  • The number of unique words used in the book
  • The number of unique words that are only used once in the book
  • The percentage of unique words that are only used once
  • The number of unique characters used
  • The number of unique characters that are only used once
  • The percentage of unique characters that are only used once
  • A list of all the words used in the book as well as how often they are used
  • A list of all the characters used in the book as well as how often they are used

For text processing, we use MeCab.
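
The statistics listed above can be computed from MeCab's token stream with little more than a `Counter`. A minimal sketch (the token list is assumed to come from a MeCab parse; the field names are illustrative, not the project's actual output keys):

```python
from collections import Counter

def book_stats(tokens):
    """Compute the statistics listed above from a list of word tokens.
    In the real app the tokens would come from a MeCab parse."""
    text = "".join(tokens)
    words = Counter(tokens)
    chars = Counter(text)
    words_once = sum(1 for n in words.values() if n == 1)
    chars_once = sum(1 for n in chars.values() if n == 1)
    return {
        "length_in_words": len(tokens),
        "length_in_characters": len(text),
        "unique_words": len(words),
        "unique_words_used_once": words_once,
        "pct_words_used_once": 100 * words_once / len(words),
        "unique_characters": len(chars),
        "unique_characters_used_once": chars_once,
        "pct_characters_used_once": 100 * chars_once / len(chars),
    }
```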

Usage

Currently, the project is not deployed anywhere, so to use the service you will need to follow the steps in the Development section below to get the server running.

  1. Upload a .epub file containing Japanese text to the server
  2. The server will redirect you to a page showing information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of every word used in the book together with the number of occurrences of each, and the same for characters.

Development

  1. Clone repository: git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
  2. Make sure you have MeCab set up on your system. See http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up.
    (This is only required if you will actually upload ebooks or run the analyse_epub.py script; you will not need it to contribute to other parts of the app.)
  3. Install python dependencies: pip install -r requirements.txt
  4. Install other dependencies (these all need to be in your system path):
    • pandoc
  5. Run ./app.py to start the Flask dev server

Contributing

I'm very happy to receive any contributions! Before contributing, please have a look at CONTRIBUTING.md.

To see what needs work, have a look at the repo's Issues and its Pull requests.

Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.

Comments
  • Show frequency distribution histogram and frequency metrics

    As of https://github.com/christofferaakre/japanese-ebook-analysis/pull/10, we now have access to frequency information from several different frequency lists, as well as an overall frequency that takes all of the lists into account. We can use this information to show histograms of the frequency distribution, along with some summary metrics. However, I am not sure which metrics best sum up the overall frequency distribution.

    enhancement help wanted good first issue 
    opened by christofferaakre 10
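
A few candidate summary metrics for the distribution described above, sketched in Python. These are suggestions only, not metrics the project has settled on:

```python
import statistics

def frequency_summary(ranks):
    """Candidate summary metrics for a list of frequency ranks
    (lower rank = more common word)."""
    ranks = sorted(ranks)
    return {
        "median_rank": statistics.median(ranks),
        # rank below which 90% of the words fall
        "p90_rank": ranks[int(0.9 * (len(ranks) - 1))],
        "mean_rank": statistics.fmean(ranks),
    }
```
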
  • Rewrite of mecab parsing

    I created a class to store all of the information we get from a MeCab parse and changed the parsing method to parse() in order to avoid an error with random bytes at the beginning of "surface".

    opened by vdrummer 1
  • Show frequency rating for words

    Use frequency lists to display the frequency rating of a word (1 being the most common, 10000 being the 10000th most common) in addition to the number of occurrences in the book. Another good idea is to use frequency lists for several different domains (e.g. Slice of Life, Shounen anime, novels, etc.)

    Credit to mods at r/learnjapanese for suggestion

    enhancement help wanted good first issue 
    opened by christofferaakre 1
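
Looking up a frequency rating could be as simple as mapping each word in a list to its 1-based position. A sketch, assuming a plain one-word-per-line format with the most common word first (the real lists in the frequency-lists folder may use a different format):

```python
def build_frequency_ranks(lines):
    """Map each word to its 1-based rank, given the lines of a
    frequency list ordered from most to least common."""
    words = (line.strip() for line in lines)
    return {w: rank for rank, w in enumerate((w for w in words if w), start=1)}

def frequency_rating(word, ranks):
    """Return the word's rank, or None if it is not in the list."""
    return ranks.get(word)
```
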
  • Support for known-words analysis using word lists

    We can now calculate how many of the words in the book the user knows, given a word list. Currently, the word-list path is hardcoded to 'word-list.txt' in the root directory; see the analyse_known_words function defined in analysis.py. This information has also been added to the books.html page to display it to the user. An example word list can be found in data/jlpt-word-list.txt.

    opened by christofferaakre 0
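
The coverage calculation described above might look something like the following. This is an illustrative sketch only; the real implementation is the analyse_known_words function in analysis.py:

```python
def known_word_coverage(book_words, known_words):
    """Fraction of the book's unique words that appear in the
    user's word list."""
    unique = set(book_words)
    return len(unique & set(known_words)) / len(unique)
```
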
  • Add support for frequency lists

    This pull request adds support for frequency lists. Frequency lists are put in the frequency-lists folder and must have the same format as the ones that are currently there. Then, when the user uploads an ebook, we find the frequency of every word in the book according to each frequency list, and we also compute an overall frequency that takes all of them into account. Details can be found in the get_overall_frequency function defined in frequency_lists.js.

    opened by christofferaakre 0
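
One simple way to combine per-list ranks into an overall frequency is to average the ranks from the lists in which the word appears. This is only an illustrative formula, not the one the project uses; the actual computation lives in get_overall_frequency in frequency_lists.js:

```python
def overall_frequency(word, rank_lists):
    """Average a word's rank across the frequency lists that contain it.
    rank_lists is a list of word -> rank dicts; returns None if the word
    appears in no list."""
    ranks = [r[word] for r in rank_lists if word in r]
    return sum(ranks) / len(ranks) if ranks else None
```
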
  • Categorize words by JLPT level

    Using the IDs from jmdict entries found in tagainijisho (CSV files) and jmdict, it is possible to create a mapping from words to JLPT levels. This allows us to show the distribution of words by their JLPT level.

    I've already done such a mapping for German words in the jmdict, so I could provide a file with mappings or a script to create the mappings with.

    opened by vdrummer 0
  • Analyse sentences as well as words

    Currently, we are only analysing individual words. If we also break the book up into sentences, we get access to some useful metrics like average sentence length etc. Two options seem feasible to me:

    1. Reconstruct the sentences by stringing together individual words until we hit a sentence-ending character like 。
    2. Maybe mecab has a thing that lets you break text up into sentences rather than words

    I think option 1) should be sufficient, as I can't really think of too many edge cases.

    enhancement help wanted good first issue 
    opened by christofferaakre 0
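
Option 1 above can be sketched by stringing tokens together until a sentence-ending character is hit (the set of enders here is an assumption; more may be needed for real texts):

```python
SENTENCE_ENDERS = set("。！？")

def split_sentences(tokens):
    """Reconstruct sentences from a token list by joining tokens
    until a sentence-ending character appears."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok and tok[-1] in SENTENCE_ENDERS:
            sentences.append("".join(current))
            current = []
    if current:  # trailing text with no sentence ender
        sentences.append("".join(current))
    return sentences
```
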
  • Fix furigana removal for .txt files

    Currently, we use furigana4epub to remove furigana from .epub files, but we don't remove furigana from .txt files. I have been unable to find a suitable library/tool to do this, so I tried to implement something myself: https://github.com/christofferaakre/japanese-ebook-analysis/commit/90707b1da313a5b95a3caf3bc9e2c0402c8399d1 Unfortunately, it doesn't quite seem to work.

    bug enhancement help wanted good first issue 
    opened by christofferaakre 0
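
For .txt files that use Aozora-bunko-style ruby notation, furigana is conventionally written as 漢字《かんじ》, with an optional ｜ marking where the base text begins. A sketch that strips that notation (this assumes Aozora conventions; other .txt formats would need their own rules):

```python
import re

def strip_furigana(text):
    """Remove Aozora-style furigana: drop 《reading》 spans and the
    ｜ markers that delimit the base text."""
    text = re.sub(r"《[^》]*》", "", text)  # drop the reading
    return text.replace("｜", "")           # drop base-text markers
```
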
  • Deploy the server somewhere on the web

    Deploy the server somewhere on the web so that the user doesn't need to clone the repository, install dependencies, and then start the server themselves.

    enhancement help wanted good first issue 
    opened by christofferaakre 0
  • Make the app look nice with CSS/JS/whatever

    Currently, the service looks quite bad, and could look a lot better with some polish on the css/js/etc. The relevant files to look at are:

    • templates/books.html
    • templates/header.html
    • templates/upload_file.html
    • static/css/style.css
    • static/css/books.css
    • static/css/upload_file.css
    enhancement help wanted good first issue 
    opened by christofferaakre 0
Owner
Christoffer Aakre