Some bits of javascript to transcribe scanned pages using PageXML

Overview

nashi (nasḫī)

Some bits of javascript to transcribe scanned pages using PageXML. Both ltr and rtl languages are supported. Try it! But wait, there's more: download now and get a complete webapp written in Python/Flask that handles import and export of your scanned pages to and from LAREX for semi-automatic layout analysis, does the line segmentation for you (via kraken) and saves your precious PageXML in a database. All you've got to do is follow the instructions below and help me implement all the missing features... OCR training and recognition is currently not included because of our webhost's limited capacity.

Instructions for nashi.html

  • Put nashi.html in a folder with (or some folder above) your PageXML files (containing line segmentation data) and the page images. Serve the folder in a webserver of your choice or simply use the file:// protocol (only supported in Firefox at the moment).
  • In the browser, open the interface as .../path/to/nashi.html?pagexml=Test.xml&direction=rtl where Test.xml (or subfolder/Test.xml) is one of the PageXML files and rtl (or ltr) indicates the main direction of your text.
  • Install the "Andron Scriptor Web" font to use the additional range of characters.

The interface

  • Lines without existing text are marked red, lines containing OCR data blue and lines already transcribed are coloured green.

Keyboard shortcuts in the text input area

  • Tab/Shift+Tab switches to the next/previous input.
  • Shift+Enter saves the edits for the current line.
  • Shift+Insert shows an additional range of characters to select as an alternative to the character next to the cursor. Input one of them using the corresponding number while holding Insert.
  • Shift+ArrowDown opens a new comment field (Shift+ArrowUp switches back to the transcription line).

Global keyboard shortcuts

  • Ctrl+Space Zooms in to line width
  • Ctrl+Shift+Space toggles zoom mode (always zoom in to line width)
  • Shift+PageUp/PageDown loads the next/previous page if the filenames of your PageXML files contain the number.
  • Ctrl+Shift+ArrowLeft/ArrowRight changes orientation and input direction to ltr/rtl.
  • Ctrl+S downloads the PageXML file.
  • Ctrl+E enters or exits polygon edit mode.

Edit mode

  • Click on line area to activate point handles. Points can be moved around using, new points can be created by drawing the borders between existing points.
  • If points or lines are active, they can be deleted using the "Delete"-key.
  • Hold Shift-key and draw to select multiple points
  • New text lines can be created by clicking inside an existing text region and drawing a rectangle. New lines are always added at the end of the region.

Instructions for the server

  • Install redis. The app uses celery as a task queue for line segmentation jobs (and probably OCR jobs in the future).
  • Install LAREX for semi-automatic layout analysis.
  • Install the server from this repository or from pypi:
pip install nashi
  • Create a config.py file. For more options see the file default_settings.py. If you want the app to send emails to users, change the mail settings there. Here is just a minimal example:
BOOKS_DIR = "/home/username/books/"
LAREX_DIR = "/home/username/larex_books/"
  • Set an environment variable containing your database url. If you don't, nashi will create a sqlite database called "test.db" in your working directory.
export DATABASE_URL="mysql+pymysql://user:pw@localhost/mydb?charset=utf8"
  • Create the database tables (and users, if needed) from a python prompt. Login is disabled in the default config file.
from nashi import user_datastore
from nashi.database import db_session, init_db
init_db()
user_datastore.create_user(email="[email protected]", password="secret")
db_session.commit()
  • Run the celery worker:
export NASHI_SETTINGS=/home/user/path/to/config.py
celery -A nashi.celery worker --loglevel=info
  • Run the app, don't forget to export your DATABASE_URl again if you're using a new terminal:
export FLASK_APP=nashi
export NASHI_SETTINGS=/home/user/path/to/config.py
flask run
  • Open localhost:5000, log in, update your books list via "Edit, Refresh".

Planned features

  • Sorting of lines
  • Reading order
  • Creation and correction of regions
  • API for external OCR service
  • Advanced text editing capabilities
  • Help, examples, and documentation
  • Artificial general intelligence that writes the code for me
You might also like...
~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.
~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.

cosc428-structor I had an open-ended Computer Vision assignment to complete, and an out-of-copyright book that I wanted to turn into an ebook. Convent

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
Pure Javascript OCR for more than 100 Languages 📖🎉🖥

Version 2 is now available and under development in the master branch, read a story about v2: Why I refactor tesseract.js v2? Check the support/1.x br

Satoshi is a discord bot template in python using discord.py that allow you to track some live crypto prices with your own discord bot.

Satoshi ~ DiscordCryptoBot Satoshi is a simple python discord bot using discord.py that allow you to track your favorites cryptos prices with your own

Some codes from PyImageSearch course's and external projects.
Some codes from PyImageSearch course's and external projects.

👨‍💻 Some codes and projects 👨‍💻 💡 Technologies 📜 Projects 📍 Chrome Dinosaur Controller 📦 Script 📍 Coins Counter 📦 Script 🤓 Author Lucas Biv

Some Boring Research About Products Recognition 、Duplicate Img Detection、Img Stitch、OCR

Products Recognition 介绍 商品识别,围绕在复杂的商场零售场景中,识别出货架图像中的商品信息。主要组成部分: 重复图像检测。【更新进度 4/10】 图像拼接。【更新进度 0/10】 目标检测。【更新进度 0/10】 商品识别。【更新进度 1/10】 OCR。【更新进度 1/10】

A Screen Translator/OCR Translator made by using Python and Tesseract, the user interface are made using Tkinter. All code written in python.

About An OCR translator tool. Made by me by utilizing Tesseract, compiled to .exe using pyinstaller. I made this program to learn more about python. I

This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

Perspective recovery of text using transformed ellipses

unproject_text Perspective recovery of text using transformed ellipses. See full writeup at https://mzucker.github.io/2016/10/11/unprojecting-text-wit

Text page dewarping using a "cubic sheet" model

page_dewarp Page dewarping and thresholding using a "cubic sheet" model - see full writeup at https://mzucker.github.io/2016/08/15/page-dewarping.html

Comments
  • Missing files in sdist

    Missing files in sdist

    It appears that the manifest is missing at least one file necessary to build from the sdist for version 0.0.35. You're in good company, about 5% of other projects updated in the last year are also missing files.

    + /tmp/venv/bin/pip3 wheel --no-binary nashi -w /tmp/ext nashi==0.0.35
    Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
    Collecting nashi==0.0.35
      Downloading http://10.10.0.139:9191/root/pypi/%2Bf/13a/32a6295258b7c/nashi-0.0.35.tar.gz (46 kB)
        ERROR: Command errored out with exit status 1:
         command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-egcssbf6/nashi/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-egcssbf6/nashi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-egcssbf6/nashi/pip-egg-info
             cwd: /tmp/pip-wheel-egcssbf6/nashi/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-wheel-egcssbf6/nashi/setup.py", line 3, in <module>
            with open("../README.md") as f:
        FileNotFoundError: [Errno 2] No such file or directory: '../README.md'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    opened by thatch 2
  • Fixed download shortcut (CTRL+s)

    Fixed download shortcut (CTRL+s)

    Fixed function call for the download XML shortcut (CTRL+s)

    download_shortcut

    Issue:

    • nashi will not download the PAGE XML with the CTRL+s shortcut described in the nashi README

    How to reproduce:

    • Open nashi and hit CTRL+s
    opened by Nesbi 0
  • Starting Nashi

    Starting Nashi

    I've cloned all the files from github to the computer. The instructions state that to use the file protocol Firefox must be utilized. I've tried to reach it but whenever I give in the protocol plus the path to Nashi, the page stays blank and nothing happens. Is there a step that's missing?

    opened by jpb-badw 1
Owner
Andreas Büttner
Andreas Büttner
Extract tables from scanned image PDFs using Optical Character Recognition.

ocr-table This project aims to extract tables from scanned image PDFs using Optical Character Recognition. Install Requirements Tesseract OCR sudo apt

Abhijeet Singh 209 Dec 6, 2022
A post-processing tool for scanned sheets of paper.

unpaper Originally written by Jens Gulden — see AUTHORS for more information. Licensed under GNU GPL v2 — see COPYING for more information. Overview u

null 27 Dec 7, 2022
Library used to deskew a scanned document

Deskew //Note: Skew is measured in degrees. Deskewing is a process whereby skew is removed by rotating an image by the same amount as its skew but in

Stéphane Brunner 273 Jan 6, 2023
Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.

Deskew by Marek Mauder https://galfar.vevb.net/deskew https://github.com/galfar/deskew v1.30 2019-06-07 Overview Deskew is a command line tool for des

Marek Mauder 127 Dec 3, 2022
Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

Jainam Shah 243 Dec 30, 2022
Python library to extract tabular data from images and scanned PDFs

Overview ExtractTable - API to extract tabular data from images and scanned PDFs The motivation is to make it easy for developers to extract tabular d

Org. Account 165 Dec 31, 2022
A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control o

Maxim 32 Jul 24, 2022
Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

doc2text doc2text extracts higher quality text by fixing common scan errors Developing text corpora can be a massive pain in the butt. Much of the tex

Joe Sutherland 1.3k Jan 4, 2023
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. ocrmypdf # it's a scriptable c

jbarlow83 7.9k Jan 3, 2023
Recognizing the text contents from a scanned visiting card

Recognizing the text contents from a scanned visiting card. The application which is used to recognize the text from scanned images,printeddocuments,r

Faizan Habib 1 Jan 28, 2022