An expandable and scalable OCR pipeline

Last update: Jan 4, 2023

Overview

Nidaba is the central controller for the entire OGL OCR pipeline. It oversees and automates the process of converting raw images into citable collections of digitized texts.

It offers the following functionality:

Grayscale Conversion
Binarization utilizing Sauvola adaptive thresholding, Otsu, or ocropus's nlbin algorithm
Deskewing
Dewarping
Integration of tesseract, kraken, and ocropus OCR engines
Page segmentation from the aforementioned OCR packages
Various postprocessing utilities like spell-checking, merging of multiple results, and ground truth comparison.
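
Of the binarization methods listed above, Otsu's is the simplest to illustrate. The following is a minimal pure-Python sketch of the textbook algorithm, not Nidaba's own implementation; the function names `otsu_threshold` and `binarize` and the flat list-of-pixels input are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the 8-bit gray level that maximizes between-class variance."""
    # Histogram of gray levels 0..255.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold = 0
    best_variance = -1.0
    weight_bg = 0  # number of pixels at or below the candidate level
    sum_bg = 0     # sum of the gray levels of those pixels

    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is background; nothing left to separate
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A global threshold like this works best on evenly lit scans; adaptive methods such as Sauvola compute a threshold per neighborhood instead.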

Because it is designed around a common storage medium on network-attached storage and the celery distributed task queue, it scales nicely to multi-machine clusters.

Build

The easiest way to install the latest stable(-ish) nidaba is from PyPI:

$ pip install nidaba

or run:

$ pip install .

in the git repository for the bleeding edge development version.

Some useful tasks have external dependencies. A good start is:

# apt-get install libtesseract3 tesseract-ocr-eng libleptonica-dev liblept

Tests

By default, the dictionaries and OCR models necessary to run the tests are not installed. To download the necessary files, run:

$ python setup.py download

Afterwards, the test suite can be run:

$ python setup.py nosetests

Tests for modules that call external programs (currently only tesseract, ocropus, and kraken) will be skipped if these programs aren't installed.

Running

First edit (the installed) nidaba.yaml and celery.yaml to fit your needs. Have a look at the docs if you haven't set up a celery-based application before.

Then start up the celery daemon with something like:

$ celery -A nidaba worker

Next, jobs can be added to the pipeline using the nidaba executable:

$ nidaba batch -b otsu -l tesseract -o tesseract:eng -- ./input.tiff
Preparing filestore             [✓]
Building batch                  [✓]
951c57e5-f8a0-432d-8d77-8a2e27fff53c

Using the returned job ID, the current state of the job can be retrieved:

$ nidaba status 951c57e5-f8a0-432d-8d77-8a2e27fff53c
PENDING

When the job has been processed, the status command will return a list of paths containing the final output:

$ nidaba status 951c57e5-f8a0-432d-8d77-8a2e27fff53c
SUCCESS
14.tif → .../input_img.rgb_to_gray_binarize.otsu_ocr.tesseract_grc.tif.hocr
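
For scripting, the status output can be read back programmatically. The following is a minimal sketch that assumes the output keeps exactly the layout shown above (a state line followed by `input → output` lines), which may differ between nidaba versions; `parse_status` and `fetch_status` are hypothetical helper names, not part of nidaba.

```python
import subprocess


def parse_status(output):
    """Split `nidaba status` output into (state, {input: output_path})."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty status output")
    state = lines[0]
    results = {}
    for line in lines[1:]:
        # Result lines are assumed to look like "<input> → <output path>".
        source, sep, target = line.partition("→")
        if sep:
            results[source.strip()] = target.strip()
    return state, results


def fetch_status(job_id):
    """Run the nidaba executable and parse what it prints (hypothetical wrapper)."""
    proc = subprocess.run(
        ["nidaba", "status", job_id],
        capture_output=True, text=True, check=True,
    )
    return parse_status(proc.stdout)
```

A caller could poll `fetch_status` until the state is no longer `PENDING` and then collect the output paths from the returned mapping.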

Documentation

Want to learn more? Read the Docs

Comments

cannot install with python 3.7

  Building wheel for pyxDamerauLevenshtein (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/nidaba/nidaba/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/setup.py'"'"'; __file__='"'"'/tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-wc90vya7
       cwd: /tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/

[...]

/home/nidaba/nidaba/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
   #warning "Using deprecated NumPy API, disable it with " \
    ^~~~~~~
  pyxdameraulevenshtein/pyxdameraulevenshtein.c: In function ‘__Pyx_GetException’:
  pyxdameraulevenshtein/pyxdameraulevenshtein.c:5209:24: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_type’; did you mean ‘curexc_type’?
       tmp_type = tstate->exc_type;

====

Any idea?

opened by yurj 4

Error: Invalid value for "--ocr" / "-o": Positional arguments are deprecated!

When I use the following command to initialize a nidaba batch:

nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:greek -p spell_check:polytonic_greek phaistos/OCR/OGL/septuagint-dev/raw_ocr/hvd.swete.1.1901.kraken/*.pbm.png

I get the following error message:

Usage: nidaba batch [OPTIONS] FILES...

Error: Invalid value for "--ocr" / "-o": Positional arguments are deprecated!

Has something changed with the API? This command worked without a problem before. I am using nidaba 0.9.3 and kraken 0.4.2. You can check out my nidaba.yaml file on Homer at /home/mmunson/envs/nidaba/etc/nidaba/nidaba.yaml. The command also no longer works on the Tufts cluster, where it worked before.
opened by sonofmun 4

pip install . problem

When installing with pip install ., I get the following error message:

Downloading/unpacking pyxDamerauLevenshtein==1.3.1 (from nidaba==0.3.14)
  Downloading pyxDamerauLevenshtein-1.3.1.tar.gz (51kB): 51kB downloaded
  Running setup.py (path:/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py) egg_info for package pyxDamerauLevenshtein
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py", line 28, in <module>
        import numpy
    ImportError: No module named numpy
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py", line 28, in <module>

    import numpy

ImportError: No module named numpy

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein
Storing debug log for failure in /cluster/home/mmunso01/.pip/pip.log

By running pip install numpy first and then running pip install ., I was able to overcome this problem. I also had this problem with PyTables so it could be that pyxDamerauLevenshtein needs to be detached and installed later (or numpy detached and installed earlier).
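
The workaround described above can be scripted. This is a minimal sketch, assuming numpy is the only package that must be present before the others are built; `pip_commands` and `install` are hypothetical helpers, not part of nidaba or pip.

```python
import subprocess
import sys


def pip_commands(targets, build_deps=("numpy",)):
    """Build one pip invocation per step, with build-time dependencies first."""
    steps = list(build_deps) + [t for t in targets if t not in build_deps]
    return [[sys.executable, "-m", "pip", "install", step] for step in steps]


def install(targets):
    """Run the steps in order; stops at the first failing command."""
    for command in pip_commands(targets):
        subprocess.check_call(command)
```

Calling `install(["."])` from the repository root would first install numpy and then nidaba itself, mirroring the manual two-step workaround.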

opened by sonofmun 4

Can't download dictionaries/models

When I run python setup.py download, I get an invalid command error:

python setup.py download
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: invalid command 'download'

Additionally, trying to download these dependencies manually, accessing http://l.unchti.me/nidaba/MANIFEST results in a 404 not found error.
opened by ryanfb 4

Binarization fails

When I run the command

nidaba batch --binarize sauvola:10,20,30,40 --ocr tesseract:grc+eng --willitblend -- /home/mmunson/ddd/extracted_books/uc1.b4034434/*.png

I get the following error message for every image:

Error in pixSauvolaBinarize: whsize too large for image
[2015-05-07 14:41:31,041: ERROR/MainProcess] Task nidaba.binarize.sauvola[28767f13-0328-4efb-8585-083f56555a25] raised unexpected: NidabaLeptonicaException('Binarization failed for unknownreason.',)
Traceback (most recent call last):
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/tasks/helper.py", line 61, in __call__
    ret = super(NidabaTask, self).__call__(*args, **nkwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/plugins/leptonica.py", line 69, in sauvola
    lept_sauvola(input_path, output_path, whsize, factor)
  File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/plugins/leptonica.py", line 111, in lept_sauvola
    raise NidabaLeptonicaException('Binarization failed for unknown'
NidabaLeptonicaException: Binarization failed for unknownreason.

opened by sonofmun 3
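
The "whsize too large for image" message suggests that the Sauvola window does not fit inside the input image. The sketch below checks this before a job is submitted. The bound it uses (window half-size of at least 2, and both image dimensions of at least `2 * whsize + 3`) is an assumption about Leptonica's check and should be verified against the installed version; both function names are hypothetical.

```python
def sauvola_window_fits(width, height, whsize):
    """Return True if a window of half-size `whsize` fits the assumed bound."""
    # Assumed bound: whsize >= 2 and both dimensions >= 2 * whsize + 3.
    return whsize >= 2 and min(width, height) >= 2 * whsize + 3


def largest_whsize(width, height):
    """Largest half-size satisfying the assumed bound, or None if none does."""
    limit = (min(width, height) - 3) // 2
    return limit if limit >= 2 else None
```

Under that assumption, a 20 x 20 pixel image would accept a half-size of at most 8, so a value such as 10 would be rejected.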

File required for tests doesn't exist anymore

From docs:

Per default no dictionaries and OCR models necessary to runs the tests are installed. To download the necessary files run:

$ python setup.py download

Afterwards, the test suite can be run:

$ python setup.py nosetests

python setup.py download tries to load archive http://l.unchti.me/nidaba/tests.tar.bz2, but the link is broken

opened by vlivashkin 2

Sending multiple languages to tesseract

Is it still possible to send multiple languages to tesseract ocr? When I use

-o tesseract:languages=grc+eng,extended=True

as a switch, I get the error message

File "/cluster/home/mmunso01/envs/nidaba/bin/nidaba", line 11, in <module>
    sys.exit(main())
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/cli.py", line 240, in batch
    batch.add_task('ocr', alg[0], **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/nidaba.py", line 662, in add_task
    task_arg_validator(task.get_valid_args(), **kwargs)
  File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/nidaba.py", line 72, in task_arg_validator
    raise NidabaInputException('{} not in list of valid values'.format(val))
nidaba.nidabaexceptions.NidabaInputException: grc+eng not in list of valid values

I see that the languages are supposed to be a list in the documentation, but I am not sure how to get nidaba to recognize a list of languages from the command line.

opened by sonofmun 2
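
To make the question concrete, here is a minimal sketch of how a command-line value such as `tesseract:languages=grc+eng,extended=True` could be turned into a list of languages. This is not nidaba's actual parser; `parse_task_spec`, the `+` separator for lists, and the `list_keys` parameter are assumptions made for illustration.

```python
def parse_task_spec(spec, list_keys=("languages",)):
    """Parse "name:key=value,key=value" into (name, kwargs)."""
    name, _, arg_string = spec.partition(":")
    kwargs = {}
    for item in filter(None, arg_string.split(",")):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got {!r}".format(item))
        if key in list_keys:
            # Assumed convention: "+" separates the entries of a list value.
            kwargs[key] = value.split("+")
        elif value in ("True", "False"):
            kwargs[key] = value == "True"
        else:
            kwargs[key] = value
    return name, kwargs
```

With a parser of this shape, the argument validator would receive `["grc", "eng"]` and could check each language separately instead of rejecting the combined string `grc+eng`.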

Can't prepare filestore

When I use the following command

nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:model=greek -p spell_check:dict=polytonic_greek -- /home/mmunson/phaistos/OCR/in_progress/OCR/hvd.swete.1.1901.kraken/*.png

I get the following response

Preparing filestore             [✗]

And then the job exits. The permissions on the destination folder (phaistos/OCR/in-progress/OCR) are 777, so it doesn't appear to be a permissions problem.

opened by sonofmun 2

Doesn't produce output

When I run this command:

nidaba batch -b otsu -l tesseract -o tesseract:grc+eng,extended=True -p spell_check:polytonic_greek -- /cluster/tufts/perseus_ocr/nidaba/teubner/ammonius_1966/*.tif

It produces the filestore and it gets through the rgb-to-gray and the binarization steps, but then it seems to hang and does not produce any OCR output. When I request the status, it says that it is pending. I will send my celery log file by email since I don't know how to attach it to this issue.
opened by sonofmun 2

When segmentation jobs fail, batch crashes

This may also be the case when other jobs fail. I have noticed that it happens as soon as I get a NidabaTesseractException, which, I think, is raised because Tesseract segmentation fails on empty pages. At that point, the segmentation jobs that are already in the queue appear to finish, but the jobs that follow segmentation do not even start, even for the pages that did not get a segmentation error.

opened by sonofmun 1
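
The behavior asked for here, where one failing page does not block the others, can be sketched without celery. The code below only illustrates the idea and says nothing about how nidaba chains its tasks; `run_pages` and the `segment` and `recognize` callables are hypothetical stand-ins.

```python
def run_pages(pages, segment, recognize):
    """Process each page independently and collect failures per page."""
    results, failures = {}, {}
    for page in pages:
        try:
            results[page] = recognize(segment(page))
        except Exception as exc:
            # In nidaba this would be, for example, a NidabaTesseractException.
            # Recording it instead of re-raising lets the other pages continue.
            failures[page] = exc
    return results, failures
```

The key design choice is that errors are stored next to the page that caused them, so the caller can report empty pages at the end while still getting OCR output for everything else.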

Problem with spell checking

When I run the command

nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:model=migne -o kraken:model=migne-njp -o kraken:model=omnibus -p spell_check:language=polytonic_greek,filter_punctuation=False -f tei2txt -- *.png

on Homer, I get the error:

postprocessing.spell_check (2, n): AttributeError: 'list' object has no attribute 'startswith'

opened by sonofmun 0

Catalog integration groundwork

Right now catalog integration is completely missing, except for the rather basic metadata task. Let's think about how to generate MODS/MADS records for batches, or TEI metadata from MADS/MODS records, or some weird combination of the above.

opened by mittagessen 0

Need unified terms for language models

When using Tesseract, the languages used are designated by the "languages" keyword, whereas with kraken/OCRopus they are designated as models. Perhaps this isn't a big deal, but I think it would be nice to simply have a single keyword here so that I can type tesseract:model=grc and kraken:model=grc and have them both work.
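
One way to picture the proposal is an alias table that maps a shared keyword onto each engine's own argument name. This is a minimal sketch of the idea only; `ALIASES` and `normalize_kwargs` are hypothetical and do not exist in nidaba.

```python
# Hypothetical alias table: shared keyword -> engine-specific keyword.
ALIASES = {
    "tesseract": {"model": "languages"},
}


def normalize_kwargs(engine, kwargs):
    """Rename shared keywords to the names the given engine expects."""
    mapping = ALIASES.get(engine, {})
    return {mapping.get(key, key): value for key, value in kwargs.items()}
```

With such a table, `tesseract:model=grc` and `kraken:model=grc` could both be accepted on the command line, while each engine still receives the keyword it already understands.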

opened by sonofmun 0

Spell-checker produces correction candidates for valid words

When using Kraken and the spell-checker flag, the spell-checker produces correction candidates even for words that should be in the dictionary, e.g., ἀλλὰ.

opened by sonofmun 0
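
One possible cause of a symptom like this, not confirmed by the report, is a Unicode normalization mismatch: polytonic Greek can be encoded with precomposed characters (NFC) or with base letters plus combining accents (NFD), and the two forms do not compare equal. The sketch below shows a lookup that normalizes both sides first; `build_dictionary` and `in_dictionary` are hypothetical helpers.

```python
import unicodedata


def build_dictionary(words, form="NFC"):
    """Store every dictionary entry in a single normalization form."""
    return {unicodedata.normalize(form, word) for word in words}


def in_dictionary(word, dictionary, form="NFC"):
    """Normalize the query the same way before looking it up."""
    return unicodedata.normalize(form, word) in dictionary
```

If the OCR engine emits decomposed characters while the dictionary holds precomposed ones, a plain membership test would miss words that are in fact present, whereas `in_dictionary` would find them.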
