An expandable and scalable OCR pipeline

Overview

Nidaba is the central controller for the entire OGL OCR pipeline. It oversees and automates the process of converting raw images into citable collections of digitized texts.

It offers the following functionality:

  • Grayscale Conversion
  • Binarization utilizing Sauvola adaptive thresholding, Otsu, or ocropus's nlbin algorithm
  • Deskewing
  • Dewarping
  • Integration of tesseract, kraken, and ocropus OCR engines
  • Page segmentation from the aforementioned OCR packages
  • Various postprocessing utilities like spell-checking, merging of multiple results, and ground truth comparison.
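The spell-checking utility is built around a deletion dictionary: alongside the list of valid words it keeps an index of all variants obtained by deleting single characters (cf. the make_deldict helper referenced in the configuration comments), which makes small-edit-distance lookups cheap. A minimal, self-contained sketch of the idea — function names here are illustrative, not nidaba's API:

```python
def deletions(word):
    """All variants of a word with exactly one character removed."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_deldict(dictionary):
    """Map every single-deletion variant back to the words it came from."""
    deldict = {}
    for word in dictionary:
        for variant in deletions(word):
            deldict.setdefault(variant, set()).add(word)
    return deldict

def suggest(token, dictionary, deldict):
    """Candidates reachable by at most one deletion on each side
    (covers single insertions, deletions, and substitutions)."""
    if token in dictionary:
        return {token}
    candidates = set()
    # Token has one extra character: one of its deletions is a valid word.
    candidates |= deletions(token) & set(dictionary)
    # Token is missing a character or has one wrong: match via the index.
    for variant in {token} | deletions(token):
        candidates |= deldict.get(variant, set())
    return candidates

words = {'cat', 'cart', 'art'}
deldict = build_deldict(words)
print(suggest('crt', words, deldict))  # all three words are one edit away
```

Precomputing the deletion index trades memory for speed: lookup touches only a handful of hash buckets instead of scanning the whole dictionary.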

As it is designed around a common storage medium on network-attached storage and the celery distributed task queue, it scales nicely to multi-machine clusters.

Build

The easiest way to install the latest stable(-ish) nidaba is from PyPI:

$ pip install nidaba

or run:

$ pip install .

in the git repository for the bleeding edge development version.

Some useful tasks have external dependencies. A good start is:

# apt-get install libtesseract3 tesseract-ocr-eng libleptonica-dev liblept

Tests

By default, the dictionaries and OCR models necessary to run the tests are not installed. To download the necessary files run:

$ python setup.py download
$ python setup.py nosetests

Tests for modules that call external programs (at the moment only tesseract, ocropus, and kraken) will be skipped if these aren't installed.
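This kind of conditional skipping can be sketched with the standard library alone — a sketch that simply looks the binary up on PATH, not nidaba's actual test harness:

```python
import shutil
import unittest

# Skip the whole test case when the external binary is absent from PATH.
@unittest.skipIf(shutil.which('tesseract') is None,
                 'tesseract is not installed')
class TesseractTests(unittest.TestCase):
    def test_binary_resolves(self):
        # Only runs when tesseract is actually on PATH.
        self.assertIsNotNone(shutil.which('tesseract'))
```

Skipped tests are reported as such rather than silently passing, so a missing dependency stays visible in the test output.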

Running

First, edit the installed nidaba.yaml and celery.yaml to fit your needs. Have a look at the docs if you haven't set up a celery-based application before.

Then start up the celery daemon with something like:

$ celery -A nidaba worker

Jobs can then be added to the pipeline using the nidaba executable:

$ nidaba batch -b otsu -l tesseract -o tesseract:eng -- ./input.tiff
Preparing filestore             [✓]
Building batch                  [✓]
951c57e5-f8a0-432d-8d77-8a2e27fff53c

Using the returned batch identifier, the current state of the job can be retrieved:

$ nidaba status 25d79a54-9d4a-4939-acb6-8e168d6dbc7c
PENDING

When the job has been processed, the status command will return a list of paths containing the final output:

$ nidaba status 951c57e5-f8a0-432d-8d77-8a2e27fff53c
SUCCESS
14.tif → .../input_img.rgb_to_gray_binarize.otsu_ocr.tesseract_grc.tif.hocr
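The resulting hOCR file is ordinary (X)HTML, so the recognized text can be pulled out with standard tools. A minimal sketch using only the standard library — the ocrx_word class comes from the hOCR format, but the sample markup and class names below are illustrative:

```python
from html.parser import HTMLParser

class HocrTextExtractor(HTMLParser):
    """Collect the text of every hOCR word (class="ocrx_word")."""
    def __init__(self):
        super().__init__()
        self.in_word = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        # hOCR marks recognized words with the ocrx_word class.
        if ('class', 'ocrx_word') in attrs:
            self.in_word = True

    def handle_data(self, data):
        if self.in_word and data.strip():
            self.words.append(data.strip())
            self.in_word = False

sample = ('<div class="ocr_line">'
          '<span class="ocrx_word">Preparing</span> '
          '<span class="ocrx_word">filestore</span>'
          '</div>')
p = HocrTextExtractor()
p.feed(sample)
print(' '.join(p.words))  # Preparing filestore
```

For real output, positional metadata (bounding boxes in the title attribute) is available too; this sketch only recovers the plain text.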

Documentation

Want to learn more? Read the Docs

Comments
  • Errors with Kraken

    When I use the following command:

    nidaba batch -b otsu -l tesseract -o kraken:grc_teubner -p spell_check:polytonic_greek -- /cluster/tufts/perseus_ocr/nidaba/teubner/ammonius_1966/*.tif
    

    I get the following error in the celery log for every page, i.e., every page fails with the same error:

    [2015-08-13 10:32:52,073: ERROR/MainProcess] Task nidaba.ocr.kraken[00db4d89-a411-4098-9036-9865acabe112] raised unexpected: AttributeError("'NoneType' object has no attribute 'predictString'",)
    Traceback (most recent call last):
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
        R = retval = fun(*args, **kwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/tasks/helper.py", line 81, in __call__
        ret = super(NidabaTask, self).__call__(*args, **nkwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
        return self.run(*args, **kwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/plugins/kraken.py", line 123, in ocr_kraken
        for rec in rpred.rpred(rnn, img, [(int(x[0]), int(x[1]), int(x[2]), int(x[3])) for x in lines]):
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/kraken/rpred.py", line 132, in rpred
        pred = network.predictString(line)
    AttributeError: 'NoneType' object has no attribute 'predictString'
    

    nidaba.yaml looks like this:

    # The home directory for Iris to store files created by OCR jobs. For example,
    # tifs, jp2s, meta.xml, and abbyy file downloaded from archive.org are stored
    # here. Each new job is automatically placed in a uniquely named directory.
    storage_path: /cluster/tufts/perseus_ocr/nidaba/OCR/
    
    # URL to the redis database. May be shared with celery.
    redis_url: 'redis://127.0.0.1:6379'
    
    # Spell check configuration. Dictionaries are kept on the common medium (i.e.
    # at STORAGE_PATH/tuple[0]/tuple[1]). Each spell checker requires a list of
    # valid words ('dictionary') and a dictionary containing all variants of words
    # attained by deletion of single characters (see nidaba.lex.make_deldict).
    lang_dicts:
      polytonic_greek: {dictionary: [dicts, greek.dic],
                        deletion_dictionary: [dicts, del_greek.dic]}
      latin: {dictionary: [dicts, latin.dic],
                        deletion_dictionary: [dicts, del_latin.dic]}
    
    # Ocropus/kraken models
    ocropus_models:
      greek: [models, omnibus-2014-05-31-10-16-00087000.pyrnn.gz]
      grc_teubner: [models, teubner-serif-2013-12-16-11-26-00067000.pyrnn.gz]
      atlantean: [models, atlantean.pyrnn.gz]
      fraktur: [models, fraktur.pyrnn.gz]
      fancy_ligatures: [models, ligatures.pyrnn.gz]
    
    # Models solely working with kraken (i.e. models in HDF5 format).
    kraken_models:
      default: [models, en-default.hdf5]
    
    
    # List of plugins to load. Additional fields in the associative array will be
    # handed over to the setup function of the plugin.  Be aware that plugins
    # utilizing external components that aren't installed will cause nidaba to
    # abort. 
    plugins_load:
      tesseract: {implementation: capi, # set to either legacy (hOCR
                                                  # output in an *.html file),
                                                  # direct (hOCR output in an
                                                  # *.hocr file), or capi
                                                  # (tesseract version >= 3.02)
                 tessdata: /cluster/tufts/perseus_ocr_code/tesseract/tessdata} # location of the tessdata
                                                     # path. May also be a storage
                                                     # tuple.
      #ocropus: {}
      kraken: {}
      #leptonica: {}
    
    opened by sonofmun 9
  • cannot install with python 3.7

      Building wheel for pyxDamerauLevenshtein (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: /home/nidaba/nidaba/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/setup.py'"'"'; __file__='"'"'/tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-wc90vya7
           cwd: /tmp/pip-install-caa2lwvy/pyxDamerauLevenshtein/
    

    [...]

    /home/nidaba/nidaba/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
       #warning "Using deprecated NumPy API, disable it with " \
        ^~~~~~~
      pyxdameraulevenshtein/pyxdameraulevenshtein.c: In function ‘__Pyx_GetException’:
      pyxdameraulevenshtein/pyxdameraulevenshtein.c:5209:24: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_type’; did you mean ‘curexc_type’?
           tmp_type = tstate->exc_type;
    

    ====

    Any idea?

    opened by yurj 4
  • Error: Invalid value for "--ocr" / "-o": Positional arguments are deprecated!

    When I use the following command to initialize a nidaba batch

    nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:greek -p spell_check:polytonic_greek phaistos/OCR/OGL/septuagint-dev/raw_ocr/hvd.swete.1.1901.kraken/*.pbm.png
    

    I get the following error message:

    Usage: nidaba batch [OPTIONS] FILES...
    
    Error: Invalid value for "--ocr" / "-o": Positional arguments are deprecated!
    

    Has something changed with the API? This command worked without a problem before. I am using nidaba 0.9.3 and kraken 0.4.2. You can check out my nidaba.yaml file on Homer at /home/mmunson/envs/nidaba/etc/nidaba/nidaba.yaml. This is also no longer working on the Tufts cluster, where it worked before.

    opened by sonofmun 4
  • pip install . problem

    When installing with pip install ., I get the following error message:

    Downloading/unpacking pyxDamerauLevenshtein==1.3.1 (from nidaba==0.3.14)
      Downloading pyxDamerauLevenshtein-1.3.1.tar.gz (51kB): 51kB downloaded
      Running setup.py (path:/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py) egg_info for package pyxDamerauLevenshtein
        Traceback (most recent call last):
          File "<string>", line 17, in <module>
          File "/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py", line 28, in <module>
            import numpy
        ImportError: No module named numpy
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
    
      File "<string>", line 17, in <module>
    
      File "/cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein/setup.py", line 28, in <module>
    
        import numpy
    
    ImportError: No module named numpy
    
    ----------------------------------------
    Cleaning up...
    Command python setup.py egg_info failed with error code 1 in /cluster/home/mmunso01/envs/nidaba/build/pyxDamerauLevenshtein
    Storing debug log for failure in /cluster/home/mmunso01/.pip/pip.log
    

    By running pip install numpy first and then running pip install ., I was able to overcome this problem. I also had this problem with PyTables so it could be that pyxDamerauLevenshtein needs to be detached and installed later (or numpy detached and installed earlier).

    opened by sonofmun 4
  • Can't download dictionaries/models

    When I run python setup.py download, I get an invalid command error:

    python setup.py download
    usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
       or: setup.py --help [cmd1 cmd2 ...]
       or: setup.py --help-commands
       or: setup.py cmd --help
    
    error: invalid command 'download'
    

    Additionally, trying to download these dependencies manually, accessing http://l.unchti.me/nidaba/MANIFEST results in a 404 not found error.

    opened by ryanfb 4
  • Binarization fails

    When I run the command

    nidaba batch --binarize sauvola:10,20,30,40 --ocr tesseract:grc+eng --willitblend -- /home/mmunson/ddd/extracted_books/uc1.b4034434/*.png
    

    I get the following error message for every image:

    Error in pixSauvolaBinarize: whsize too large for image
    [2015-05-07 14:41:31,041: ERROR/MainProcess] Task nidaba.binarize.sauvola[28767f13-0328-4efb-8585-083f56555a25] raised unexpected: NidabaLeptonicaException('Binarization failed for unknownreason.',)
    Traceback (most recent call last):
      File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
        R = retval = fun(*args, **kwargs)
      File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/tasks/helper.py", line 61, in __call__
        ret = super(NidabaTask, self).__call__(*args, **nkwargs)
      File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
        return self.run(*args, **kwargs)
      File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/plugins/leptonica.py", line 69, in sauvola
        lept_sauvola(input_path, output_path, whsize, factor)
      File "/home/mmunson/envs/nidaba/local/lib/python2.7/site-packages/nidaba/plugins/leptonica.py", line 111, in lept_sauvola
        raise NidabaLeptonicaException('Binarization failed for unknown'
    NidabaLeptonicaException: Binarization failed for unknownreason.
    
    opened by sonofmun 3
  • File required for tests doesn't exist anymore

    From docs:

    By default, the dictionaries and OCR models necessary to run the tests are not installed. To download the necessary files run:

    $ python setup.py download

    Afterwards, the test suite can be run:

    $ python setup.py nosetests

    python setup.py download tries to load archive http://l.unchti.me/nidaba/tests.tar.bz2, but the link is broken

    opened by vlivashkin 2
  • Sending multiple languages to tesseract

    Is it still possible to send multiple languages to tesseract ocr? When I use

    -o tesseract:languages=grc+eng,extended=True
    

    as a switch, I get the error message

    File "/cluster/home/mmunso01/envs/nidaba/bin/nidaba", line 11, in <module>
        sys.exit(main())
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 700, in __call__
        return self.main(*args, **kwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 680, in main
        rv = self.invoke(ctx)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 873, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/click/core.py", line 508, in invoke
        return callback(*args, **kwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/cli.py", line 240, in batch
        batch.add_task('ocr', alg[0], **kwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/nidaba.py", line 662, in add_task
        task_arg_validator(task.get_valid_args(), **kwargs)
      File "/cluster/home/mmunso01/envs/nidaba/lib/python2.7/site-packages/nidaba/nidaba.py", line 72, in task_arg_validator
        raise NidabaInputException('{} not in list of valid values'.format(val))
    nidaba.nidabaexceptions.NidabaInputException: grc+eng not in list of valid values
    

    I see that the languages are supposed to be a list in the documentation, but I am not sure how to get nidaba to recognize a list of languages from the command line.

    opened by sonofmun 2
  • Can't prepare filestore

    When I use the following command

    nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:model=greek -p spell_check:dict=polytonic_greek -- /home/mmunson/phaistos/OCR/in_progress/OCR/hvd.swete.1.1901.kraken/*.png
    

    I get the following response

    Preparing filestore             [✗]
    

    And then the job exits. The permissions on the destination folder (phaistos/OCR/in-progress/OCR) are 777, so it doesn't appear to be a permissions problem.

    opened by sonofmun 2
  • Doesn't produce output

    When I run this command:

    nidaba batch -b otsu -l tesseract -o tesseract:grc+eng,extended=True -p spell_check:polytonic_greek -- /cluster/tufts/perseus_ocr/nidaba/teubner/ammonius_1966/*.tif
    

    It produces the filestore and it gets through the rgb-to-gray and the binarization steps, but then it seems to hang and does not produce any OCR output. When I request the status, it says that it is pending. I will send my celery log file by email since I don't know how to attach it to this issue.

    opened by sonofmun 2
  • When segmentation jobs fail, batch crashes

    This may also be the case with other failing jobs, but I have noticed that as soon as I get a NidabaTesseractException (because, I think, Tesseract segmentation craps out on empty pages), the segmentation jobs that are already in the queue finish, but the jobs after the segmentation jobs do not even start, even for the pages that did not get a segmentation error.

    opened by sonofmun 1
  • Problem with spell checking

    When I run the command

    nidaba batch -b nlbin:threshold=0.5,zoom=0.5,escale=1.0,border=0.1,perc=80,range=20,low=5,high=90 -l tesseract -o kraken:model=migne -o kraken:model=migne-njp -o kraken:model=omnibus -p spell_check:language=polytonic_greek,filter_punctuation=False -f tei2txt -- *.png
    

    on Homer, I get the error:

    postprocessing.spell_check (2, n): AttributeError: 'list' object has no attribute 'startswith'
    
    opened by sonofmun 0
  • Catalog integration groundwork

    Right now catalog integration is completely missing, except for the rather basic metadata task. Let's think about how to generate MODS/MADS records for batches, or TEI metadata from MADS/MODS records, or some weird combination of the above.

    opened by mittagessen 0
  • Need unified terms for language models

    When using Tesseract, the languages used are designated by the "languages" keyword, whereas with kraken/OCRopus they are designated as models. Perhaps this isn't a big deal, but I think it would be nice to simply have a single keyword here so that I can type tesseract:model=grc and kraken:model=grc and have them both work.

    opened by sonofmun 0
  • Spell-checker produces correction candidates for valid words

    When using Kraken and the spell-checker flag, the spell-checker produces correction candidates even for words that should be in the dictionary, e.g., ἀλλὰ.

    opened by sonofmun 0