dinglehopper

An OCR evaluation tool

Overview

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a word/character differences report.

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library (see the sketch below)
  • Unicode support
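
For library use, a minimal sketch (assuming the error-rate functions are importable from the top-level package; older releases exposed them as qurator.dinglehopper):

from dinglehopper import character_error_rate, word_error_rate

gt = "Über die vielen Sorgen wegen desselben vergaß"
ocr = "Übey die vielen Sorgen wegen deffelben vcrgab"

print(character_error_rate(gt, ocr))  # character error rate (CER)
print(word_error_rate(gt, ocr))       # word error rate (WER)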

Installation

It's best to use pip, e.g.:

sudo pip install .

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results.
  In that case, use --no-metrics to disable the then meaningless metrics and
  also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character
  error rate (CER) and the word error rate (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.

dinglehopper displaying metrics and character differences

dinglehopper-extract

The tool dinglehopper-extract prints the text of the given input file to stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

| Parameter | Meaning |
|-----------|---------|
| -P metrics false | Disable metrics and the green/red color scheme (default: enabled) |
| -P textequiv_level line | (PAGE) Extract text from the TextLine level (default: TextRegion level) |

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.

Comments
  • Support comparing line GT directories with line OCR directories

    In #62, @stweil's original problem was - as I understand it - to compare a directory of line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this: fake-line-gt.zip. It looks like this:

    % ls *
    gt:
    line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
    line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt
    
    some-ocr:
    line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
    line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt
    

    A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics & differences over all of the lines. A first idea for the CLI interface:

    dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt
    

    I'm not sure if this will be the final CLI interface, but it's what seems necessary at first glance. A possible pairing step is sketched below.
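
    A minimal sketch of that pairing step (the function name and behaviour are illustrative, not a final implementation; suffixes are assumed to be non-empty):

    from pathlib import Path

    def pair_line_files(gt_dir, gt_suffix, ocr_dir, ocr_suffix):
        """Yield (gt_file, ocr_file) pairs matched by their common filename stem."""
        gt = {p.name[:-len(gt_suffix)]: p for p in Path(gt_dir).glob("*" + gt_suffix)}
        ocr = {p.name[:-len(ocr_suffix)]: p for p in Path(ocr_dir).glob("*" + ocr_suffix)}
        for stem in sorted(gt.keys() & ocr.keys()):
            yield gt[stem], ocr[stem]

    # pair_line_files("gt", ".gt.txt", "some-ocr", ".some-ocr.txt") would yield
    # (gt/line001.gt.txt, some-ocr/line001.some-ocr.txt), etc. for the data above.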

    enhancement 
    opened by mikegerber 22
  • Switch from custom Levenshtein to python-Levenshtein

    As the distance and editops calculation is a performance bottleneck in this application, I replaced the custom Levenshtein implementation with the C implementation in the python-Levenshtein package.

    We now also have separate entrypoints for texts with and without Unicode normalization. For example, when calculating the flexible character accuracy in #47, the normalization can be done once, more efficiently, during preprocessing, as sketched below.
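
    A minimal sketch of that idea (the non-normalizing entrypoint named in the comment is hypothetical):

    import unicodedata

    gt_text = "vergaß"   # illustrative inputs
    ocr_text = "vcrgab"

    # Normalize once during preprocessing ...
    gt_norm = unicodedata.normalize("NFC", gt_text)
    ocr_norm = unicodedata.normalize("NFC", ocr_text)

    # ... then call the non-normalizing entrypoint as often as needed, e.g.:
    # distance_unnormalized(gt_norm, ocr_norm)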

    Here are some benchmarks for the behaviour before and after this change using data already available in the repository:

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:---|---:|---:|---:|---:|
    | before brochrnx_73075507X | 3.359 ± 0.039 | 3.305 | 3.408 | 8.29 ± 0.13 |
    | after brochrnx_73075507X | 0.405 ± 0.004 | 0.399 | 0.409 | 1.00 |
    | before actevedef_718448162 CALAMARI | 34.410 ± 0.561 | 33.918 | 35.362 | 84.91 ± 1.61 |
    | after actevedef_718448162 CALAMARI | 0.926 ± 0.010 | 0.911 | 0.935 | 2.28 ± 0.03 |
    | before actevedef_718448162 TESS | 34.103 ± 0.305 | 33.685 | 34.529 | 84.16 ± 1.11 |
    | after actevedef_718448162 TESS | 0.909 ± 0.008 | 0.899 | 0.921 | 2.24 ± 0.03 |

    Boxplot generated by hyperfine

    Note that with this change we can only compare documents with a total maximum of 1 114 111 unique characters (or words). The limit comes from Python's chr function, but I suspect this will not be an issue now or in the near future.
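
    For context: comparing word sequences with a character-based C routine presumably works by mapping each unique word to a unique codepoint, which is where the chr limit comes from. A sketch of that trick (an assumption about the internals):

    def encode(words, mapping):
        """Map each unique word to a unique single character via chr()."""
        for w in words:
            if w not in mapping:
                mapping[w] = chr(len(mapping))  # chr() accepts code points up to 1 114 111
        return "".join(mapping[w] for w in words)

    mapping = {}
    s1 = encode("the quick fox".split(), mapping)
    s2 = encode("the quick brown fox".split(), mapping)
    # s1 and s2 can now be aligned by any character-based Levenshtein implementation.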

    opened by b2m 20
  • Sort textlines with missing indices

    Python's sorted function fails with a TypeError when called with a mix of None and integers:

    >>> sorted([None, 1])
    TypeError: '<' not supported between instances of 'int' and 'NoneType'
    

    Therefore we use float('inf') instead of None in case of missing textline indices.
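
    A minimal sketch of the workaround (the line objects here are illustrative dicts):

    lines = [{"index": None, "text": "b"}, {"index": 0, "text": "a"}]

    # Substitute float('inf') for a missing index so sorted() has a total order;
    # lines without an index simply sort last.
    lines.sort(key=lambda line: line["index"] if line["index"] is not None else float("inf"))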

    opened by b2m 11
  • dinglehopper keeps hanging and test errors

    Running dinglehopper gt txt and dinglehopper-line-dirs keeps hanging without any message, and pytest returns errors:

    collected 62 items / 18 deselected / 44 selected                                                   
    
    qurator/dinglehopper/tests/extracted_text_test.py .............                              [ 29%]
    qurator/dinglehopper/tests/test_align.py .......F..                                          [ 52%]
    qurator/dinglehopper/tests/test_character_error_rate.py ..                                   [ 56%]
    qurator/dinglehopper/tests/test_edit_distance.py .                                           [ 59%]
    qurator/dinglehopper/tests/test_editops.py ..                                                [ 63%]
    qurator/dinglehopper/tests/test_ocr_files.py .............                                   [ 93%]
    qurator/dinglehopper/tests/test_word_error_rate.py ...                                       [100%]
    
    ============================================= FAILURES =============================================
    __________________________________ test_with_some_fake_ocr_errors __________________________________
    
        def test_with_some_fake_ocr_errors():
    >       result = list(
                align(
                    "Über die vielen Sorgen wegen desselben vergaß",
                    "SomeJunk MoreJunk Übey die vielen Sorgen wegen AdditionalJunk deffelben vcrgab",
                )
            )
    
    qurator/dinglehopper/tests/test_align.py:70: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    s1 = ['Ü', 'b', 'e', 'r', ' ', 'd', ...], s2 = ['S', 'o', 'm', 'e', 'J', 'u', ...]
    
        def seq_align(s1, s2):
            """Align general sequences."""
            s1 = list(s1)
            s2 = list(s2)
            ops = levenshtein_editops(s1, s2)
            i = 0
            j = 0
        
            while i < len(s1) or j < len(s2):
                o = None
                try:
                    ot = ops[0]
                    if ot[1] == i and ot[2] == j:
                        ops = ops[1:]
                        o = ot
                except IndexError:
                    pass
        
                if o:
                    if o[0] == "insert":
                        yield None, s2[j]
                        j += 1
                    elif o[0] == "delete":
                        yield s1[i], None
                        i += 1
                    elif o[0] == "replace":
                        yield s1[i], s2[j]
                        i += 1
                        j += 1
                else:
    >               yield s1[i], s2[j]
    E               IndexError: list index out of range
    
    qurator/dinglehopper/align.py:42: IndexError
    ===================================== short test summary info ======================================
    FAILED qurator/dinglehopper/tests/test_align.py::test_with_some_fake_ocr_errors - IndexError: lis...
    =========================== 1 failed, 43 passed, 18 deselected in 30.24s ===========================
    

    pytest also gets stuck with: qurator/dinglehopper/tests/test_integ_table_extraction.py ..... [ 83%] qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py ..

    Python version: 3.9.0. Thanks.

    bug 
    opened by whisere 10
  • Add flexible character accuracy

    This is a first draft for adding the flexible character accuracy measure as suggested by @cneud in #32.

    C. Clausner, S. Pletschacher, A. Antonacopoulos, "Flexible character accuracy measure for reading-order-independent evaluation", Pattern Recognition Letters, Volume 131, March 2020, Pages 390-397

    There are still some open topics, so I opened this pull request as a draft so you can already comment on them.

    Handling of coefficients

    The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).

    | Coefficient | Min | Max | Step |
    |-------------|----:|----:|-----:|
    | minDist | 15 | 30 | 5 |
    | lengthDiff | 0 | 23 | 3 |
    | offset | 0 | 3 | 1 |
    | length | 0 | 5 | 1 |

    Should we make the coefficients configurable? If so, we might need a configuration file, because handling 12 additional parameters on the command line is quite messy.

    The runs for each set of coefficients are also a good place for parallelization. Should we include parallelization at this point, or is this a non-issue because the processors in OCR-D workflows are typically already busy doing other things?

    Penalty and distance functions

    The algorithm depends a lot on a penalty and a distance function. From a library point of view I would like them to be exchangeable with other functions, but from an ocrd-processor/CLI perspective this is not really useful.

    So should we make the distance and penalty functions configurable?

    Use of caching

    Because the algorithm regularly splits lines and repeats runs with different coefficients, it needs a lot of caching to be fast. The easiest way to do this with Python < 3.9 is the @lru_cache decorator.
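
    A minimal sketch of that caching pattern (the function name and its body are stand-ins for the real penalty/distance computation):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def line_penalty(gt_line: str, ocr_line: str) -> int:
        # Stand-in computation; the point is that repeated (gt_line, ocr_line)
        # pairs are answered from the cache instead of being recomputed.
        return sum(a != b for a, b in zip(gt_line, ocr_line)) + abs(len(gt_line) - len(ocr_line))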

    The performance could benefit from a custom-tailored cache, but that would also add more code we have to maintain.

    Performance

    ~~At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.~~

    opened by b2m 8
  • ADD lookup table for levenshtein matrix and tempcaching

    Hey,

    I've added a lookup table for the Levenshtein matrix calculation, which reduces the calculation time by 10-50%. The second part caches the matrix results temporarily, which skips two out of five Levenshtein matrix calculations (with default settings). Hope you like it.
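
    To illustrate the second part (a sketch, not the actual patch): temporarily caching whole distance matrices lets repeated comparisons of the same pair skip the full dynamic-programming computation:

    from functools import lru_cache

    @lru_cache(maxsize=8)  # temporary cache: keep only recently used matrices
    def levenshtein_matrix(s1: str, s2: str):
        """D[i][j] is the edit distance between s1[:i] and s2[:j]."""
        m, n = len(s1), len(s2)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,         # deletion
                              D[i][j - 1] + 1,         # insertion
                              D[i - 1][j - 1] + cost)  # substitution
        return D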

    opened by JKamlah 8
  • Skip when there is no file matching the pageId

    ocrd-dinglehopper should issue a warning and skip a page if there is no matching GT or OCR file for a page.

    Reported by @mnoelte in Gitter: https://gitter.im/OCR-D/Lobby?at=5f76f0750dbbcf3dfa50648f

    bug 
    opened by mikegerber 6
  • Honor TextEquiv index

    https://ocr-d.de/de/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextLineType.html#TextLineType_TextEquiv

    @JKamlah wrote:

    This is due to the work with LAREX. We did some corrections with LAREX on line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index: the original text got index 1 and the corrected text index 0; uncorrected lines got no index at all. I don't know if that is a LAREX-specific procedure(?). Link to the LAREX example.

    PAGE specs:

    Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

    (See https://github.com/qurator-spk/dinglehopper/issues/5#issuecomment-709986931)
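
    A sketch of honoring the index during extraction (the namespace URL and the handling of missing indices are assumptions, not the actual fix):

    from lxml import etree

    NS = {"page": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

    def main_text(line):
        """Return the text of the TextEquiv with the lowest @index (the main text)."""
        def sort_key(textequiv):
            index = textequiv.get("index")
            # Assumption: treat a missing @index as "last", so indexed text wins.
            return int(index) if index is not None else float("inf")
        textequiv = min(line.findall("page:TextEquiv", NS), key=sort_key)
        unicode_element = textequiv.find("page:Unicode", NS)
        return unicode_element.text or ""

    tree = etree.parse("OCR-D-GT-PAGE/00000024.page.xml")
    for line in tree.iter("{%s}TextLine" % NS["page"]):
        print(main_text(line))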

    bug 
    opened by mikegerber 6
  • replace usage of deprecated rapidfuzz APIs

    The string_metric module was deprecated in v2.0.0. This replaces its usage with the new implementation. In addition, this updates to the latest version of rapidfuzz, which significantly reduces the memory usage of editops and slightly improves performance.

    opened by maxbachmann 5
  • Proposal to introduce version pinning and license checking

    This pull request introduces:

    • version pinning via pip-tools for reproducible builds.
    • license checking via pip-licenses and CircleCI.

    New CI features:

    • Licenses are checked for new builds (new branch) and when .allowed-licenses or requirements.txt changes.
    • The list of allowed licenses is kept in a separate file (.allowed-licenses) to be able to distinguish between changes to the CI configuration/tools and changes to the license list. This also makes transitioning to other tools easier 😁.

    Pipelines showing different settings: https://app.circleci.com/pipelines/github/b2m/dinglehopper?branch=add-pip-licenses

    • trigger via change in requirements.txt
    • trigger via change in .allowed-licenses
    • not triggered by a change in .circleci/config.yml

    Lessons learned:

    • there are no triggers for changed files on CircleCI
    • version pinning and license checks need to be performed once per environment as we (could) have environment markers in requirement files
    • pip-licenses does not support whitelisting on Python 3.5
    • version pinning might get painful to maintain
    • we may use some dependency checker like Dependabot to avoid missing security updates because of version pinning.

    As we now have a working example we can discuss how to proceed with this requirement in #54.

    opened by b2m 5
  • getLogger Irritation with regular CLI

    Issue description

    Using a recent version (1778b3) of dinglehopper produces complaints from the OCR-D logger:

    dinglehopper 1300565-gt.xml 1300565.xml
    
    => 
    
    21:17:33.416 CRITICAL root - getLogger was called before initLogging. Source of the call:
    21:17:33.416 CRITICAL root -   File "/home/hartwig/Projekte/work/mlu/ulb/ulb-sachsen-anhalt-dinglehopper/qurator/dinglehopper/extracted_text.py", line 243, in get_first_textequiv
    21:17:33.416 CRITICAL root -     log = getLogger("processor.OcrdDinglehopperEvaluate")
    

    Even though all report files are generated, the output is somewhat irritating.

    Steps to reproduce the issue

    1. call dinglehopper 1300565-gt.xml 1300565.xml (attached)

    What's the expected result?

    • No logging error or no logging at all if no OCR-D is around

    Additional details

    The problem can be worked around by also using OCR-D's initLogging in the context of the non-OCR-D CLI, adding something like this in cli.py:

    initLogging()
    Config.progress = progress
    process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)
    

    Does dinglehopper want to stick with the OCR-D logger also in potential non-OCR-D contexts? Furthermore, it looks like dinglehopper is currently missing any dedicated logging configuration, which couples it rather strongly not only to the OCR-D logging logic but also to its configuration.

    1300565-test.zip

    bug 
    opened by M3ssman 5
  • only call `words_normalized` once

    words_normalized should only be called once, since it is quite slow; this has a large effect now that the string matching is faster. On my laptop this achieves the following performance improvement (a sketch of the pattern follows the timings). Before:

    [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    	0:15.89 real,	9.61 user,	7.14 sys,	92704 mmem
    

    After:

    [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    	0:12.56 real,	7.88 user,	5.56 sys,	92836 mmem
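
    The change boils down to this pattern (a sketch; words_normalized here is a simplified stand-in for the real, slow function):

    import unicodedata

    def words_normalized(text):
        # Stand-in for the real normalization of a text into words.
        return [unicodedata.normalize("NFC", word) for word in text.split()]

    gt_text, ocr_text = "vergaß", "vcrgab"  # illustrative inputs

    # Compute the normalized words once and reuse them everywhere, instead of
    # re-deriving them inside every metric call.
    gt_words = words_normalized(gt_text)
    ocr_words = words_normalized(ocr_text)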
    
    opened by maxbachmann 8
  • Improve visual alignment for longer documents

    @stweil asked in #62:

    Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

    enhancement 
    opened by mikegerber 1
  • Horrible failure with large documents

    @stweil reported in Gitter:

    Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but with an out-of-memory error: it was killed by the Linux kernel after using more than 60 GB RAM. :-(

    @cneud also submitted a large document (a newspaper page).

    • [ ] Investigate why the new version uses even more memory
    • [ ] Consider falling back to more efficient algorithms if necessary
    • [ ] Consider a regression test for this
    bug 
    opened by mikegerber 20
  • Improve performance when calculating sequence alignment

    Dinglehopper uses a custom Python implementation of the Levenshtein distance to calculate, score and show an alignment of two given texts.

    According to my performance analysis for #47, the distance and editops functions of this custom implementation are the main bottleneck when comparing particularly bad or big OCR results.

    In #48 I proposed to use the C-based python-Levenshtein as a replacement, which we discarded for the following reasons:

    1. No support for aligning sequences of words (see comment by @mikegerber).
    2. Currently no active maintenance.
    3. Viral license (GPL 2)

    One alternative and fast implementation for distance calculation is RapidFuzz, where @maxbachmann has already started to address the issue of distance calculation for arbitrary sequences in maxbachmann/RapidFuzz#100.

    At the moment RapidFuzz does not support the calculation of edit operations (see comment by @maxbachmann).

    bug 
    opened by b2m 13