dinglehopper

An OCR evaluation tool

Overview

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a word/character differences report.

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library (see the sketch below)
  • Unicode support
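
For library use, a minimal sketch (assuming the error-rate functions are importable from the top-level package; older releases exposed them as qurator.dinglehopper):

from dinglehopper import character_error_rate, word_error_rate

gt = "Über die vielen Sorgen wegen desselben vergaß"
ocr = "Übey die vielen Sorgen wegen deffelben vcrgab"

print(character_error_rate(gt, ocr))  # character error rate (CER)
print(word_error_rate(gt, ocr))       # word error rate (WER)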

Installation

It's best to use pip, e.g.:

sudo pip install .

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results.
  In that case, use --no-metrics to disable the then meaningless metrics and
  also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character
  error rate (CER) and the word error rate (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.

dinglehopper displaying metrics and character differences

dinglehopper-extract

The tool dinglehopper-extract prints the text of the given input file to stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

| Parameter | Meaning |
|-----------|---------|
| -P metrics false | Disable metrics and the green/red color scheme (default: enabled) |
| -P textequiv_level line | (PAGE) Extract text from the TextLine level (default: TextRegion level) |

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.

Comments
  • Support comparing line GT directories with line OCR directories

    In #62, @stweil's original problem was - as I understand it - to compare a directory of line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this: fake-line-gt.zip. It looks like this:

    % ls *
    gt:
    line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
    line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt
    
    some-ocr:
    line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
    line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt
    

    A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics & differences over all of the lines. A first idea for the CLI interface:

    dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt
    

    I'm not sure if this will be the final CLI interface, but it's what seems necessary at first glance. A possible pairing step is sketched below.
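
    A minimal sketch of that pairing step (the function name and behaviour are illustrative, not a final implementation; suffixes are assumed to be non-empty):

    from pathlib import Path

    def pair_line_files(gt_dir, gt_suffix, ocr_dir, ocr_suffix):
        """Yield (gt_file, ocr_file) pairs matched by their common filename stem."""
        gt = {p.name[:-len(gt_suffix)]: p for p in Path(gt_dir).glob("*" + gt_suffix)}
        ocr = {p.name[:-len(ocr_suffix)]: p for p in Path(ocr_dir).glob("*" + ocr_suffix)}
        for stem in sorted(gt.keys() & ocr.keys()):
            yield gt[stem], ocr[stem]

    # pair_line_files("gt", ".gt.txt", "some-ocr", ".some-ocr.txt") would yield
    # (gt/line001.gt.txt, some-ocr/line001.some-ocr.txt), etc. for the data above.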

    enhancement 
    opened by mikegerber 22
  • Switch from custom Levenshtein to python-Levenshtein

    As the distance and editops calculation is a performance bottleneck in this application, I replaced the custom Levenshtein implementation with the C implementation in the python-Levenshtein package.

    We now also have separate entrypoints for texts with and without Unicode normalization. For example, when calculating the flexible character accuracy in #47, the normalization can be done once, more efficiently, during preprocessing, as sketched below.
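
    A minimal sketch of that idea (the non-normalizing entrypoint named in the comment is hypothetical):

    import unicodedata

    gt_text = "vergaß"   # illustrative inputs
    ocr_text = "vcrgab"

    # Normalize once during preprocessing ...
    gt_norm = unicodedata.normalize("NFC", gt_text)
    ocr_norm = unicodedata.normalize("NFC", ocr_text)

    # ... then call the non-normalizing entrypoint as often as needed, e.g.:
    # distance_unnormalized(gt_norm, ocr_norm)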

    Here are some benchmarks for the behaviour before and after this change using data already available in the repository:

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:---|---:|---:|---:|---:|
    | before brochrnx_73075507X | 3.359 ± 0.039 | 3.305 | 3.408 | 8.29 ± 0.13 |
    | after brochrnx_73075507X | 0.405 ± 0.004 | 0.399 | 0.409 | 1.00 |
    | before actevedef_718448162 CALAMARI | 34.410 ± 0.561 | 33.918 | 35.362 | 84.91 ± 1.61 |
    | after actevedef_718448162 CALAMARI | 0.926 ± 0.010 | 0.911 | 0.935 | 2.28 ± 0.03 |
    | before actevedef_718448162 TESS | 34.103 ± 0.305 | 33.685 | 34.529 | 84.16 ± 1.11 |
    | after actevedef_718448162 TESS | 0.909 ± 0.008 | 0.899 | 0.921 | 2.24 ± 0.03 |

    Boxplot generated by hyperfine

    Note that with this change we can only compare documents with a total maximum of 1 114 111 unique characters (or words). The limit comes from Python's chr function, but I suspect this will not be an issue now or in the near future.
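
    For context: comparing word sequences with a character-based C routine presumably works by mapping each unique word to a unique codepoint, which is where the chr limit comes from. A sketch of that trick (an assumption about the internals):

    def encode(words, mapping):
        """Map each unique word to a unique single character via chr()."""
        for w in words:
            if w not in mapping:
                mapping[w] = chr(len(mapping))  # chr() accepts code points up to 1 114 111
        return "".join(mapping[w] for w in words)

    mapping = {}
    s1 = encode("the quick fox".split(), mapping)
    s2 = encode("the quick brown fox".split(), mapping)
    # s1 and s2 can now be aligned by any character-based Levenshtein implementation.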

    opened by b2m 20
  • Sort textlines with missing indices

    Python's sorted function fails with a TypeError when called with a mix of None and integers:

    >>> sorted([None, 1])
    TypeError: '<' not supported between instances of 'int' and 'NoneType'
    

    Therefore we use float('inf') instead of None in case of missing textline indices.
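
    A minimal sketch of the workaround (the line objects here are illustrative dicts):

    lines = [{"index": None, "text": "b"}, {"index": 0, "text": "a"}]

    # Substitute float('inf') for a missing index so sorted() has a total order;
    # lines without an index simply sort last.
    lines.sort(key=lambda line: line["index"] if line["index"] is not None else float("inf"))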

    opened by b2m 11
  • dinglehopper keeps hanging and test errors

    Running dinglehopper gt txt and dinglehopper-line-dirs keeps hanging without any message, and pytest returns errors:

    collected 62 items / 18 deselected / 44 selected                                                   
    
    qurator/dinglehopper/tests/extracted_text_test.py .............                              [ 29%]
    qurator/dinglehopper/tests/test_align.py .......F..                                          [ 52%]
    qurator/dinglehopper/tests/test_character_error_rate.py ..                                   [ 56%]
    qurator/dinglehopper/tests/test_edit_distance.py .                                           [ 59%]
    qurator/dinglehopper/tests/test_editops.py ..                                                [ 63%]
    qurator/dinglehopper/tests/test_ocr_files.py .............                                   [ 93%]
    qurator/dinglehopper/tests/test_word_error_rate.py ...                                       [100%]
    
    ============================================= FAILURES =============================================
    __________________________________ test_with_some_fake_ocr_errors __________________________________
    
        def test_with_some_fake_ocr_errors():
    >       result = list(
                align(
                    "Über die vielen Sorgen wegen desselben vergaß",
                    "SomeJunk MoreJunk Übey die vielen Sorgen wegen AdditionalJunk deffelben vcrgab",
                )
            )
    
    qurator/dinglehopper/tests/test_align.py:70: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    s1 = ['Ü', 'b', 'e', 'r', ' ', 'd', ...], s2 = ['S', 'o', 'm', 'e', 'J', 'u', ...]
    
        def seq_align(s1, s2):
            """Align general sequences."""
            s1 = list(s1)
            s2 = list(s2)
            ops = levenshtein_editops(s1, s2)
            i = 0
            j = 0
        
            while i < len(s1) or j < len(s2):
                o = None
                try:
                    ot = ops[0]
                    if ot[1] == i and ot[2] == j:
                        ops = ops[1:]
                        o = ot
                except IndexError:
                    pass
        
                if o:
                    if o[0] == "insert":
                        yield None, s2[j]
                        j += 1
                    elif o[0] == "delete":
                        yield s1[i], None
                        i += 1
                    elif o[0] == "replace":
                        yield s1[i], s2[j]
                        i += 1
                        j += 1
                else:
    >               yield s1[i], s2[j]
    E               IndexError: list index out of range
    
    qurator/dinglehopper/align.py:42: IndexError
    ===================================== short test summary info ======================================
    FAILED qurator/dinglehopper/tests/test_align.py::test_with_some_fake_ocr_errors - IndexError: lis...
    =========================== 1 failed, 43 passed, 18 deselected in 30.24s ===========================
    

    pytest also gets stuck with: qurator/dinglehopper/tests/test_integ_table_extraction.py ..... [ 83%] qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py ..

    Python version: 3.9.0. Thanks.

    bug 
    opened by whisere 10
  • Add flexible character accuracy

    This is a first draft for adding the flexible character accuracy measure as suggested by @cneud in #32.

    C. Clausner, S. Pletschacher, A. Antonacopoulos, "Flexible character accuracy measure for reading-order-independent evaluation", Pattern Recognition Letters, Volume 131, March 2020, Pages 390-397

    There are still some open topics, so I opened this pull request as a draft so you can already comment on them.

    Handling of coefficients

    The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).

    | Coefficient | Min | Max | Step |
    |-------------|----:|----:|-----:|
    | minDist | 15 | 30 | 5 |
    | lengthDiff | 0 | 23 | 3 |
    | offset | 0 | 3 | 1 |
    | length | 0 | 5 | 1 |

    Should we make the coefficients configurable? If so, we might need a configuration file, because handling 12 additional parameters on the command line is quite messy.

    The runs for each set of coefficients are also a good place for parallelization. Should we include parallelization at this point, or is this a non-issue because the processors in OCR-D workflows are typically already busy doing other things?

    Penalty and distance functions

    The algorithm depends a lot on a penalty and a distance function. From a library point of view I would like them to be exchangeable with other functions, but from an ocrd-processor/CLI perspective this is not really useful.

    So should we make the distance and penalty functions configurable?

    Use of caching

    Because the algorithm regularly splits lines and repeats runs with different coefficients, it needs a lot of caching to be fast. The easiest way to do this with Python < 3.9 is the @lru_cache decorator.
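
    A minimal sketch of that caching pattern (the function name and its body are stand-ins for the real penalty/distance computation):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def line_penalty(gt_line: str, ocr_line: str) -> int:
        # Stand-in computation; the point is that repeated (gt_line, ocr_line)
        # pairs are answered from the cache instead of being recomputed.
        return sum(a != b for a, b in zip(gt_line, ocr_line)) + abs(len(gt_line) - len(ocr_line))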

    The performance could benefit from a custom-tailored cache, but that would also add more code we have to maintain.

    Performance

    ~~At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.~~

    opened by b2m 8
  • ADD lookup table for levenshtein matrix and tempcaching

    Hey,

    I've added a lookup table for the Levenshtein matrix calculation, which reduces the calculation time by 10-50%. The second part caches the matrix results temporarily, which skips two out of five Levenshtein matrix calculations (with default settings). Hope you like it.
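
    To illustrate the second part (a sketch, not the actual patch): temporarily caching whole distance matrices lets repeated comparisons of the same pair skip the full dynamic-programming computation:

    from functools import lru_cache

    @lru_cache(maxsize=8)  # temporary cache: keep only recently used matrices
    def levenshtein_matrix(s1: str, s2: str):
        """D[i][j] is the edit distance between s1[:i] and s2[:j]."""
        m, n = len(s1), len(s2)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,         # deletion
                              D[i][j - 1] + 1,         # insertion
                              D[i - 1][j - 1] + cost)  # substitution
        return D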

    opened by JKamlah 8
  • Skip when there is no file matching the pageId

    ocrd-dinglehopper should issue a warning and skip a page if there is no matching GT or OCR file for a page.

    Reported by @mnoelte in Gitter: https://gitter.im/OCR-D/Lobby?at=5f76f0750dbbcf3dfa50648f

    bug 
    opened by mikegerber 6
  • Honor TextEquiv index

    https://ocr-d.de/de/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextLineType.html#TextLineType_TextEquiv

    @JKamlah wrote:

    This is due to the work with LAREX. We did some corrections with LAREX on line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index: the original text got index 1 and the corrected text index 0; uncorrected lines got no index at all. I don't know if that is a LAREX-specific procedure(?). Link to the LAREX example.

    PAGE specs:

    Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

    (See https://github.com/qurator-spk/dinglehopper/issues/5#issuecomment-709986931)
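
    A sketch of honoring the index during extraction (the namespace URL and the handling of missing indices are assumptions, not the actual fix):

    from lxml import etree

    NS = {"page": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

    def main_text(line):
        """Return the text of the TextEquiv with the lowest @index (the main text)."""
        def sort_key(textequiv):
            index = textequiv.get("index")
            # Assumption: treat a missing @index as "last", so indexed text wins.
            return int(index) if index is not None else float("inf")
        textequiv = min(line.findall("page:TextEquiv", NS), key=sort_key)
        unicode_element = textequiv.find("page:Unicode", NS)
        return unicode_element.text or ""

    tree = etree.parse("OCR-D-GT-PAGE/00000024.page.xml")
    for line in tree.iter("{%s}TextLine" % NS["page"]):
        print(main_text(line))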

    bug 
    opened by mikegerber 6
  • replace usage of deprecated rapidfuzz APIs

    The string_metric module was deprecated in v2.0.0. This replaces its usage with the new implementation. In addition, this updates to the latest version of rapidfuzz, which significantly reduces the memory usage of editops and slightly improves performance.

    opened by maxbachmann 5
  • Proposal to introduce version pinning and license checking

    This pull request introduces:

    • version pinning via pip-tools for reproducible builds.
    • license checking via pip-licenses and CircleCI.

    New CI features:

    • Licenses are checked for new builds (new branch) and when .allowed-licenses or requirements.txt changes.
    • The list of allowed licenses is kept in a separate file (.allowed-licenses) to be able to distinguish between changes to the CI configuration/tools and changes to the license list. This also makes transitioning to other tools easier 😁.

    Pipelines showing different settings: https://app.circleci.com/pipelines/github/b2m/dinglehopper?branch=add-pip-licenses

    • trigger via change in requirements.txt
    • trigger via change in .allowed-licenses
    • not triggered by a change in .circleci/config.yml

    Lessons learned:

    • there are no triggers for changed files on CircleCI
    • version pinning and license checks need to be performed once per environment as we (could) have environment markers in requirement files
    • pip-licenses does not support whitelisting on Python 3.5
    • version pinning might get painful to maintain
    • we may use some dependency checker like Dependabot to avoid missing security updates because of version pinning.

    As we now have a working example we can discuss how to proceed with this requirement in #54.

    opened by b2m 5
  • getLogger Irritation with regular CLI

    Issue description

    Using a recent version (1778b3) of dinglehopper produces complaints from the OCR-D logger:

    dinglehopper 1300565-gt.xml 1300565.xml
    
    => 
    
    21:17:33.416 CRITICAL root - getLogger was called before initLogging. Source of the call:
    21:17:33.416 CRITICAL root -   File "/home/hartwig/Projekte/work/mlu/ulb/ulb-sachsen-anhalt-dinglehopper/qurator/dinglehopper/extracted_text.py", line 243, in get_first_textequiv
    21:17:33.416 CRITICAL root -     log = getLogger("processor.OcrdDinglehopperEvaluate")
    

    Even though all report files are generated, the output is somewhat irritating.

    Steps to reproduce the issue

    1. call dinglehopper 1300565-gt.xml 1300565.xml (attached)

    What's the expected result?

    • No logging error or no logging at all if no OCR-D is around

    Additional details

    The problem can be worked around by also using OCR-D's initLogging in the context of the non-OCR-D CLI, adding something like this in cli.py:

    initLogging()
    Config.progress = progress
    process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)
    

    Does dinglehopper want to stick with the OCR-D logger also in potential non-OCR-D contexts? Furthermore, it looks like dinglehopper is currently missing any dedicated logging configuration, which couples it rather strongly not only to the OCR-D logging logic but also to its configuration.

    1300565-test.zip

    bug 
    opened by M3ssman 5
  • only call `words_normalized` once

    words_normalized should only be called once, since it is quite slow; this has a large effect now that the string matching is faster. On my laptop this achieves the following performance improvement (a sketch of the pattern follows the timings). Before:

    [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    	0:15.89 real,	9.61 user,	7.14 sys,	92704 mmem
    

    After:

    [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    	0:12.56 real,	7.88 user,	5.56 sys,	92836 mmem
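
    The change boils down to this pattern (a sketch; words_normalized here is a simplified stand-in for the real, slow function):

    import unicodedata

    def words_normalized(text):
        # Stand-in for the real normalization of a text into words.
        return [unicodedata.normalize("NFC", word) for word in text.split()]

    gt_text, ocr_text = "vergaß", "vcrgab"  # illustrative inputs

    # Compute the normalized words once and reuse them everywhere, instead of
    # re-deriving them inside every metric call.
    gt_words = words_normalized(gt_text)
    ocr_words = words_normalized(ocr_text)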
    
    opened by maxbachmann 8
  • Improve visual alignment for longer documents

    @stweil asked in #62:

    Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

    enhancement 
    opened by mikegerber 1
  • Horrible failure with large documents

    @stweil reported in Gitter:

    Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but with an out-of-memory error: it was killed by the Linux kernel after using more than 60 GB RAM. :-(

    @cneud also submitted a large document (a newspaper page).

    • [ ] Investigate why the new version uses even more memory
    • [ ] Consider falling back to more efficient algorithms if necessary
    • [ ] Consider a regression test for this
    bug 
    opened by mikegerber 20
  • Improve performance when calculating sequence alignment

    Dinglehopper uses a custom Python implementation of the Levenshtein distance to calculate, score and show an alignment of two given texts.

    According to my performance analysis for #47, the distance and editops functions of this custom implementation are the main bottleneck when comparing particularly bad or big OCR results.

    In #48 I proposed to use the C-based python-Levenshtein as a replacement, which we discarded for the following reasons:

    1. No support for aligning sequences of words (see comment by @mikegerber).
    2. Currently no active maintenance.
    3. Viral license (GPL 2)

    One alternative and fast implementation for distance calculation is RapidFuzz, where @maxbachmann has already started to address the issue of distance calculation for arbitrary sequences in maxbachmann/RapidFuzz#100.

    At the moment RapidFuzz does not support the calculation of edit operations (see comment by @maxbachmann).

    bug 
    opened by b2m 13