Run tesseract with the tesserocr bindings with @OCR-D's interfaces

Overview

ocrd_tesserocr

Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr

Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)

It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.

Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
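
As a rough illustration of how these annotations appear in a PAGE file, the following sketch reads a hand-written sample with Python's standard library. The sample document, the helper names and the namespace version (2019-07-15) are assumptions made for this example; real processor output carries more detail.

```python
import xml.etree.ElementTree as ET

# PAGE namespace, assuming the 2019-07-15 schema version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample, not the output of an actual processor run.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page.tif" imageWidth="100" imageHeight="100" orientation="0.5">
    <pc:Border><pc:Coords points="5,5 95,5 95,95 5,95"/></pc:Border>
    <pc:TextRegion id="r1" primaryScript="Latn">
      <pc:Coords points="10,10 90,10 90,40 10,40"/>
      <pc:TextLine id="r1_l1">
        <pc:Coords points="10,10 90,10 90,25 10,25"/>
        <pc:TextEquiv><pc:Unicode>Hello</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>"""


def parse_points(points):
    """Turn a Coords/@points string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def line_texts(root):
    """Map each TextLine id to the text of its first TextEquiv/Unicode."""
    return {
        line.get("id"): line.findtext("pc:TextEquiv/pc:Unicode", namespaces=NS)
        for line in root.iterfind(".//pc:TextLine", NS)
    }


root = ET.fromstring(SAMPLE)
page = root.find("pc:Page", NS)
border = parse_points(page.find("pc:Border/pc:Coords", NS).get("points"))
texts = line_texts(root)
```

Here `border` holds the cropping outline, `page.get("orientation")` the deskewing angle, and `texts` the recognized text per line.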

Installation

Required Ubuntu packages:

  • Tesseract headers (libtesseract-dev)
  • Some Tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...}), or better yet, custom-trained models
  • Leptonica headers (libleptonica-dev)

From PyPI

This is the best option if you want to use the stable, released version.


NOTE

ocrd_tesserocr requires Tesseract >= 4.1.0. The Tesseract packages bundled with Ubuntu < 19.10 are too old. If you are on Ubuntu 18.04 LTS, please use Alexander Pozdnyakov's PPA repository, which has up-to-date builds of Tesseract and its dependencies:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update

sudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget
pip install ocrd_tesserocr
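
To check the version requirement before installing, one could compare the first line printed by tesseract --version against 4.1.0. The sketch below is one possible way to do that; the function names are made up for this example, and pre-release suffixes such as "-beta.4" or "-rc1" are simply ignored.

```python
import re

MINIMUM = (4, 1, 0)  # minimum Tesseract version stated in the note above


def parse_tesseract_version(banner):
    """Extract (major, minor, patch) from the first line of `tesseract --version`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("unrecognized version banner: %r" % banner)
    return tuple(int(part) for part in match.groups())


def is_supported(banner):
    """Return True if the reported version is at least MINIMUM.

    Pre-release suffixes are ignored, so a 4.1.0 release candidate
    would be treated like 4.1.0 itself.
    """
    return parse_tesseract_version(banner) >= MINIMUM
```

For example, is_supported("tesseract 4.0.0-beta.4-26-gfd49") is False, while is_supported("tesseract 4.1.1") is True.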

With docker

This is the best option if you want to run the software in a container.

You need to have Docker installed.

docker pull ocrd/tesserocr

To run with docker:

docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...

From git

This is the best option if you want to change the source code or install the latest, unpublished changes.

We strongly recommend using venv.

git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
sudo make deps-ubuntu # or manually with apt-get
make deps        # or pip install -r requirements.txt
make install     # or pip install .

Usage

For details, see docstrings in the individual processors and ocrd-tool.json descriptions, or simply --help.

Available OCR-D processors are:

  • ocrd-tesserocr-crop (simplistic)
    • sets Border of pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
    • sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-binarize (Otsu – not recommended)
    • adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation (optionally)
    • adds TextRegions to TableRegions and sets their @orientation (optionally)
    • adds TextLines to TextRegions (optionally)
    • adds Words to TextLines (optionally)
    • adds Glyphs to Words (optionally)
    • adds TextEquiv
  • ocrd-tesserocr-segment (all-in-one segmentation – recommended; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
    • adds TextRegions to TableRegions and sets their @orientation
    • adds TextLines to TextRegions
    • adds Words to TextLines
    • adds Glyphs to Words
  • ocrd-tesserocr-segment-region (only regions – with overlapping bboxes; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
  • ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
    • adds TextRegions to TableRegions
  • ocrd-tesserocr-segment-line (only lines – from overlapping regions; delegates to recognize)
    • adds TextLines to TextRegions
  • ocrd-tesserocr-segment-word (only words; delegates to recognize)
    • adds Words to TextLines
  • ocrd-tesserocr-fontshape (only text style – via Tesseract 3 models)
    • adds TextStyle to Words

The text region @types detected are (from Tesseract's PolyBlockType):

  • paragraph: normal block (aligned with others in the column)
  • floating: unaligned block (is in a cross-column pull-out region)
  • heading: block that spans more than one column
  • caption: block for text that belongs to an image

If you are unhappy with these choices, consider post-processing with a dedicated custom processor in Python, or by modifying the PAGE files directly (e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
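
A Python equivalent of the xmlstarlet call could look roughly like the sketch below. It uses only the standard library; the function name is hypothetical and the namespace assumes PAGE version 2019-07-15, so adjust it to the version found in your files.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace (2019-07-15); older files use a different date.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def retype_regions(xml_text, old="floating", new="paragraph"):
    """Return (new_xml_text, count) with TextRegion/@type changed from old to new."""
    ET.register_namespace("pc", PAGE_NS)  # keep the pc: prefix on output
    root = ET.fromstring(xml_text)
    changed = 0
    for region in root.iter("{%s}TextRegion" % PAGE_NS):
        if region.get("type") == old:
            region.set("type", new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

Reading each file of the fileGrp, passing its content through retype_regions and writing the result back has the same effect as the --inplace edit.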

All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:

  • after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization, or ocrd-cis-ocropy-clip on the line level
  • after region segmentation: use ocrd-segment-repair with plausibilize (and sanitize after line segmentation)
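
To judge whether overlap is a problem on a given page, it can help to quantify it. The following sketch computes how much two bounding boxes overlap; the (x0, y0, x1, y1) box representation and the function names are assumptions for illustration only.

```python
def bbox_intersection_area(a, b):
    """Area shared by two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def overlap_ratio(a, b):
    """Shared area relative to the smaller of the two boxes (0.0 to 1.0)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    if smaller <= 0:
        return 0.0
    return bbox_intersection_area(a, b) / smaller
```

Two neighbouring text lines of height 20 whose boxes share a strip of height 5 yield a ratio of 0.25, for instance.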

It also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,

  • prefer ocrd-tesserocr-recognize with segmentation_level=region over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize, if you want to do all in one with Tesseract,
  • prefer ocrd-tesserocr-recognize with segmentation_level=line over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize, if you want to do everything but region segmentation with Tesseract,
  • prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region followed by (ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line, if you want to do everything but recognition with Tesseract.
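
The first two recommendations can be expressed as a single processor call. The sketch below only assembles the command line; the fileGrp names, the model and the default textequiv_level are placeholders chosen for this example, not recommendations.

```python
import shlex


def recognize_command(input_grp, output_grp, segmentation_level, model,
                      textequiv_level="word"):
    """Build an ocrd-tesserocr-recognize call that also segments.

    segmentation_level is the highest level Tesseract should segment
    itself ("region" or "line" in the recommendations above).
    """
    return [
        "ocrd-tesserocr-recognize",
        "-I", input_grp,
        "-O", output_grp,
        "-P", "segmentation_level", segmentation_level,
        "-P", "textequiv_level", textequiv_level,
        "-P", "model", model,
    ]


def as_shell(args):
    """Render an argument list as a copy-and-paste shell command."""
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical fileGrp names and model.
all_in_one = recognize_command("OCR-D-BIN", "OCR-D-OCR", "region", "deu")
```

The list can be passed to subprocess.run as is, or printed with as_shell for use in a workflow script.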

However, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize with shrink_polygons=True to get polygons by post-processing each segment, shrinking to the convex hull of all its symbol outlines.
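
To illustrate the idea behind shrink_polygons, here is a plain-Python sketch that shrinks a segment to the convex hull of its symbols. It approximates each symbol outline by its bounding box and uses Andrew's monotone chain algorithm; this is a simplification for illustration, not the processor's actual code.

```python
def convex_hull(points):
    """Convex hull of 2D points (Andrew's monotone chain), counter-clockwise
    in a y-up coordinate system, without repeating the first point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def shrink_to_symbols(symbol_boxes):
    """Hull around all symbol boxes, each given as (x0, y0, x1, y1)."""
    corners = []
    for x0, y0, x1, y1 in symbol_boxes:
        corners += [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return convex_hull(corners)
```

The resulting point list can be serialized as "x,y x,y ..." into Coords/@points in place of the rectangular outline.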

Testing

make test

This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).

Comments
  • Add fontshape processor and all-in-one segmentation

    We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

    opened by bertsky 58
  • Memory leaks

    The memory usage of ocrd-tesserocr-segment-region increases for each page, resulting in a total of about 7 GB for 200 pages, 8 GB for 248 pages, 10 GB for 282 pages, 11 GB for 313 pages (observed for http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-22977).

    ocrd-tesserocr-segment-line shows a similar effect.

    For that book, a machine with 8 GB RAM would have started swapping, thus slowing down the process extremely. Even a large server would get memory problems when processing large books with more than 1000 pages in parallel.

    opened by stweil 31
  • improve segmentation

    This fixes #101 (using raw_lines by default for textline images, but there are still some corner cases that need to be fixed in Tesseract) and brings a number of segmentation-related improvements:

    • interpret overwrite_regions more consistently
    • annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks
    • no separators and noise regions in reading order
    • segment tables into cells and lines so they can be OCRed, too
    opened by bertsky 28
  • Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid)

    21:19:10.443 INFO processor.TesserocrSegmentLine - INPUT FILE 65 / phys396119
    21:19:10.577 INFO processor.TesserocrSegmentLine - Page 'phys396119' images will use DPI estimated from segmentation
    21:19:10.850 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 0 107 at 0 107
    Traceback (most recent call last):
      File "/home/stweil/src/github/OCR-D/venv-20200904/bin/ocrd-tesserocr-segment-line", line 8, in <module>
        sys.exit(ocrd_tesserocr_segment_line())
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
        return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
        run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
        processor.process()
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_line.py", line 119, in process
        interline = line_poly.intersection(region_poly)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/geometry/base.py", line 676, in intersection
        return geom_factory(self.impl['intersection'](self, other))
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
        self._check_topology(err, this, other)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
        self.fn.__name__, repr(geom)))
    shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f89253f7c88>
    
    opened by stweil 23
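
The traceback above fails inside the polygon intersection because one outline intersects itself. One defensive option is to test an outline before intersecting it and to skip or repair invalid ones. The sketch below is a plain-Python check for edges that properly cross each other; it is an illustration only and does not catch every kind of invalid geometry (for example duplicate points or edges that merely touch).

```python
def _orient(p, q, r):
    """Sign of the turn p -> q -> r: 1 left, -1 right, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def _segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def has_self_crossing(points):
    """True if any two non-adjacent edges of the closed outline cross."""
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge are neighbours
            if _segments_cross(edges[i][0], edges[i][1], edges[j][0], edges[j][1]):
                return True
    return False
```

The check is quadratic in the number of points, which is acceptable for line and region outlines of typical size.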
  • Integration with OCR-D/spec#169 (resource manager)

    This is just a proof-of-concept that it is possible to load tesseract models installed with ocrd resmgr into the cache directory.

    The tricky part here is that there is only one TESSDATA_PREFIX but potentially multiple directories with models. So while it is no problem to look up models in various folders, only one of them can be used as the TESSDATA_PREFIX. Suggestions for a reasonable resolution to this dilemma are welcome.

    opened by kba 22
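
One conceivable way around the single TESSDATA_PREFIX described in the issue above is to pick the prefix per model: search the candidate directories in order and point TESSDATA_PREFIX at the first one that contains the requested model. The sketch below shows that idea with throwaway directories; the function names are hypothetical, and it does not solve the case of a combined model whose parts live in different directories.

```python
import os
import tempfile


def find_model_dir(model, search_dirs):
    """Return the first directory containing <model>.traineddata, else None."""
    for directory in search_dirs:
        if os.path.isfile(os.path.join(directory, model + ".traineddata")):
            return directory
    return None


def tesseract_env(model, search_dirs, base_env=None):
    """Copy of the environment with TESSDATA_PREFIX set for this model."""
    directory = find_model_dir(model, search_dirs)
    if directory is None:
        raise FileNotFoundError("no %s.traineddata in %s" % (model, search_dirs))
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = directory
    return env


# Demo: two temporary directories stand in for a system-wide and a
# user-level model directory; only the latter contains the model.
with tempfile.TemporaryDirectory() as system_dir, \
        tempfile.TemporaryDirectory() as user_dir:
    open(os.path.join(user_dir, "frk.traineddata"), "wb").close()
    env = tesseract_env("frk", [system_dir, user_dir], base_env={})
    found_in_user_dir = env["TESSDATA_PREFIX"] == user_dir
    missing = find_model_dir("deu", [system_dir, user_dir])
```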
  • support more textequiv levels

    This is an attempt to implement the other annotation levels. In my opinion, the behaviour for the different levels cannot be made completely analogous with Tesseract: simply pointing it to rectangles for words and glyphs (from an external layout segmentation) would produce results of far worse quality than always recognizing one complete line and allowing its own segmentation below it (accessible by iterators). In contrast, from the line level upwards we can reliably use its respective page segmentation mode (SINGLE_LINE / SINGLE_BLOCK / AUTO). Perhaps warnings and exceptions should be dealt with in a different, more systematic way though.

    opened by bertsky 18
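
The strategy described in the issue above can be summarized as a small decision table. The sketch below reads the three named page segmentation modes as belonging to the line, region and page level respectively; that assignment and the function name are assumptions of this example.

```python
# Page segmentation mode per annotation level, from the line level upwards.
PSM_FOR_LEVEL = {
    "page": "AUTO",
    "region": "SINGLE_BLOCK",
    "line": "SINGLE_LINE",
}


def plan_recognition(textequiv_level):
    """Return (level to recognize, PSM name, use iterators below that level)."""
    if textequiv_level in PSM_FOR_LEVEL:
        return textequiv_level, PSM_FOR_LEVEL[textequiv_level], False
    if textequiv_level in ("word", "glyph"):
        # Recognize the complete line and read words or glyphs from
        # Tesseract's own segmentation instead of external rectangles.
        return "line", PSM_FOR_LEVEL["line"], True
    raise ValueError("unknown textequiv_level: %r" % textequiv_level)
```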
  • move to AlternativeImage feature selectors in OCR-D/core#294:

    • all: use second output position as fileGrp USE to produce AlternativeImage
    • all: rid of MetadataItem/Labels-related FIXME: with the updated PAGE model, we can now use @externalModel and @externalId
    • all: use OcrdExif.resolution instead of xResolution
    • all: create images with monotonically growing @comments (features)
    • crop: use ocrd_utils.crop_image instead of PIL.Image.crop
    • crop: fix bug when trying to access page_image if there are already region coordinates that we are ignoring
    • crop: filter images already deskewed and cropped! (we must crop ourselves, and deskewing can not happen until afterwards)
    • deskew: fix bugs in configuration-dependent corner cases related to whether deskewing has already been applied (on the page or region level):
      • for the page image, never use images already rotated (both for page level and region level processing, but for the region level, do rotate images ad hoc if @orientation is present on the page level)
      • for the region image, never use images already rotated (except for our own page-level rotation)
    • segment-region: forgot to add feature "cropped" when producing cropped images
    bug enhancement 
    opened by bertsky 16
  • pip install ocrd_tesserocr fails with tesseract  version 4.0.0-beta-26-gfd49

    I use pip install ocrd_tesserocr to install ocrd_tesserocr into my virtualenv environment. The installation fails with:

    ...
      Running setup.py bdist_wheel for tesserocr ... error
      Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-q7mozwr8 --python-tag cp37:
      Supporting tesseract v4.0.0
      Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
      running bdist_wheel
      running build
      running build_ext
      building 'tesserocr' extension
      creating build
      creating build/temp.linux-x86_64-3.7
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
      tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
      tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
         __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~
      error: command 'gcc' failed with exit status 1
    
      ----------------------------------------
      Failed building wheel for tesserocr
      Running setup.py clean for tesserocr
    Failed to build tesserocr
    Installing collected packages: tesserocr, ocrd-tesserocr
      Running setup.py install for tesserocr ... error
        Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-fc87h61b/install-record.txt --single-version-externally-managed --compile --install-headers /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/include/site/python3.7/tesserocr:
        Supporting tesseract v4.0.0
        Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
        running install
        running build
        running build_ext
        building 'tesserocr' extension
        creating build
        creating build/temp.linux-x86_64-3.7
        gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
        tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
        tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
           __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                                   ^~~~~~~~~~~~~~~~~~~~~~~~
        error: command 'gcc' failed with exit status 1
    ...
    

    tesseract is installed on the system:

    tesseract 4.0.0-beta.4-26-gfd49
     leptonica-1.77.0
      libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.1) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1
     Found AVX
     Found SSE
    
    opened by finkf 16
  • Superfluous newlines

    At the moment, superfluous newlines are appended to the TextEquiv/Unicode entries:

                        <pc:TextEquiv>
                            <pc:Unicode>Groſzmaͤchtigſter</pc:Unicode>
                        </pc:TextEquiv>
                        <pc:TextEquiv>
                            <pc:Unicode>stzmächtigstcr
    </pc:Unicode>
    
    opened by finkf 16
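
Until such newlines are fixed at the source, affected PAGE files could be cleaned up afterwards. The following sketch strips trailing line breaks from all Unicode elements using the standard library; the function name and the namespace version (2019-07-15) are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace (2019-07-15); adjust to the version of your files.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def strip_trailing_newlines(root):
    """Remove trailing line breaks from every Unicode element in place.

    Returns the number of elements that were changed.
    """
    fixed = 0
    for unicode_el in root.iter("{%s}Unicode" % PAGE_NS):
        text = unicode_el.text or ""
        cleaned = text.rstrip("\r\n")
        if cleaned != text:
            unicode_el.text = cleaned
            fixed += 1
    return fixed
```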
  • Make it clearer which Tesseract engine is being used

    Since Tesseract 4, two OCR engines are available: rule-based (i.e. --oem 0) and LSTM (--oem 1). The command line also exposes an ensemble of the two OCR engines (--oem 2). The documentation for ocrd-tesserocr-recognize does not make it clear which engine is used, and using any of the following parameters seems to have no effect on the recognition results:

    • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "0" }'
    • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "1" }'
    • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "2" }'

    Which one of the OCR engines are we currently using?

    opened by Witiko 12
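
For reference, the numeric --oem values from the issue above correspond to the members of Tesseract's OcrEngineMode enumeration, which tesserocr exposes under the same names in its OEM class. The helper below is a hypothetical convenience for translating between the two; whether a given processor parameter accepts these names should be checked against its --help output.

```python
# Numeric engine modes and the matching OcrEngineMode member names.
OEM_NAMES = {
    0: "TESSERACT_ONLY",           # rule-based (legacy) engine
    1: "LSTM_ONLY",                # neural network engine
    2: "TESSERACT_LSTM_COMBINED",  # ensemble of both engines
    3: "DEFAULT",                  # whatever the loaded model supports
}


def describe_oem(value):
    """Translate a numeric engine mode (int or numeric string) to its name."""
    number = int(value)
    if number not in OEM_NAMES:
        raise ValueError("unknown OCR engine mode: %r" % value)
    return OEM_NAMES[number]
```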
  • ocrd-tesserocr-segment: segmentation fault

    And with this image:

    https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd2_-_1281.tif

    and ocrd.sif (singularity container) created from docker ocrd_all at Nov 9 10:13 2021 & at Jan 17 15:11 2022 [UPDATE]

    and this workflow:

    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace init >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif >>ocrd.log 2>&1 || exit
    
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models_experimental/historical_french_2020-10-14/*.ckpt.json" >>ocrd.log 2>&1 || exit
    

    I get a segmentation fault:

    Core was generated by `/usr/bin/python3 /usr/bin/ocrd-tesserocr-segment -P find_tables false -P shrink'.
    Program terminated with signal 11, Segmentation fault.
    
    opened by jbarth-ubhd 11
  • reverse order of glyphs inside words in PAGE-File for RTL languages

    When using, for example, an Arabic model, recognition works fine, but the words inside the generated PAGE-XML contain reversed letters. The sequence of the words themselves is correct. Here is an example of a generated word with the wrong sequence of letters:

                   <pc:Word id="region0001_line0001_word0000">
                        <pc:Coords points="1620,372 1620,402 1703,402 1703,375 1647,376"/>
                        <pc:TextEquiv conf="0.877831573486328">
                            <pc:Unicode>رصم</pc:Unicode>
                        </pc:TextEquiv>
                    </pc:Word>
    

    but the line containing the recognized word should look like this:

                            <pc:Unicode>مصر</pc:Unicode>
    

    (I know it is not easy to see that it is reversed, because letters in Arabic change appearance depending on their position inside the word, but this is handled by the font.)

    Here is the equivalent portion of the image: the word Msr

    REMARK: when using tesseract as standalone and generating alto, the sequence is correct!

    opened by MihoMahi 3
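
The reversal reported in the issue above can be reproduced and undone on the string level. The sketch below reverses a word only if its first strongly directional character is right-to-left; it is a naive workaround for illustration (the function names are made up) and would misplace combining marks or mixed-direction content.

```python
import unicodedata


def is_rtl(text):
    """True if the first strongly directional character is right-to-left."""
    for char in text:
        direction = unicodedata.bidirectional(char)
        if direction in ("R", "AL"):
            return True
        if direction == "L":
            return False
    return False


def restore_logical_order(word):
    """Reverse a word stored in visual order if it is right-to-left."""
    return word[::-1] if is_rtl(word) else word


stored = "\u0631\u0635\u0645"    # character order found in the PAGE file
expected = "\u0645\u0635\u0631"  # logical order of the same word
```

With these values, restore_logical_order(stored) equals expected, while left-to-right words pass through unchanged.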
  • montfaucon1719bd2_1, page 210, ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true

    this image

    https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.210.tif

    UPDATE same for https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.168a_Planche_72.tif

    with this workflow (latest ocrd_all as of 2021-12-01)

    ocrd workspace init 
    ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif 
    
    ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 
    ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 
    ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 
    ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 
    ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 
    ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd/_models/ocrd-calamari-recognize/c1_latin-script-hist-3/*.ckpt.json" 
    

    leads to these error messages:

    10:06:58.121 INFO processor.TesserocrSegment - INPUT FILE 0 / P_00001
    10:06:59.193 INFO processor.TesserocrSegment - Page 'P_00001' images will use 333 DPI from image meta-data
    10:06:59.193 INFO processor.TesserocrSegment - Processing page 'P_00001'
    10:07:00.229 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-005_00001.IMG-BIN, file_grp: OCR-D-005, path: OCR-D-005/OCR-D-005_00001.IMG-BIN.png
    /build/ocrd_tesserocr/ocrd_tesserocr/recognize.py:510: ShapelyDeprecationWarning: The proxy geometries (through the 'asShape()', 'asPolygon()' or 'PolygonAdapter()' constructors) are deprecated and will be removed in Shapely 2.0. Use the 'shape()' function or the standard 'Polygon()' constructor instead.
      for symbol in iterate_level(it, RIL.SYMBOL, parent=RIL.BLOCK)])
    Exception ignored in: <bound method BaseGeometry.__del__ of <shapely.geometry.polygon.PolygonAdapter object at 0x7fc431060358>>
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 209, in __del__
        self._empty(val=None)
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 199, in _empty
        self._is_empty = True
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/proxy.py", line 44, in __setattr__
        object.__setattr__(self, name, value)
    AttributeError: can't set attribute
    10:07:00.930 INFO processor.TesserocrSegment - Detected region 'region0000': 2867,801 2418,798 
    1883,799 1527,803 1527,803 1184,824 1184,824 1183,824 1183,824 1183,824 1183,824 1183,824 1183,825 
    1181,827 1180,827 1180,827 1180,827 1180,827 1180,827 1180,828 1180,828 1180,828 1180,838 1172,2362 
    1171,3063 1175,3451 1175,3451 1175,3451 1175,3452 1175,3452 1175,3452 1175,3452 1175,3452 1176,3452 
    1176,3453 1176,3453 1176,3453 1176,3453 1176,3453 1177,3453 1260,3474 1260,3474 1260,3474 1304,3474 
    1945,3458 1945,3458 3324,3389 3324,3389 3325,3389 3348,3382 3348,3382 3348,3382 3348,3382 3348,3382 
    3348,3381 3349,3381 3349,3381 3349,3381 3349,3381 3349,3381 3349,3380 3349,3380 3349,3380 3387,1134 
    3388,1069 3388,1069 3377,954 3377,954 3377,953 3377,953 3377,953 3377,953 3354,913 3354,913 
    3353,913 3353,912 3353,912 3353,912 3353,912 3130,804 3130,804 3129,804 3129,804 3129,804 
    (FLOWING_TEXT)
    ...
    ...
    ...
    Exception ignored in: <bound method BaseGeometry.__del__ of <shapely.geometry.polygon.PolygonAdapter object at 0x7fc40f820710>>
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 209, in __del__
        self._empty(val=None)
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 199, in _empty
        self._is_empty = True
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/proxy.py", line 44, in __setattr__
        object.__setattr__(self, name, value)
    AttributeError: can't set attribute
    10:07:16.823 INFO processor.TesserocrSegment - Detected line 'region0005_line0010': 2366,4729 
    2366,4729 2366,4729 2291,4740 2290,4740 2290,4740 2290,4740 2290,4740 2290,4740 2290,4741 2289,4741 
    2289,4741 2289,4741 2289,4741 2289,4741 2289,4742 2289,4742 2289,4742 2289,4780 2289,4780 2289,4780 
    2289,4781 2289,4781 2289,4781 2289,4781 2289,4781 2290,4781 2290,4782 2290,4782 2290,4782 2290,4782 
    2290,4782 2291,4782 2291,4782 2291,4782 2650,4795 2895,4801 2905,4801 2905,4801 3188,4781 3188,4781 
    3189,4781 3189,4781 3189,4781 3189,4781 3189,4781 3189,4780 3190,4780 3190,4780 3190,4780 3190,4780 
    3190,4780 3190,4779 3190,4779 3190,4779 3190,4768 3190,4768 3190,4768 3190,4767 3190,4767 3190,4767 
    3190,4767 3190,4767 3189,4767 3189,4766 3189,4766 3189,4766 3189,4766 3189,4766 3188,4766 3188,4766 
    2705,4736 2705,4736 2638,4732
    Traceback (most recent call last):
      File "/usr/local/sub-venv/headless-tf2/bin/ocrd-calamari-recognize", line 33, in <module>
        sys.exit(load_entry_point('ocrd-calamari', 'console_scripts', 'ocrd-calamari-recognize')())
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/build/ocrd_calamari/ocrd_calamari/cli.py", line 13, in ocrd_calamari_recognize
        return ocrd_cli_wrap_processor(CalamariRecognize, *args, **kwargs)
      File "/build/core/ocrd/ocrd/decorators/__init__.py", line 90, in ocrd_cli_wrap_processor
        raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
    Exception: Invalid input/output file grps:
            Input fileGrp[@USE='OCR-D-005'] not in METS!
    opened by jbarth-ubhd 0
  • ocrd_tesserocr processors waste CPU performance because of numpy blas threads

    The current code imports numpy although it only uses a single function from that library. Including numpy creates a number of threads for the BLAS algorithms by default. Those threads use a lot of CPU time without doing anything useful.

    Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.

    Maybe there exists a better solution which does not require an environment variable, for example removing the numpy requirement.

    opened by stweil 6
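
When processors are launched from a Python script, the workaround from the issue above can be applied per call rather than globally. The sketch below prepares an environment with OMP_THREAD_LIMIT set unless the caller has already chosen a value; the function name is hypothetical.

```python
import os


def limited_thread_env(base_env=None, limit=1):
    """Copy of the environment with OMP_THREAD_LIMIT set.

    An existing OMP_THREAD_LIMIT is left untouched, so an explicit
    choice made by the user still wins.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("OMP_THREAD_LIMIT", str(limit))
    return env


# Usage sketch (not executed here):
#   subprocess.run(["ocrd-tesserocr-segment-region", ...],
#                  env=limited_thread_env(), check=True)
```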
  • Problem with table recognition

    With tables where there are no horizontal lines, the workflow results in a wrong reading order by only recognizing the columns and no rows.
    See the following image as an example: catalog46muse_0564

    The result is as follows: OCR-D-TXT_catalog46muse_0564.txt

    This is the workflow used:

    ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
    ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
    ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
    ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
    ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
    ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
    ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
    ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
    ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
    ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
    ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
    
    opened by Shanksum 9
Releases(v0.16.0)
  • v0.16.0(Oct 25, 2022)

  • v0.15.0(Oct 23, 2022)

    Added:

    • binarize: dpi numerical parameter to specify pixel density, #186
    • binarize: tiseg boolean parameter to specify whether to call tessapi.AnalyseLayout for text-image separation, #186

    Changed:

    • recognize: improved polygon handling, #186
    • resources: proper support for moduledir, companion to OCR-D/core#904, #187
  • v0.14.0(Aug 14, 2022)

  • v0.13.6(Sep 28, 2021)

    Fixed:

    • segment/recognize: no find_tables when already looking for cells

    Changed:

    • segment/recognize: add param find_staves (for pageseg_apply_music_mask)
    • segment/recognize: :fire: set find_staves=false by default
  • v0.13.5(Sep 28, 2021)

  • v0.13.4(Jul 20, 2021)

    Fixed:

    • recognize: only reset API when xpath_model or auto_model is active
    • recognize: for glyph level output, reduce choice confidence threshold
    • recognize: for glyph level output, skip choices with same text
    • recognize: avoid projecting empty text results from lower levels

    Changed:

    • recognize: allow setting init-time (model-related) parameters
  • v0.13.3(Jul 20, 2021)

  • v0.13.2(Jul 20, 2021)

  • v0.13.1(Jul 20, 2021)

    Fixed:

    • deps-ubuntu/Docker: adapt to resmgr location mechanism, link to PPA models
    • recognize: :bug: skip detected segments if polygon cannot be made valid

    Changed:

    • deskew: add line-level operation for script detection
    • recognize: query more choices for textequiv_level=glyph if available
    • recognize: :fire: reset Tesseract API when applying model/param settings per segment
    • recognize: :eyes: allow configuring Tesseract parameters per segment via XPath queries
    • recognize: :eyes: allow selecting recognition model per segment via XPath queries
    • recognize: :eyes: allow selecting recognition model automatically via confidence
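Per-segment model selection is driven by a mapping from XPath queries to model names. The sketch below shows how such a parameter set might be assembled and quoted for the command line. The parameter name `xpath_model` appears in the v0.13.4 notes, but the mapping shape (query string to model name) and the two queries are assumptions made for illustration, so check the processor's own parameter description before relying on them.

```python
import json
import shlex

# Assumed shape: {XPath query evaluated on the segment: model to prefer}
params = {
    "model": "deu",  # fallback model when no query matches
    "xpath_model": {
        '@language="Latin"': "lat",
        'ancestor::TextRegion/@type="page-number"': "eng",
    },
}

# The queries contain double quotes, so let shlex do the shell quoting
command = shlex.join([
    "ocrd-tesserocr-recognize",
    "-I", "OCR-D-SEG-LINE",
    "-O", "OCR-D-OCR",
    "-p", json.dumps(params),
])
print(command)
```

Building the JSON with `json.dumps` and quoting it with `shlex.join` avoids hand-escaping the quotes inside the XPath expressions.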
  • v0.13.0(Jun 30, 2021)

  • v0.12.0(Mar 5, 2021)

    Changed:

    • resource lookup in a function to avoid module-level instantiation, #172
    • skip recognition of elements if they have pc:TextEquiv and overwrite_text is false-y, #170

    Added:

    • New parameter oem to explicitly set the engine backend to use, #168, #170
  • v0.11.0(Jan 29, 2021)

  • v0.10.1(Dec 10, 2020)

    Fixed:

    • segment*/recognize: reduce minimal region height to sane value
    • segment*/recognize: also disable text recognition if model is empty
    • segment-{region,line,word}: apply only single-level segmentation again
    • segment*/recognize: skip empty non-text blocks and all-reject words

    Changed:

    • segment*/recognize: add option shrink_polygons, default to false
    • segment*/recognize: add Tesseract version to meta-data
    • recognize: add option tesseract_parameters to expose all variables
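As an illustration of the two new options, the following sketch builds a parameter set that combines `shrink_polygons` with `tesseract_parameters`. It assumes that Tesseract variables are passed as strings; `preserve_interword_spaces` and `user_defined_dpi` are ordinary Tesseract variables picked as examples, and the helper `as_tesseract_variables` is hypothetical.

```python
import json


def as_tesseract_variables(settings):
    """Convert Python values to strings (assumed format for Tesseract variables)."""
    converted = {}
    for name, value in settings.items():
        if isinstance(value, bool):
            # Booleans become "1" / "0"
            converted[name] = "1" if value else "0"
        else:
            converted[name] = str(value)
    return converted


params = {
    "model": "deu",
    "shrink_polygons": False,  # the default according to the notes above
    "tesseract_parameters": as_tesseract_variables({
        "preserve_interword_spaces": True,
        "user_defined_dpi": 300,
    }),
}
print(json.dumps(params, indent=2))
```

The resulting JSON can be passed to the recognize processor with `-p`.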
  • v0.10.0(Dec 1, 2020)

    Fixed:

    • when padding images, add the offset to coords of new segments
    • when segmenting regions, skip empty output coords more robustly
    • deskew/segment/recognize: skip empty input images more robustly
    • crop: fix pageId of new derived image
    • recognize: fix missing RIL for terminal GetUTF8Text()
    • recognize: fix Confidence() vs MeanTextConf()

    Changed:

    • recognize: add all-in-one segmentation with flexible entry point
    • recognize: re-parameterize to segmentation_level+textequiv_level
    • recognize: :fire: rename overwrite_words to overwrite_segments
    • segment*: delegate to recognize
    • recognize: also annotate orientation and skew when segmenting regions
    • fontshape: new processor for TextStyle detection via pre-LSTM models
    • crop: also use existing text regions, if any
    • deskew: delegate to core for reflection and rotation
    • deskew: always get new image and set feature deskewed (even for 0°)
  • v0.9.5(Oct 1, 2020)

  • v0.9.4(Sep 24, 2020)

  • v0.9.3(Sep 15, 2020)

  • v0.9.2(Sep 4, 2020)

  • v0.9.1(Aug 16, 2020)

  • v0.9.0(Aug 6, 2020)

  • v0.8.5(Jun 5, 2020)

  • v0.8.4(Jun 5, 2020)

    Changed:

    • segment-region: in sparse_text mode, also add text lines

    Fixed:

    • Always set path to TESSDATA_PREFIX for tesserocr.get_languages, #129
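One reading of this fix is that the model directory is always passed to `tesserocr.get_languages` explicitly, rather than left to Tesseract's built-in default. A minimal sketch of that pattern follows; the function name `available_models` and the error handling are assumptions, not the project's actual code.

```python
import os


def available_models(env=None):
    """List the Tesseract models found under TESSDATA_PREFIX."""
    env = os.environ if env is None else env
    path = env.get("TESSDATA_PREFIX")
    if not path:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    # Imported lazily so the check above works without tesserocr installed
    import tesserocr
    # get_languages returns a (path, list of model names) tuple
    _, models = tesserocr.get_languages(path)
    return models
```

Calling `available_models()` with `TESSDATA_PREFIX` exported returns the names of the installed models; without it, the function fails early.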

  • v0.8.3(May 12, 2020)

  • v0.8.2(Apr 8, 2020)

    Fixed:

    • segment-region: no empty (invalid) ReadingOrder when no regions
    • segment-region: add sparse_text mode choice
    • segment-line: make intersection with parent more robust
    • segment-table: use SPARSE_TEXT mode for cells

    Changed:

    • Depend on OCR-D/core v2.4.4
    • Depend on sirfz/tesserocr v2.5.1
  • v0.8.1(Feb 17, 2020)

  • v0.8.0(Feb 17, 2020)

    Changed:

    • recognize: use lstm_choice_mode=2 for textequiv_level=glyph, #110
    • recognize: add char white/un/blacklisting parameters enhancement, #109

    Added:

    • all: add dpi parameter as manual override to image metadata enhancement, #108
  • v0.7.0(Feb 17, 2020)

    Added:

    • segment-table: new processor that adds table cells as text regions, #104
    • raw_lines option, #104
    • interpret overwrite_regions more consistently, #104
    • annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks, #104
    • no separators and noise regions in reading order, #104

    Changed:

    • docker image built on Ubuntu 18.04, #94, #97
    • Consistent setup of docker, #97
  • v0.6.0(Nov 5, 2019)

  • v0.5.1(Nov 5, 2019)

  • v0.4.1(Oct 31, 2019)

    • Adapt to feature selection/filtering mechanism for derived images in core
    • Fixes for image-feature-related corner cases in crop and deskew
    • Use explicit (second) output fileGrp when producing derived images
    • Upgrade to upstream tesserocr 2.4.1
    • Use OCR-D core >= stable 1.0.0
Owner
OCR-D
DFG coordination project for the further development of Optical Character Recognition methods