Run tesseract with the tesserocr bindings with @OCR-D's interfaces

OCR-D

Last update: Oct 14, 2022

Overview

ocrd_tesserocr

Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr

Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)

It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.

Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
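As a rough illustration of how these annotations sit in a PAGE file, the following sketch reads them back with the Python standard library. The sample document, its IDs and the helper names `parse_points` and `summarize` are made up for this example, and the namespace assumes the 2019-07-15 version of the PAGE schema; real files may use another version.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE namespace of the 2019-07-15 schema version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical minimal PAGE document, for illustration only.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page.tif" imageWidth="2000" imageHeight="3000" orientation="0.5">
    <pc:Border><pc:Coords points="50,50 1950,50 1950,2950 50,2950"/></pc:Border>
    <pc:TextRegion id="r1" type="paragraph">
      <pc:Coords points="100,100 900,100 900,300 100,300"/>
      <pc:TextLine id="r1_l1">
        <pc:Coords points="100,100 900,100 900,160 100,160"/>
        <pc:TextEquiv><pc:Unicode>Hello PAGE</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>"""


def parse_points(points):
    """Turn a Coords/@points string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def summarize(page_xml):
    """Collect page orientation, Border outline and line-level text results."""
    root = ET.fromstring(page_xml)
    page = root.find("pc:Page", NS)
    summary = {
        "orientation": float(page.get("orientation", "0")),
        "border": parse_points(page.find("pc:Border/pc:Coords", NS).get("points")),
        "lines": [],
    }
    for line in page.iterfind(".//pc:TextLine", NS):
        coords = parse_points(line.find("pc:Coords", NS).get("points"))
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        summary["lines"].append((line.get("id"), coords, text))
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In an actual OCR-D workflow you would more likely go through the OCR-D core Python API than parse the XML by hand; the sketch only shows where each kind of result lives.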

Installation

Required Ubuntu packages:

Tesseract headers (libtesseract-dev)
Some Tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...}), or better yet custom-trained models
Leptonica headers (libleptonica-dev)

From PyPI

This is the best option if you want to use the stable, released version.

NOTE

ocrd_tesserocr requires Tesseract >= 4.1.0. The Tesseract packages bundled with Ubuntu < 19.10 are too old. If you are on Ubuntu 18.04 LTS, please use Alexander Pozdnyakov's PPA repository, which has up-to-date builds of Tesseract and its dependencies:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update

sudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget
pip install ocrd_tesserocr

With docker

This is the best option if you want to run the software in a container.

You need to have Docker installed.

docker pull ocrd/tesserocr

To run with docker:

docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...

From git

This is the best option if you want to change the source code or install the latest, unpublished changes.

We strongly recommend using venv.

git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
sudo make deps-ubuntu # or manually with apt-get
make deps        # or pip install -r requirements.txt
make install     # or pip install .

Usage

For details, see docstrings in the individual processors and ocrd-tool.json descriptions, or simply --help.

Available OCR-D processors are:

ocrd-tesserocr-crop (simplistic)
- sets Border of pages and adds AlternativeImage files to the output fileGrp
ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
- sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
ocrd-tesserocr-binarize (Otsu – not recommended)
- adds AlternativeImage files to the output fileGrp
ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
- adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation (optionally)
- adds TextRegions to TableRegions and sets their @orientation (optionally)
- adds TextLines to TextRegions (optionally)
- adds Words to TextLines (optionally)
- adds Glyphs to Words (optionally)
- adds TextEquiv
ocrd-tesserocr-segment (all-in-one segmentation – recommended; delegates to recognize)
- adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
- adds TextRegions to TableRegions and sets their @orientation
- adds TextLines to TextRegions
- adds Words to TextLines
- adds Glyphs to Words
ocrd-tesserocr-segment-region (only regions – with overlapping bboxes; delegates to recognize)
- adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
- adds TextRegions to TableRegions
ocrd-tesserocr-segment-line (only lines – from overlapping regions; delegates to recognize)
- adds TextLines to TextRegions
ocrd-tesserocr-segment-word (only words; delegates to recognize)
- adds Words to TextLines
ocrd-tesserocr-fontshape (only text style – via Tesseract 3 models)
- adds TextStyle to Words

The text region @types detected are (from Tesseract's PolyBlockType):

paragraph: normal block (aligned with others in the column)
floating: unaligned block (is in a cross-column pull-out region)
heading: block that spans more than one column
caption: block for text that belongs to an image

If you are unhappy with these choices, consider post-processing with a dedicated custom processor in Python, or by modifying the PAGE files directly (e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
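For the Python route, a minimal sketch of the same retyping could look like the following. It works on a PAGE document held in a string and uses only the standard library. The function name `retype_regions` is hypothetical, and the namespace assumes the 2019-07-15 PAGE version, so adjust it to whatever your files declare.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE namespace of the 2019-07-15 schema version.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
# Keep the conventional "pc" prefix when serializing.
ET.register_namespace("pc", PAGE_NS)


def retype_regions(page_xml, old="floating", new="paragraph"):
    """Rewrite TextRegion/@type from `old` to `new`.

    Returns the modified XML as a string and the number of regions changed.
    """
    root = ET.fromstring(page_xml)
    changed = 0
    for region in root.iter("{%s}TextRegion" % PAGE_NS):
        if region.get("type") == old:
            region.set("type", new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

Note that ElementTree drops comments and processing instructions when parsing, so for files where those matter the xmlstarlet call may be the safer choice.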

All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:

after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization, or ocrd-cis-ocropy-clip on the line level
after region segmentation: use ocrd-segment-repair with plausibilize (and sanitize after line segmentation)
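To judge whether overlap is a problem on your material before adding such steps, you could measure it directly from the Coords/@points strings. The sketch below is one simple way to do that; the helper names and the choice of metric (intersection area relative to the smaller box) are assumptions made for this example, not something the processors compute.

```python
def bbox(points):
    """Bounding box (x0, y0, x1, y1) of a PAGE Coords/@points string."""
    xy = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs, ys = zip(*xy)
    return min(xs), min(ys), max(xs), max(ys)


def overlap_ratio(points_a, points_b):
    """Intersection area of two bounding boxes, relative to the smaller box."""
    ax0, ay0, ax1, ay1 = bbox(points_a)
    bx0, by0, bx1, by1 = bbox(points_b)
    width = min(ax1, bx1) - max(ax0, bx0)
    height = min(ay1, by1) - max(ay0, by0)
    if width <= 0 or height <= 0:
        return 0.0  # boxes are disjoint or merely touch
    smaller = min((ax1 - ax0) * (ay1 - ay0), (bx1 - bx0) * (by1 - by0))
    return width * height / smaller if smaller else 0.0


if __name__ == "__main__":
    # Two hypothetical neighbouring text lines sharing a 10 pixel band.
    line_a = "0,0 100,0 100,40 0,40"
    line_b = "0,30 100,30 100,70 0,70"
    print(overlap_ratio(line_a, line_b))
```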

It also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,

prefer ocrd-tesserocr-recognize with segmentation_level=region over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize, if you want to do all in one with Tesseract,
prefer ocrd-tesserocr-recognize with segmentation_level=line over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize, if you want to do everything but region segmentation with Tesseract,
prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region followed by (ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line, if you want to do everything but recognition with Tesseract.
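As a sketch of the first recommendation, the snippet below assembles a single ocrd-tesserocr-recognize call that segments and recognizes in one pass. The helper `recognize_command` is hypothetical, and the file group names and the model `deu` are only placeholders; check --help for the parameter values your installed version accepts.

```python
import shlex


def recognize_command(input_grp, output_grp, **params):
    """Build the argument list for one ocrd-tesserocr-recognize call."""
    argv = ["ocrd-tesserocr-recognize", "-I", input_grp, "-O", output_grp]
    for name, value in params.items():
        # Each parameter is passed as "-P name value".
        argv += ["-P", name, str(value)]
    return argv


# All-in-one: segment from the region level down and annotate text on words.
cmd = recognize_command(
    "OCR-D-BIN",
    "OCR-D-OCR",
    segmentation_level="region",
    textequiv_level="word",
    model="deu",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` inside a workspace directory.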

However, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize with shrink_polygons=True to get polygons by post-processing each segment, shrinking to the convex hull of all its symbol outlines.
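To make the idea of shrinking to a convex hull concrete, here is a small self-contained sketch. It is not the code used by the processors: it takes hypothetical symbol bounding boxes as a stand-in for the symbol outlines and computes their hull with Andrew's monotone chain algorithm.

```python
def convex_hull(points):
    """Convex hull of 2D points (Andrew's monotone chain), without collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Z component of the cross product of vectors o->a and o->b.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def shrink_to_symbols(symbol_boxes):
    """Polygon around all symbols, given (x0, y0, x1, y1) boxes per symbol."""
    corners = []
    for x0, y0, x1, y1 in symbol_boxes:
        corners += [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return convex_hull(corners)


if __name__ == "__main__":
    # A short symbol next to a tall one: the hull cuts the empty corner.
    print(shrink_to_symbols([(0, 10, 10, 30), (12, 0, 20, 30)]))
```

Compared with the plain bounding box (0,0)-(20,30), the hull leaves out the empty corner above the shorter symbol, which is the kind of tightening that reduces overlap between neighbours.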

Testing

make test

This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).

Comments

Add fontshape processor and all-in-one segmentation

We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

opened by bertsky 58
Memory leaks

The memory usage of ocrd-tesserocr-segment-region increases for each page, resulting in a total of about 7 GB for 200 pages, 8 GB for 248 pages, 10 GB for 282 pages, 11 GB for 313 pages (observed for http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-22977).

ocrd-tesserocr-segment-line shows a similar effect.

For that book, a machine with 8 GB RAM would have started swapping, thus slowing down the process extremely. Even a large server would get memory problems when processing large books with more than 1000 pages in parallel.

opened by stweil 31
improve segmentation
This fixes #101 (using raw_lines by default for textline images, but there are still some corner cases that need to be fixed in Tesseract) and brings a number of segmentation-related improvements:

interpret overwrite_regions more consistently

annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks

no separators and noise regions in reading order

segment tables into cells and lines so they can be OCRed, too
opened by bertsky 28

Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid)

21:19:10.443 INFO processor.TesserocrSegmentLine - INPUT FILE 65 / phys396119
21:19:10.577 INFO processor.TesserocrSegmentLine - Page 'phys396119' images will use DPI estimated from segmentation
21:19:10.850 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 0 107 at 0 107
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200904/bin/ocrd-tesserocr-segment-line", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_line())
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
    return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_line.py", line 119, in process
    interline = line_poly.intersection(region_poly)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/geometry/base.py", line 676, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f89253f7c88>

opened by stweil 23

Integration with OCR-D/spec#169 (resource manager)

This is just a proof-of-concept that it is possible to load tesseract models installed with ocrd resmgr into the cache directory.

The tricky part here is that there is only one TESSDATA_PREFIX but potentially multiple directories with models. So while it is no problem to look up models in various folders, only one of them can be used as the TESSDATA_PREFIX. Suggestions for a reasonable resolution to this dilemma are welcome.
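The lookup half of the problem can be sketched in a few lines. The helper names below are hypothetical and this is not the mechanism the project ended up with; it only illustrates searching several directories for a model file (Tesseract models are named <model>.traineddata) and exporting the first match as the single TESSDATA_PREFIX.

```python
import os


def find_tessdata_dir(model, search_dirs):
    """Return the first directory containing <model>.traineddata, else None."""
    for directory in search_dirs:
        if os.path.isfile(os.path.join(directory, model + ".traineddata")):
            return directory
    return None


def env_for_model(model, search_dirs, base_env=None):
    """Copy of the environment with TESSDATA_PREFIX pointing at the model's directory."""
    env = dict(os.environ if base_env is None else base_env)
    directory = find_tessdata_dir(model, search_dirs)
    if directory is None:
        raise FileNotFoundError("no %s.traineddata in %r" % (model, list(search_dirs)))
    env["TESSDATA_PREFIX"] = directory
    return env
```

This leaves the dilemma itself open: if a run needs two models that live in different directories, a single prefix cannot cover both.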

opened by kba 22
support more textequiv levels

This is an attempt to implement the other annotation levels. In my opinion, the behaviour for the different levels cannot be made completely analogous with Tesseract: simply pointing it to rectangles for words and glyphs (from an external layout segmentation) would produce results of far worse quality than always recognizing one complete line and allowing its own segmentation below it (accessible by iterators). In contrast, from the line level upwards we can reliably use its respective page segmentation mode (SINGLE_LINE / SINGLE_BLOCK / AUTO). Perhaps warnings and exceptions should be dealt with in a different, more systematic way though.

opened by bertsky 18
move to AlternativeImage feature selectors in OCR-D/core#294:
all: use second output position as fileGrp USE to produce AlternativeImage

all: get rid of MetadataItem/Labels-related FIXME: with the updated PAGE model, we can now use @externalModel and @externalId

all: use OcrdExif.resolution instead of xResolution

all: create images with monotonically growing @comments (features)

crop: use ocrd_utils.crop_image instead of PIL.Image.crop

crop: fix bug when trying to access page_image if there are already region coordinates that we are ignoring

crop: filter images already deskewed and cropped! (we must crop ourselves, and deskewing can not happen until afterwards)

deskew: fix bugs in configuration-dependent corner cases related to whether deskewing has already been applied (on the page or region level):

for the page image, never use images already rotated (both for page level and region level processing, but for the region level, do rotate images ad hoc if @orientation is present on the page level)

for the region image, never use images already rotated (except for our own page-level rotation)

segment-region: forgot to add feature "cropped" when producing cropped images

bug enhancement
opened by bertsky 16

pip install ocrd_tesserocr fails with tesseract version 4.0.0-beta-26-gfd49

I use pip install ocrd_tesserocr to install ocrd_tesserocr into my virtualenv environment. The installation fails with:

...
  Running setup.py bdist_wheel for tesserocr ... error
  Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-q7mozwr8 --python-tag cp37:
  Supporting tesseract v4.0.0
  Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
  running bdist_wheel
  running build
  running build_ext
  building 'tesserocr' extension
  creating build
  creating build/temp.linux-x86_64-3.7
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
  tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
  tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
     __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                             ^~~~~~~~~~~~~~~~~~~~~~~~
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for tesserocr
  Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr, ocrd-tesserocr
  Running setup.py install for tesserocr ... error
    Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-fc87h61b/install-record.txt --single-version-externally-managed --compile --install-headers /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/include/site/python3.7/tesserocr:
    Supporting tesseract v4.0.0
    Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
    running install
    running build
    running build_ext
    building 'tesserocr' extension
    creating build
    creating build/temp.linux-x86_64-3.7
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
    tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
       __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                               ^~~~~~~~~~~~~~~~~~~~~~~~
    error: command 'gcc' failed with exit status 1
...

tesseract is installed on the system:

tesseract 4.0.0-beta.4-26-gfd49
 leptonica-1.77.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.1) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1
 Found AVX
 Found SSE

opened by finkf 16

Superfluous newlines

At the moment, superfluous newlines are appended to the TextEquiv/Unicode entries:

                    <pc:TextEquiv>
                        <pc:Unicode>Groſzmaͤchtigſter</pc:Unicode>
                    </pc:TextEquiv>
                    <pc:TextEquiv>
                        <pc:Unicode>stzmächtigstcr
</pc:Unicode>

opened by finkf 16

Make it clearer which Tesseract engine is being used
Since Tesseract 4, two OCR engines are available: rule-based (i.e. --oem 0) and LSTM (--oem 1). The command line also exposes an ensemble of the two OCR engines (--oem 2). The documentation for ocrd-tesserocr-recognize does not make it clear which engine is used, and using any of the following parameters seems to have no effect on the recognition results:

-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "0" }'

-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "1" }'

-P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "2" }'

Which one of the OCR engines are we currently using?
opened by Witiko 12

ocrd-tesserocr-segment: segmentation fault

And with this image:

https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd2_-_1281.tif

and ocrd.sif (singularity container) created from docker ocrd_all at Nov 9 10:13 2021 & at Jan 17 15:11 2022 [UPDATE]

and this workflow:

/usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace init >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif >>ocrd.log 2>&1 || exit

/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 >>ocrd.log 2>&1 || exit
/usr/bin/time singularity exec $HOME/ocrd.sif ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models_experimental/historical_french_2020-10-14/*.ckpt.json" >>ocrd.log 2>&1 || exit

I get a segmentation fault:

Core was generated by `/usr/bin/python3 /usr/bin/ocrd-tesserocr-segment -P find_tables false -P shrink'.
Program terminated with signal 11, Segmentation fault.

opened by jbarth-ubhd 11

reverse order of glyphs inside words in PAGE-File for RTL languages
When using, for example, an Arabic model, recognition works fine, but the words inside the generated PAGE-XML contain reversed letters. The sequence of words itself is correct. Here is an example of a generated word with the wrong sequence of letters:

<pc:Word id="region0001_line0001_word0000"> <pc:Coords points="1620,372 1620,402 1703,402 1703,375 1647,376"/> <pc:TextEquiv conf="0.877831573486328"> <pc:Unicode>رصم</pc:Unicode> </pc:TextEquiv> </pc:Word>

but the line containing the recognized word should look like this:

<pc:Unicode>مصر</pc:Unicode>

(I know it is not easy to see clearly that it is reversed, because the letters in Arabic change appearance depending on their position inside the word, but this is handled by the font.)
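The difference between the two orders can be made explicit with a short sketch. It assumes the glyphs arrive in left-to-right visual order, as in the word above, and it only handles words written in a single direction; the helper names are hypothetical and this is not the fix applied in the processor.

```python
import unicodedata


def is_rtl(text):
    """True if the first strongly directional character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-like and Arabic-like letters
            return True
        if bidi == "L":
            return False
    return False


def to_logical_order(glyphs):
    """Turn glyphs given in left-to-right visual order into reading order."""
    if is_rtl("".join(glyphs)):
        return list(reversed(glyphs))
    return list(glyphs)


# Glyph sequence as reported in the PAGE output above (ra, sad, mim).
visual = ["\u0631", "\u0635", "\u0645"]
logical = "".join(to_logical_order(visual))
print(logical)
```

For this word the result starts with mim and ends with ra, which matches the expected line text.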

Here is the equivalent portion of the image:

REMARK: when using Tesseract standalone and generating ALTO, the sequence is correct!
opened by MihoMahi 3

montfaucon1719bd2_1, page 210, ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true

this image

https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.210.tif

UPDATE same for https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.168a_Planche_72.tif

with this workflow (latest ocrd_all as of 2021-12-01)

ocrd workspace init 
ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif 

ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 
ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 
ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 
ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 
ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 
ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd/_models/ocrd-calamari-recognize/c1_latin-script-hist-3/*.ckpt.json"

leads to these error messages:

10:06:58.121 INFO processor.TesserocrSegment - INPUT FILE 0 / P_00001
10:06:59.193 INFO processor.TesserocrSegment - Page 'P_00001' images will use 333 DPI from image 
meta-data
10:06:59.193 INFO processor.TesserocrSegment - Processing page 'P_00001'
10:07:00.229 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-005_00001.IMG-BIN, 
file_grp: OCR-D-005, path: OCR-D-005/OCR-D-005_00001.IMG-BIN.png
/build/ocrd_tesserocr/ocrd_tesserocr/recognize.py:510: ShapelyDeprecationWarning: The proxy 
geometries (through the 'asShape()', 'asPolygon()' or 'PolygonAdapter()' constructors) are 
deprecated and will be removed in Shapely 2.0. Use the 'shape()' function or the standard 
'Polygon()' constructor instead.
  for symbol in iterate_level(it, RIL.SYMBOL, parent=RIL.BLOCK)])
Exception ignored in: <bound method BaseGeometry.__del__ of 
<shapely.geometry.polygon.PolygonAdapter object at 0x7fc431060358>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 209, in __del__
    self._empty(val=None)
  File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 199, in _empty
    self._is_empty = True
  File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/proxy.py", line 44, in __setattr__
    object.__setattr__(self, name, value)
AttributeError: can't set attribute
10:07:00.930 INFO processor.TesserocrSegment - Detected region 'region0000': 2867,801 2418,798 
1883,799 1527,803 1527,803 1184,824 1184,824 1183,824 1183,824 1183,824 1183,824 1183,824 1183,825 
1181,827 1180,827 1180,827 1180,827 1180,827 1180,827 1180,828 1180,828 1180,828 1180,838 1172,2362 
1171,3063 1175,3451 1175,3451 1175,3451 1175,3452 1175,3452 1175,3452 1175,3452 1175,3452 1176,3452 
1176,3453 1176,3453 1176,3453 1176,3453 1176,3453 1177,3453 1260,3474 1260,3474 1260,3474 1304,3474 
1945,3458 1945,3458 3324,3389 3324,3389 3325,3389 3348,3382 3348,3382 3348,3382 3348,3382 3348,3382 
3348,3381 3349,3381 3349,3381 3349,3381 3349,3381 3349,3381 3349,3380 3349,3380 3349,3380 3387,1134 
3388,1069 3388,1069 3377,954 3377,954 3377,953 3377,953 3377,953 3377,953 3354,913 3354,913 
3353,913 3353,912 3353,912 3353,912 3353,912 3130,804 3130,804 3129,804 3129,804 3129,804 
(FLOWING_TEXT)
...
...
...
Exception ignored in: <bound method BaseGeometry.__del__ of 
<shapely.geometry.polygon.PolygonAdapter object at 0x7fc40f820710>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 209, in __del__
    self._empty(val=None)
  File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 199, in _empty
    self._is_empty = True
  File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/proxy.py", line 44, in __setattr__
    object.__setattr__(self, name, value)
AttributeError: can't set attribute
10:07:16.823 INFO processor.TesserocrSegment - Detected line 'region0005_line0010': 2366,4729 
2366,4729 2366,4729 2291,4740 2290,4740 2290,4740 2290,4740 2290,4740 2290,4740 2290,4741 2289,4741 
2289,4741 2289,4741 2289,4741 2289,4741 2289,4742 2289,4742 2289,4742 2289,4780 2289,4780 2289,4780 
2289,4781 2289,4781 2289,4781 2289,4781 2289,4781 2290,4781 2290,4782 2290,4782 2290,4782 2290,4782 
2290,4782 2291,4782 2291,4782 2291,4782 2650,4795 2895,4801 2905,4801 2905,4801 3188,4781 3188,4781 
3189,4781 3189,4781 3189,4781 3189,4781 3189,4781 3189,4780 3190,4780 3190,4780 3190,4780 3190,4780 
3190,4780 3190,4779 3190,4779 3190,4779 3190,4768 3190,4768 3190,4768 3190,4767 3190,4767 3190,4767 
3190,4767 3190,4767 3189,4767 3189,4766 3189,4766 3189,4766 3189,4766 3189,4766 3188,4766 3188,4766 
2705,4736 2705,4736 2638,4732
Traceback (most recent call last):
  File "/usr/local/sub-venv/headless-tf2/bin/ocrd-calamari-recognize", line 33, in <module>
    sys.exit(load_entry_point('ocrd-calamari', 'console_scripts', 'ocrd-calamari-recognize')())
  File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1128, in 
__call__
    return self.main(*args, **kwargs)
  File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1053, in 
main
    rv = self.invoke(ctx)
  File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1395, in 
invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 754, in 
invoke
    return __callback(*args, **kwargs)
  File "/build/ocrd_calamari/ocrd_calamari/cli.py", line 13, in ocrd_calamari_recognize
    return ocrd_cli_wrap_processor(CalamariRecognize, *args, **kwargs)
  File "/build/core/ocrd/ocrd/decorators/__init__.py", line 90, in ocrd_cli_wrap_processor
    raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
        Input fileGrp[@USE='OCR-D-005'] not in METS!

opened by jbarth-ubhd 0

ocrd_tesserocr processors waste CPU performance because of numpy blas threads

The current code imports numpy although it only uses a single function from that library. Including numpy creates a number of threads for the BLAS algorithms by default. Those threads use a lot of CPU time without doing anything useful.

Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.

Maybe there exists a better solution which does not require an environment variable, for example removing the numpy requirement.
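Until then, the workaround can at least be applied per process rather than globally. The following sketch prepares an environment with the variable set and leaves any value the user has already chosen untouched; the helper name `single_threaded_env` is hypothetical.

```python
import os


def single_threaded_env(base_env=None):
    """Copy of the environment with OMP_THREAD_LIMIT defaulting to 1."""
    env = dict(os.environ if base_env is None else base_env)
    # Respect an explicit setting; only fill in the limit when it is missing.
    env.setdefault("OMP_THREAD_LIMIT", "1")
    return env


# Usage sketch (not executed here):
#   import subprocess
#   subprocess.run(["ocrd-tesserocr-recognize", "--help"],
#                  env=single_threaded_env(), check=True)
```

The variable has to be in place before the processor starts, which is why it is passed through the environment of the child process instead of being set after import.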

opened by stweil 6

Problem with table recognition

With tables where there are no horizontal lines, the workflow results in a wrong reading order by only recognizing the columns and no rows.
See the following image as an example: catalog46muse_0564

The result is as follows: OCR-D-TXT_catalog46muse_0564.txt

This is the workflow used:

ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'

opened by Shanksum 9

Releases(v0.16.0)

v0.16.0(Oct 25, 2022)
Changed:

require newer OCR-D/core to include OCR-D/core#934, #188

no more need to set TESSDATA_PREFIX

improved and up-to-date README

v0.15.0(Oct 23, 2022)
Added:

binarize: dpi numerical parameter to specify pixel density, #186

binarize: tiseg boolean parameter to specify whether to call tessapi.AnalyseLayout for text-image separation, #186

Changed:

recognize: improved polygon handling, #186

resources: proper support for moduledir, companion to OCR-D/core#904, #187

v0.14.0(Aug 14, 2022)
Changed:

list all resources in the ocrd-tool.json, #184, OCR-D/core#800

custom --list-resources handler, #176

v0.13.6(Sep 28, 2021)
Fixed:

segment/recognize: no find_tables when already looking for cells

Changed:

segment/recognize: add param find_staves (for pageseg_apply_music_mask)

segment/recognize: :fire: set find_staves=false by default

v0.13.5(Sep 28, 2021)
Fixed:

recognize: prevent invalid empty Unicode glyph choices

v0.13.4(Jul 20, 2021)
Fixed:

recognize: only reset API when xpath_model or auto_model is active

recognize: for glyph level output, reduce choice confidence threshold

recognize: for glyph level output, skip choices with same text

recognize: avoid projecting empty text results from lower levels

Changed:

recognize: allow setting init-time (model-related) parameters

v0.13.3(Jul 20, 2021)
Changed:

recognize: on glyph level, fall back to RIL.SYMBOL if ChoiceIterator is empty

v0.13.2(Jul 20, 2021)
Fixed:

updated requirements

v0.13.1(Jul 20, 2021)
Fixed:

deps-ubuntu/Docker: adapt to resmgr location mechanism, link to PPA models

recognize: :bug: skip detected segments if polygon cannot be made valid

Changed:

deskew: add line-level operation for script detection

recognize: query more choices for textequiv_level=glyph if available

recognize: :fire: reset Tesseract API when applying model/param settings per segment

recognize: :eyes: allow configuring Tesseract parameters per segment via XPath queries

recognize: :eyes: allow selecting recognition model per segment via XPath queries

recognize: :eyes: allow selecting recognition model automatically via confidence

v0.13.0(Jun 30, 2021)
Changed:

segment*/recognize: annotate clipped,binarized AlternativeImage on page level

binarize: add page level, make default

v0.12.0(Mar 5, 2021)
Changed:

resource lookup in a function to avoid module-level instantiation, #172

skip recognition of elements if they have pc:TextEquiv and overwrite_text is false-y, #170

Added:

New parameter oem to explicitly set the engine backend to use, #168, #170

v0.11.0(Jan 29, 2021)
Changed:

Models are resolved via OCR-D/core resource manager default location ($XDG_DATA_HOME) or $TESSDATA_PREFIX, #166
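A rough sketch of such a lookup is shown below. Both the precedence ($TESSDATA_PREFIX first) and the subdirectory layout under $XDG_DATA_HOME are assumptions made for illustration, and resolve_model_dir is a hypothetical helper, not a function of this package or of OCR-D/core.

```python
import os
from pathlib import Path

def resolve_model_dir(env=None):
    """Guess the directory that holds Tesseract models.

    Assumed order: $TESSDATA_PREFIX if set, otherwise a per-processor
    directory below $XDG_DATA_HOME (falling back to ~/.local/share,
    the XDG default).
    """
    env = os.environ if env is None else env
    if env.get("TESSDATA_PREFIX"):
        return Path(env["TESSDATA_PREFIX"])
    data_home = env.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    # Assumed layout of the resource manager's default location.
    return Path(data_home) / "ocrd-resources" / "ocrd-tesserocr-recognize"

print(resolve_model_dir())
```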

v0.10.1(Dec 10, 2020)
Fixed:

segment*/recognize: reduce minimal region height to sane value

segment*/recognize: also disable text recognition if model is empty

segment-{region,line,word}: apply only single-level segmentation again

segment*/recognize: skip empty non-text blocks and all-reject words

Changed:

segment*/recognize: add option shrink_polygons, default to false

segment*/recognize: add Tesseract version to meta-data

recognize: add option tesseract_parameters to expose all variables

v0.10.0(Dec 1, 2020)
Fixed:

when padding images, add the offset to coords of new segments

when segmenting regions, skip empty output coords more robustly

deskew/segment/recognize: skip empty input images more robustly

crop: fix pageId of new derived image

recognize: fix missing RIL for terminal GetUTF8Text()

recognize: fix Confidence() vs MeanTextConf()
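The padding fix in the list above concerns coordinate bookkeeping: segments detected in a padded image are shifted by the padding width relative to the unpadded one. The sketch below shows one way to compensate for that. The helper unpad_polygon and its sign convention are assumptions for illustration, not the processor's code.

```python
def unpad_polygon(polygon, padding, parent_offset=(0, 0)):
    """Map points found in a padded crop back to page coordinates.

    polygon:       list of (x, y) points relative to the padded image
    padding:       number of pixels added on every side of the crop
    parent_offset: top-left corner of the unpadded crop within the page
    """
    off_x, off_y = parent_offset
    return [(x - padding + off_x, y - padding + off_y) for x, y in polygon]

# A box starting exactly at the inner edge of a 5 px padding.
detected = [(5, 5), (25, 5), (25, 15), (5, 15)]
on_page = unpad_polygon(detected, padding=5, parent_offset=(100, 200))
print(on_page)
```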

Changed:

recognize: add all-in-one segmentation with flexible entry point

recognize: re-parameterize to segmentation_level+textequiv_level

recognize: :fire: rename overwrite_words to overwrite_segments

segment*: delegate to recognize

recognize: also annotate orientation and skew when segmenting regions

fontshape: new processor for TextStyle detection via pre-LSTM models

crop: also use existing text regions, if any

deskew: delegate to core for reflection and rotation

deskew: always get new image and set feature deskewed (even for 0°)

v0.9.5(Oct 1, 2020)
Fixed:

logging according to OCR-D/core#599 (again)

v0.9.4(Sep 24, 2020)
Fixed:

recognize: be robust to different input image modes, Pillow#4925

logging according to https://github.com/OCR-D/core/pull/599

v0.9.3(Sep 15, 2020)
Fixed:

segmentation: ensure new elements fit into their parent coords

segmentation: ensure valid coords
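To illustrate what fitting into the parent means, here is a deliberately simplified sketch that clamps a child polygon to the bounding box of its parent. A real implementation would more likely intersect the actual polygons; clip_to_parent is a hypothetical helper.

```python
def clip_to_parent(polygon, parent_polygon):
    """Clamp every point of polygon to the bounding box of parent_polygon.

    This is a bounding-box simplification, not a true polygon intersection.
    """
    xs = [x for x, _ in parent_polygon]
    ys = [y for _, y in parent_polygon]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    return [(min(max(x, min_x), max_x), min(max(y, min_y), max_y))
            for x, y in polygon]

parent = [(0, 0), (100, 0), (100, 50), (0, 50)]
child = [(-5, 10), (120, 10), (120, 60), (-5, 60)]
clipped = clip_to_parent(child, parent)
print(clipped)
```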

v0.9.2(Sep 4, 2020)
Fixed:

segment-region: just ignore region outside of page frame, #145

deskew: add suffix to AlternativeImage file ID, #148

v0.9.1(Aug 16, 2020)
Fixed:

crop: allow running on deskewed page, clip Border to original frame

deskew: refactoring artefact from #133, #142

v0.9.0(Aug 6, 2020)
Changed:

All processors write to a single file group, #133

All processors set pg:PcGts/pcGtsId to file_id consistently, #136

v0.8.5(Jun 5, 2020)
Fixed:

segment-region: ensure polygons are within page/Border

v0.8.4(Jun 5, 2020)
Changed:

segment-region: in sparse_text mode, also add text lines

Fixed:

Always set path to TESSDATA_PREFIX for tesserocr.get_languages, #129

v0.8.3(May 12, 2020)
Fixed:

recognize: ignore empty RO group

Changed:

recognize: add padding parameter

v0.8.2(Apr 8, 2020)
Fixed:

segment-region: no empty (invalid) ReadingOrder when no regions

segment-region: add sparse_text mode choice

segment-line: make intersection with parent more robust

segment-table: use SPARSE_TEXT mode for cells

Changed:

Depend on OCR-D/core v2.4.4

Depend on sirfz/tesserocr v2.5.1

v0.8.1(Feb 17, 2020)
Fixed:

recognize: fix buggy RTL behavior, glyph confidence defaults to 1, #112, #113

v0.8.0(Feb 17, 2020)
Changed:

recognize: use lstm_choice_mode=2 for textequiv_level=glyph, #110

recognize: add char white/un/blacklisting parameters enhancement, #109

Added:

all: add dpi parameter as manual override to image metadata enhancement, #108

v0.7.0(Feb 17, 2020)
Added:

segment-table: new processor that adds table cells as text regions, #104

raw_lines option, #104

interpret overwrite_regions more consistently, #104

annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks, #104

no separators and noise regions in reading order, #104

Changed:

docker image built on Ubuntu 18.04, #94, #97

Consistent setup of docker, #97

v0.6.0(Nov 5, 2019)
Changed:

Depend on OCR-D/core v2.0.0

v0.5.1(Nov 5, 2019)
Fixed:

Correct version in ocrd-tool.json, #76

v0.4.1(Oct 31, 2019)
Adapt to feature selection/filtering mechanism for derived images in core

Fixes for image-feature-related corner cases in crop and deskew

Use explicit (second) output fileGrp when producing derived images

Upgrade to upstream tesserocr 2.4.1

Use OCR core >= stable 1.0.0


Owner

OCR-D

DFG coordination project for the further development of Optical Character Recognition methods

GitHub

