Python-based tools for document analysis and OCR

OCRopus

Last update: Dec 31, 2022

Overview

ocropy

OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do some image preprocessing, and possibly also train new models.

In addition to the recognition scripts themselves, there are a number of scripts for ground truth editing and correction, measuring error rates, determining confusion matrices, etc. OCRopus commands will generally print a stack trace along with an error message; this is not generally indicative of a problem (in a future release, we'll suppress the stack trace by default since it seems to confuse too many users).

Installing

To install OCRopus dependencies system-wide:

$ sudo apt-get install $(cat PACKAGES)
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ sudo python setup.py install

Alternatively, dependencies can be installed into a Python Virtual Environment:

$ virtualenv ocropus_venv/
$ source ocropus_venv/bin/activate
$ pip install -r requirements.txt
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ python setup.py install

Dependencies can also be installed with Conda:

$ conda create -n ocropus_env python=2.7
$ conda activate ocropus_env
$ conda install --file requirements.txt
$ wget -nd https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz
$ mv en-default.pyrnn.gz models/
$ python setup.py install

To test the recognizer, run:

$ ./run-test

Running

To recognize pages of text, you need to run separate commands: binarization, page layout analysis, and text line recognition. The default parameters and settings of OCRopus assume 300dpi binary black-on-white images. If your images are scanned at a different resolution, the simplest thing to do is to downscale/upscale them to 300dpi. The text line recognizer is fairly robust to different resolutions, but the layout analysis is quite resolution dependent.
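
As a rough sketch of the rescaling arithmetic (the helper name `target_size_for_300dpi` is hypothetical and not part of OCRopus), the pixel dimensions needed to bring a scan to 300dpi follow from the ratio of target to source resolution:

```python
def target_size_for_300dpi(width_px, height_px, source_dpi, target_dpi=300):
    """Return the (width, height) in pixels that a page scanned at
    source_dpi should be resized to so that it corresponds to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = float(target_dpi) / source_dpi
    return int(round(width_px * scale)), int(round(height_px * scale))


# A US Letter page scanned at 600dpi is 5100 x 6600 pixels;
# halving both dimensions brings it to 300dpi.
print(target_size_for_300dpi(5100, 6600, 600))
```

The resampling itself can then be done with whatever image tool is at hand; this sketch only computes the target size.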

Here is an example for a page of Fraktur text (German); you need to download the Fraktur model from https://github.com/zuphilip/ocropy-models/raw/master/fraktur.pyrnn.gz to run this example:

# perform binarization
./ocropus-nlbin tests/ersch.png -o book

# perform page layout analysis
./ocropus-gpageseg 'book/????.bin.png'

# perform text line recognition (on four cores, with a fraktur model)
./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png'

# generate HTML output
./ocropus-hocr 'book/????.bin.png' -o ersch.html

# display the output
firefox ersch.html
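
The quoted glob patterns in the commands above select the files written by the previous stage. As a small illustration using Python's standard `fnmatch` module (the file names are examples modelled on the paths that appear in this document, and `classify` is a hypothetical helper):

```python
from fnmatch import fnmatch

PAGE_PATTERN = "book/????.bin.png"          # binarized pages, input to ocropus-gpageseg
LINE_PATTERN = "book/????/??????.bin.png"   # text line images, input to ocropus-rpred


def classify(path):
    """Return 'line', 'page', or None depending on which pattern matches."""
    if fnmatch(path, LINE_PATTERN):
        return "line"
    if fnmatch(path, PAGE_PATTERN):
        return "page"
    return None


for example in ["book/0001.bin.png", "book/0001/010001.bin.png", "book/0001.png"]:
    print(example, "->", classify(example))
```

Note that, unlike shell globbing, `fnmatch` lets `?` match a path separator, so this is an approximation of what the shell expands.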

There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including "?"), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release, and, of course, you are welcome to contribute new, trained models.

You can also generate training data using ocropus-linegen:

ocropus-linegen -t tests/tomsawyer.txt -f tests/DejaVuSans.ttf

This will create a directory "linegen/..." containing training data suitable for training OCRopus with synthetic data.

Roadmap

Project Announcements
The text line recognizer has been ported to C++ and is now a separate project, the CLSTM project, available here: https://github.com/tmbdev/clstm
New GPU-capable text line recognizers and deep-learning based layout analysis methods are in the works and will be published as separate projects some time in 2017.
Please welcome @zuphilip and @kba as additional project maintainers. @tmb is busy developing new DNN models for document analysis (among other things). (10/15/2016)

A lot of excellent packages have become available for deep learning, vision, and GPU computing over the last few years. At the same time, it has become feasible now to address problems like layout analysis and text line following through attentional and reinforcement learning mechanisms. I (@tmb) am planning on developing new software using these new tools and techniques for the traditional document analysis tasks. These will become available as separate projects.

Note that for text line recognition and language modeling, you can also use the CLSTM command line tools. Except for taking different command line options, they are otherwise drop-in replacements for the Python-based text line recognizer.

Contributing

OCRopy and CLSTM are both command line driven programs. The best way to contribute is to create new command line programs using the same (simple) persistent representations as the rest of OCRopus.

The biggest needs are in the following areas:

text/image segmentation
text line detection and extraction
output generation (hOCR and hOCR-to-* transformations)

CLSTM vs OCRopy

The CLSTM project (https://github.com/tmbdev/clstm) is a replacement for ocropus-rtrain and ocropus-rpred in C++ (it used to be a subproject of ocropy but has been moved into a separate project now). It is significantly faster than the Python versions and has minimal library dependencies, so it is suitable for embedding into C++ programs.

Python and C++ models cannot be interchanged, both because the save file formats are different and because the text line normalization is slightly different. Error rates are about the same.

In addition, the C++ command line tool (clstmctc) has different command line options and currently requires loading training data into HDF5 files, instead of being trained off a list of image files directly (image file-based training will be added to clstmctc soon).

The CLSTM project also provides LSTM-based language modeling that works very well with post-processing and correcting OCR output, as well as solving a number of other OCR-related tasks, such as dehyphenation or changes in orthography (see our publications). You can train language models using clstmtext.

Generally, your best bet for CLSTM and OCRopy is to rely only on the command line tools; that makes it easy to replace different components. In addition, you should keep your OCR training data in .png/.gt.txt files so that you can easily retrain models as better recognizers become available.
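
As a sketch of how such paired training data might be checked before retraining, assuming each line image has its ground truth in a `.gt.txt` file with the same stem in the same directory (that naming rule and both helper names are assumptions for illustration):

```python
import os


def ground_truth_path(image_path):
    """Map a line image such as 'gt/0001/01000b.bin.png' to the
    ground-truth text file assumed to sit next to it."""
    directory, name = os.path.split(image_path)
    stem = name.split(".", 1)[0]          # '01000b.bin.png' -> '01000b'
    return os.path.join(directory, stem + ".gt.txt")


def missing_ground_truth(image_paths, exists=os.path.exists):
    """Return the images that have no matching .gt.txt file yet."""
    return [p for p in image_paths if not exists(ground_truth_path(p))]
```

Passing a custom `exists` callable makes the check easy to try out without touching the file system.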

After making CLSTM a full replacement for ocropus-rtrain/ocropus-rpred, the next step will be to replace the binarization, text/image segmentation, and layout analysis in OCRopus with trainable 2D LSTM models.

Comments

How to always read left to right?

Hi guys,

I've been developing a bit with Ocropy but it sometimes seems to read from top to bottom, I'd like it to always read from left to right, no matter what. Does anybody have any clue on how to do this?

P.S.: my apologies for creating an issue for this.
:grey_question: question

opened by Yenthe666 16

error while training

After executing ocropus-rtrain gt/????/*.png -F 10000 -o mub_combined & (on 156 ground-truth text and image files), I got the following reproducible error:

454 150.32 (1486, 48) gt/0001/01000b.bin.png TRU: u'quod dicitur Fulda, quod est situm in pago Grapfeld, constructum in honore sancti' ALN: u'quuod dicituur Fuulda, qquod et situumm in pagoo Grapfeld, construuctuuumm in honnore ' OUT: u' iiii ii te ti imm tm e iii eutmut m mi eii '

oops, got FloatingPointError overflow encountered in exp

Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 228, in <module>
    pcs = network.trainSequence(line,cs,update=do_update,key=fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 863, in trainSequence
    self.outputs = array(self.lstm.forward(xs))
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 587, in forward
    xs = net.forward(xs)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 636, in forward
    outputs = [net.forward(xs) for net in self.nets]
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 545, in forward
    self.WIP,self.WFP,self.WOP)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 419, in forward_py
    go[t] = ffunc(gox[t])
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 367, in ffunc
    return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 232, in <module>
    network = ocrolib.load_object(last_save)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 502, in load_object
    fname = ocropus_find_file(fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 680, in ocropus_find_file
    if os.path.exists(fname):
  File "/usr/lib/python2.7/genericpath.py", line 18, in exists
    os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

another case with half of the files (dir 0001 only):

960 110.63 (1490, 48) gt/0001/010022.bin.png TRU: u'in honorem\u2074 domini salvatoris Jesu Christi et beate Marie genetricis\u2075 eius episco-' ALN: u'in honorem~ domini salvatoris Jesu Christi et beate MMarie genetricis eius episco-' OUT: u'iu bouoreu ouiui salvatoris lesu bristi et beate arie geuetricis eius episoo-'

oops, got FloatingPointError overflow encountered in exp

Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 228, in <module>
    pcs = network.trainSequence(line,cs,update=do_update,key=fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 863, in trainSequence
    self.outputs = array(self.lstm.forward(xs))
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 587, in forward
    xs = net.forward(xs)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 636, in forward
    outputs = [net.forward(xs) for net in self.nets]
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 619, in forward
    return self.net.forward(xs[::-1])[::-1]
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 545, in forward
    self.WIP,self.WFP,self.WOP)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 419, in forward_py
    go[t] = ffunc(gox[t])
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 367, in ffunc
    return 1.0/(1.0+exp(-x))
FloatingPointError: overflow encountered in exp
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 232, in <module>
    network = ocrolib.load_object(last_save)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 502, in load_object
    fname = ocropus_find_file(fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/common.py", line 680, in ocropus_find_file
    if os.path.exists(fname):
  File "/usr/lib/python2.7/genericpath.py", line 18, in exists
    os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found
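
The first traceback ends in ffunc, where exp(-x) overflows for strongly negative x. The v1.3.1 release notes list "Clip exponential in ffunc to avoid overflow" (#201). The following is only a sketch of that general idea in plain Python, not the code from that change; the name `clipped_sigmoid` and the bound of 100 are assumptions chosen for illustration:

```python
import math


def clipped_sigmoid(x, bound=100.0):
    """Logistic function with the argument clipped to [-bound, bound],
    so that math.exp never receives a value large enough to overflow.
    (Unclipped, math.exp(1e6) raises OverflowError.)"""
    x = max(-bound, min(bound, x))
    return 1.0 / (1.0 + math.exp(-x))


print(clipped_sigmoid(0.0))    # midpoint of the logistic curve
print(clipped_sigmoid(-1e6))   # tiny but finite, no exception
```

Clipping changes the result only where the sigmoid is already saturated, so the effect on training should be negligible under this assumption.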

opened by stexandev 16

Is there a way to access confidence level?

We are running it on some documents and we need absolute accuracy so we are using human proofreading, but is there any way to access confidence level so that we can just examine the ones with low confidence level?
:sparkles: enhancement
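
The v1.2.0 release notes mention a --probabilities option in ocropus-rpred (#135). Assuming per-character probabilities have been obtained in some form, a triage step along the lines asked for here could be sketched as follows. The in-memory layout of `(line_id, [(char, probability), ...])` and the function name are hypothetical and do not describe ocropy's output format:

```python
def lines_needing_review(lines, threshold=0.9):
    """lines: iterable of (line_id, [(char, probability), ...]).
    Returns the ids of lines whose least certain character falls below
    the threshold, so only those are sent to a human proofreader."""
    flagged = []
    for line_id, chars in lines:
        probs = [p for _, p in chars]
        if probs and min(probs) < threshold:
            flagged.append(line_id)
    return flagged


sample = [
    ("010001", [("a", 0.99), ("b", 0.95)]),
    ("010002", [("c", 0.99), ("d", 0.40)]),
]
print(lines_needing_review(sample))
```

Using the minimum rather than the mean is a design choice: a single doubtful character is enough to send the line to review.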

opened by 1a1a11a 15

Higher error probability with first letter of line

Expected Behavior

Hello. I am training Ocropus with the Hume dialogues pages. I am following a methodology of look-ahead simulations. A trained model, starting with the default model, is applied to the Hume images, starting with page 8. Lines with errors are then picked as the training set and used to train the model. This is then applied to the pages that follow, and so on. One thing I am observing is that there are more errors with the first letter than with a letter at any other position on a line. Is this expected, or is it a bug or a deficiency?

Current Behavior

Expecting the error rate would be the same at any position of a line.

Possible Solution

Steps to Reproduce (for bugs)

1. Download the Hume dialogues pages.

2. Run Ocropus on these images (the image segmentation is easier starting at page 8, so I started with that).

3. Pick out lines that show errors. Generate text files with the corrected lines.

4. Train Ocropus with the lines so produced.

5. Repeat steps 2, 3 and 4, each time running Ocropus on subsequent pages.

Your Environment

Python version: 2.7.10 for training, 2.7.6 while running on images

Git revision of ocropy: Not sure. I downloaded it 11 Feb 2017.

Operating System and version: Cray supercomputer for training; bash on Ubuntu on Windows 10 for running on images

:grey_question: question
opened by urhub 14

getting typeerror object of type 'NoneType' has no len()

I am trying to run ./run-test, but I get the following error:

$ ./run-test
INFO:  # ./tests/testpage.png
INFO:  === ./tests/testpage.png 1
INFO:  estimating skew angle
INFO:  estimating thresholds
INFO:  rescaling
INFO:  ./tests/testpage.png lo-hi (0.39 1.44) angle  0.1 no-normalization
INFO:  writing
INFO:
INFO:  ########## C:/Users/allud/ocropy/env/Scripts/ocropus-gpageseg temp/????
INFO:
INFO:  temp\0001.bin.png
INFO:  scale 19.493589
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 3 whitespace column separators
INFO:  computing lines
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 100
INFO:  finding reading order
INFO:  writing lines
INFO:      91  temp\0001.bin.png 19.5 92
INFO:
INFO:  ########## C:/Users/allud/ocropy/env/Scripts/ocropus-rpred -n temp/????
INFO:
INFO:  #inputs: 92
Traceback (most recent call last):
  File "C:/Users/allud/ocropy/env/Scripts/ocropus-rpred", line 120, in <module>
    network = ocrolib.load_object(args.model,verbose=1)
  File "c:\python27\Lib\ocrolib\common.py", line 435, in load_object
    fname = ocropus_find_file(fname)
  File "c:\python27\Lib\ocrolib\common.py", line 625, in ocropus_find_file
    sysconfig.get_config_var("datarootdir"), "ocropus"))
  File "C:\Users\allud\ocropy\env\lib\ntpath.py", line 65, in join
    result_drive, result_path = splitdrive(path)
  File "C:\Users\allud\ocropy\env\lib\ntpath.py", line 115, in splitdrive
    if len(p) > 1:
TypeError: object of type 'NoneType' has no len()
(env)

please help.

opened by CruzzRazor 12

Add new tags for older releases

Hi @kba, @zuphilip :smile:

I suggest adding new tags for older releases: 0.5, 0.5.4, 0.6, 0.7. See: https://github.com/tmbdev/ocropy/wiki/Older-versions

I also suggest removing these confusing tags and adding new tags instead:

classic-ocropy-0.1.1 => 0.7.2 / 0.8.1
classic-ocropy-0.1 => 0.7.1 / 0.8.0

opened by amitdo 12

Not segmenting if image size is less than 600x600

The first step, binarization, works correctly for my image. But in the segmentation step, if the height or the width (or both) is less than 600 pixels, ocropus does not segment the binarized image. Is it necessary to have an image of at least 600x600? Upsampling the image makes the OCR results miserable. And why is there a 600x600 limitation? I am giving a cropped input to ocropus, so resizing is a problem.

opened by srika91 11

Updating the wiki

I am spending some hours on a class project to update Ocropy documentation. I would like to collect requests for documentation you want to see in the Wiki.

Possible Solution

Please comment here on what you would like to see added or changed in the documentation.
:pencil2: documentation

opened by urhub 10

Travis CI builds

A very simple setup for testing via Travis. It uses Miniconda to install dependencies and then runs the test script. I originally tried a straight install, but compiling SciPy took so long, it ran out of build time. The conda install is much much quicker.

You will of course have to enable Travis for your repo before this has a real effect. And it doesn't really check whether the result is correct, only that something is produced, but that's how the existing test script is. One could also do coverage testing using the other script and coveralls.io, but that's something for another PR.

opened by QuLogic 10

Could you recommend some materials about the algorithm you use?

Hi! I find this project very interesting and I want to learn from it. Could you recommend some materials (papers or books) that you referred to in this project? Thank you very much.

opened by hsmyy 10

RuntimeError: could not open display

After installing on a command-line-only CentOS server, I get the following error when I try to run any OCRopus command:

$ ./ocropus-nlbin -h
Traceback (most recent call last):
 File "./ocropus-nlbin", line 5, in <module>
   from pylab import *
 File "/usr/lib64/python2.7/site-packages/pylab.py", line 1, in <module>
   from matplotlib.pylab import *
 File "/usr/lib64/python2.7/site-packages/matplotlib/pylab.py", line 265, in <module>
   from matplotlib.pyplot import *
 File "/usr/lib64/python2.7/site-packages/matplotlib/pyplot.py", line 97, in <module>
   _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
 File "/usr/lib64/python2.7/site-packages/matplotlib/backends/__init__.py", line 25, in pylab_setup
   globals(),locals(),[backend_name])
 File "/usr/lib64/python2.7/site-packages/matplotlib/backends/backend_gtkagg.py", line 10, in <module>
   from matplotlib.backends.backend_gtk import gtk, FigureManagerGTK, FigureCanvasGTK,\
 File "/usr/lib64/python2.7/site-packages/matplotlib/backends/backend_gtk.py", line 13, in <module>
   import gtk; gdk = gtk.gdk
 File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 64, in <module>
   _init()
 File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 52, in _init
   _gtk.init_check()
RuntimeError: could not open display
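
The traceback shows matplotlib loading its GTK backend, which needs a display. A commonly used workaround, offered here as an assumption rather than a confirmed fix for this issue, is to select the non-interactive Agg backend. Matplotlib 1.5 and later read the backend from the MPLBACKEND environment variable; older releases may need the backend setting in matplotlibrc instead. The helper name `headless_env` is hypothetical:

```python
import os


def headless_env(base_env=None):
    """Copy an environment and request matplotlib's non-interactive Agg
    backend, unless a backend has already been chosen by the caller."""
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("MPLBACKEND", "Agg")
    return env


# Possible use (not executed here):
#   subprocess.call(["./ocropus-nlbin", "-h"], env=headless_env())
print(headless_env({})["MPLBACKEND"])
```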

Expected Behavior

On my home Ubuntu within Gnome command line it works just fine.

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

Install on CentOS command-line only server (without display)
Run any command, like $ ./ocropus-nlbin -h
Get the error

Your Environment

Python version: 2.7.5

Git revision of ocropy:

commit 358df8d104cf78fb0104bd28f333f272d908d4c3
Merge: dacf0fc e016e74
Author: Philipp Zumstein
Date: Mon May 22 22:38:33 2017 +0200

    Merge pull request #219 from tmbdev/del-bbox-func

    Delete unused function bounding_box in ocropus-linegen

Operating System and version: CentOS Linux release 7.3.1611 (Core)

:computer: installation

opened by vlad-wonderkidstudio 9

On-premise to cloud migration issue
opened by cristinelpopescu 0

I want to generate 1,000 synthetic data samples. Where do I set the number of samples to be generated? Thanks
opened by marutcomp 0

EOF error with cpickle.Unpickler in common.py

I am trying to run the very basic example found in the README file.

I reached the following line:

./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/0001/010001.bin.png'

But it gives me the following error:

INFO:
INFO:  ########## ./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz book/0001/01
INFO:
INFO:  #inputs: 1
# loading object .\.\models/fraktur.pyrnn.gz
Traceback (most recent call last):
  File "./ocropus-rpred", line 120, in <module>
    network = ocrolib.load_object(args.model,verbose=1)
  File "C:\Users\96171\Desktop\ocropy\ocrolib\common.py", line 445, in load_object
    return unpickler.load()
EOFError

The error is appearing in the following function from common.py:

def load_object(fname,zip=0,nofind=0,verbose=0):
    """Loads an object from disk. By default, this handles zipped files
    and searches in the usual places for OCRopus. It also handles some
    class names that have changed."""
    if not nofind:
        fname = ocropus_find_file(fname)
    if verbose:
        print("# loading object", fname)
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"rb") as stream:
        with os.popen("gunzip < '%s'"%fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
    else:
        with open(fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
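
The commented-out line in the function above hints at an alternative to piping the file through an external gunzip. One possible explanation for the EOFError, an assumption not confirmed in this thread, is that no gunzip executable is available on Windows, so the unpickler receives an empty stream. A minimal sketch of reading a gzipped pickle with the standard library only follows. It is written for Python 3's `pickle`, the name `load_gzipped_pickle` is hypothetical, and the class-renaming hook from the original function is omitted:

```python
import gzip
import os
import pickle
import tempfile


def load_gzipped_pickle(fname):
    """Read a pickled object from a .gz file without spawning gunzip."""
    with gzip.open(fname, "rb") as stream:
        return pickle.load(stream)


# Round-trip a small object to show the helper in use.
path = os.path.join(tempfile.mkdtemp(), "demo.pyrnn.gz")
with gzip.open(path, "wb") as stream:
    pickle.dump({"name": "demo", "weights": [1, 2, 3]}, stream)

restored = load_gzipped_pickle(path)
print(restored)
```

As with any pickle, this should only be used on model files from a trusted source.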

opened by hiyamgh 2

AssertionError: you must install and use OCRopus with Python version 2.7 or later, but not Python 3.x

I used a Python 2.7 virtual environment to install requirements.txt, but I got the following:

DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:380: SNIMissingWarning: An HTTPS request has been made, but the SNI (Server Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  SNIMissingWarning,
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)': /simple/numpy/
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)': /simple/numpy/
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)': /simple/numpy/
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)': /simple/numpy/
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)': /simple/numpy/
Could not fetch URL https://pypi.org/simple/numpy/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/numpy/ (Caused by SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)) - skipping
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
ERROR: Could not find a version that satisfies the requirement numpy (from versions: none)
ERROR: No matching distribution found for numpy
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError(SSLError(1, '_ssl.c:499: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version'),)) - skipping
C:\Users\User\venv\ocropus\lib\site-packages\pip\_vendor\urllib3\util\ssl_.py:139: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning,
(ocropus)

So I ended up using Python 3.6.

With Python 3.6 I was able to install requirements.txt, but when I now try to run setup.py install I get the following error:

AssertionError: you must install and use OCRopus with Python version 2.7 or later, but not Python 3.x

So how can I use Python 2.7 when I am not able to install requirements.txt for it?
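
The assertion message describes the accepted range: Python 2.7 or later, but not 3.x. As an illustration of that condition only (this is not the code from setup.py, and `is_supported_python` is a hypothetical name):

```python
import sys


def is_supported_python(version_info=None):
    """True for Python 2.7.x, False for anything older or for Python 3."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    return (major, minor) >= (2, 7) and major < 3


for candidate in [(2, 6, 9), (2, 7, 17), (3, 6, 0)]:
    print(candidate, is_supported_python(candidate))
```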

opened by hiyamgh 7

Trying to test out ocropus from sources

Trying to just try out OCRopus on some files that tesseract fails (badly) on.

I download the zip file from github, and expand it into /tmp/ocropus. Then following README.md,

I make sure all the packages in PACKAGES are installed. This is a Fedora 30 system, so it uses dnf, not apt-get, but all the packages are there.

I use wget to get 83826134 Nov 2 2014 en-default.pyrnn.gz

I then try to move to models/:

/tmp/ocropus > mv en-default.pyrnn.gz models/
mv: cannot move 'en-default.pyrnn.gz' to 'models/': Not a directory

so to correct this, I:

/tmp/ocropus > mkdir models
models created
/tmp/ocropus > mv en-default.pyrnn.gz models/
/tmp/ocropus > ls models
83826134 en-default.pyrnn.gz

I don't want to install in /usr/bin, since I just want to try it, but let's let it go to see what happens:

python setup.py install
running install
running build
running build_py
error: package directory 'ocrolib' does not exist

Another missing directory, so mkdir ocrolib, and try again

Now we get, in a much longer set of messages:

package init file 'ocrolib/__init__.py' not found (or not a regular file) ...

warning: install_lib: 'build/lib' does not exist -- no Python modules to install

and finally:

copying build/scripts-2.7/ocropus-gated-train -> /usr/bin
error: [Errno 13] Permission denied: '/usr/bin/ocropus-gated-train'

Now trying to test,

/tmp/ocropus > ./run-test
Traceback (most recent call last):
  File "./ocropus-nlbin", line 15, in <module>
    import ocrolib
ImportError: No module named ocrolib

but we have a directory ocrolib/, but it is empty.

Possible Solution

Your Environment

Python version: Python 2.7.17

Git revision of ocropy: fatal: not a git repository (or any of the parent directories): .git

Operating System and version: Fedora Linux 30
opened by crazylyle 0

Releases(v1.3.3)

v1.3.3(Dec 16, 2017)
Fix:

Version numbers weren't updated in hocr output and setup.py in v1.3.2 #270

v1.3.2(Dec 16, 2017)
Added:

Add -f/--file option to ocropus-rtrain to read input filenames from a file #275

Fixed:

Do not add $datarootdir/ocropus to model search path in windows #268

Changed:

Use numpy functions instead of C implementations of sumprod/sumouter #265 #276

Code and docs for on-the-fly compilation of C code removed #274

Remove unused lru function annotation #273

v1.3.1(Dec 9, 2017)
Python coding:

Standardize Imports - Part I, #176 #206

split functions from nlbin #244

Other features:

Clip exponential in ffunc to avoid overflow #201

Ignored empty lines in fonts list file #233

Change checks for write_page_segmentation #220 (allow also small images for segmentation)

Expand tests for coverage and CI

Bugfixing: #251, #252

v1.3.0(Dec 9, 2017)
Testing, continuous integration:

Test page workflow and confidence measure, update coverage test #145

circle.yml file for Circle CI testing, run-test-ci #149

Travis CI builds #37

Unit tests #209

Python coding:

Cleanup imports in common.py, lstm.py, extract exceptions #154

py3k: Use print function instead of statement #155

py3k: Use new-style exceptions. #175

Other features:

Fix behaviour of maxcolseps parameter in ocropus-gpageseg #172

Update characters for training #188

Print summaries to stdout instead of stderr #170

Bugfixing: #133, #131, #179, #218 Cleanup: #180, #207, #181, #216, #219 Documentation: #185, #193, #194, #196, #205, #217

v1.2.0(Dec 9, 2017)
Improve installation process: #108, #110, #111, #117, #146, #148, #152

Add new option --probabilities in ocropus-rpred #135

Fix and improve hocr metadata: #105, #160

Bugfixing: #103, #123, #140 Cleanup: #84, #124, #143, #150, #156 Documentation: #120

v1.1.1(Dec 9, 2017)
New feature:

Added better print methods for distinguishing between info and error messages #53

Bugfixing and cleanup: #40, #43, #44, #75, #76

Expand documentations: #34, #39

v1.1.0(Dec 9, 2017)
New features:

Connect to CLSTM training and recognition

Added ocropus-lpred

Added ocropus-ltrain

Added ocropus-dewarp

Added ocropus-linegen

Added Apache license

Allow "extract" outside original image bounds #19

Documentation:

Add solution for OS X (clang) #28

Add instructions for installing ocropy into a virtualenv #29

Reorganization:

Updated download path for models

moved ocropus-gtedit back to main directory

Bugfixing

v1.0(Nov 2, 2014)

A cleaned up version of OCRopy with everything but the new RNN recognizer retired.

Library files that aren't needed anymore have been removed (actually, moved into OLD for the time being).

The installation process has been simplified.

A simple example of training has been added.
v0.8.1(Nov 1, 2014)

Classic ocropy with a few smallish fixes to make the test cases work after removing files.
v0.8.0(Nov 1, 2014)

This release contains the segmenting recognizer, language modeling, beam search, tree-VQ recognizer, and LSTM recognizer, plus character database editing tools. This was part of the OCRopus release as of late 2013.

Owner

OCRopus

The OCRopus OCR System and Related Software

GitHub

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

Open Semantic Search https://opensemanticsearch.org Integrated search server, ETL framework for document processing (crawling, text extraction, text a

684 Jan 6, 2023

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

English | 简体中文 Introduction PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and a

27.5k Jan 8, 2023

It is an image OCR tool using the Tesseract-OCR engine with the pytesseract package and has a GUI.

OCR-Tool It is an image OCR tool made in Python using the Tesseract-OCR engine with the pytesseract package and has a GUI. This is my second ever pytho

4 Jul 11, 2022

Indonesian ID Card OCR using tesseract OCR

KTP OCR Indonesian ID Card OCR using tesseract OCR KTP OCR is python-flask with tesseract web application to convert Indonesian ID Card to text / JSON

5 Dec 6, 2021

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-tools About About the code Installation System-wide with pip System-wide from source virtualenv Available Programs hocr-check -- check the hOCR f

285 Dec 8, 2022

A simple document layout analysis using Python-OpenCV

Run the application: python main.py *Note: For first time running the application, create a folder named "output". The application is a simple documen

109 Dec 12, 2022

Document Layout Analysis Projects

Layout_Analysis Introduction This is an implementation of RLSA and X-Y Cut with OpenCV Dependencies OpenCV 3.0+ How to use Compile with g++ : g++ -std

22 Dec 8, 2022

Document Layout Analysis

Eynollah Document Layout Analysis Introduction This tool performs document layout analysis (segmentation) from image data and returns the results as P

198 Dec 29, 2022

CTPN + DenseNet + CTC based end-to-end Chinese OCR implemented using tensorflow and keras

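Introduction: End-to-end detection and recognition of variable-length Chinese text, implemented with Tensorflow and Keras. Text detection: CTPN. Text recognition: DenseNet + CTC. Environment setup: sh setup.sh. Note: before running in a CPU environment, comment out the "for gpu" section and uncomment the "for cpu" section. Demo: place test images in test_images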

2.6k Dec 29, 2022

Tensorflow-based CNN+LSTM trained with CTC-loss for OCR

Overview This collection demonstrates how to construct and train a deep, bidirectional stacked LSTM using CNN features as input with CTC loss to perfo

489 Dec 21, 2022

CNN+LSTM+CTC based OCR implemented using tensorflow.

CNN_LSTM_CTC_Tensorflow CNN+LSTM+CTC based OCR(Optical Character Recognition) implemented using tensorflow. Note: there is No restriction on the numbe

356 Dec 8, 2022

Repository collecting all the submodules for the new PyTorch-based OCR System.

OCRopus3 is being replaced by OCRopus4, which is a rewrite using PyTorch 1.7; release should be soonish. Please check github.com/tmbdev/ocropus for up

138 Dec 9, 2022

Visual Attention based OCR

Attention-OCR Authors: Qi Guo and Yuntian Deng Visual Attention based OCR. The model first runs a sliding CNN on the image (images are resized to hei

1.1k Jan 2, 2023

A Screen Translator/OCR Translator made using Python and Tesseract; the user interface is made using Tkinter. All code is written in Python.

About An OCR translator tool. Made by me by utilizing Tesseract, compiled to .exe using pyinstaller. I made this program to learn more about python. I

41 Dec 30, 2022

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

doc2text doc2text extracts higher quality text by fixing common scan errors Developing text corpora can be a massive pain in the butt. Much of the tex

1.3k Jan 4, 2023

Python tool that takes the OCR.space JSON output as input and draws a text overlay on top of the image.

OCR.space OCR Result Checker => Draw OCR overlay on top of image Python tool that takes the OCR.space JSON output as input, and draws an overlay on to

4 Oct 18, 2022

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

EasyOCR Ready-to-use OCR with 80+ languages supported including Chinese, Japanese, Korean and Thai. What's new 1 February 2021 - Version 1.2.3 Add set

16.7k Jan 3, 2023

A Python wrapper for the tesseract-ocr API

tesserocr A simple, Pillow-friendly, wrapper around the tesseract-ocr API for Optical Character Recognition (OCR). tesserocr integrates directly with

1.7k Dec 31, 2022
