Awesome OCR

This list contains links to great software tools, libraries, and literature related to Optical Character Recognition (OCR).

Contributions are welcome, as is feedback.

Software
- OCR engines
- Older and possibly abandoned OCR engines
- OCR file formats
  - hOCR
  - ALTO XML
  - TEI
  - PAGE XML
- OCR CLI
- OCR GUI
- OCR Preprocessing
- OCR as a Service
- OCR evaluation
- OCR libraries by programming language
  - Go
  - Java
  - .Net
  - Object Pascal
  - PHP
  - Python
  - Javascript
  - Ruby
  - Rust
  - R
  - Swift
- OCR training tools
Datasets
- Ground Truth
Literature

Software

OCR engines

tesseract - The definitive Open Source OCR engine Apache 2.0
EasyOCR - OCR engine built on PyTorch by JaidedAI, Apache 2.0
ocropus - OCR engine based on LSTM, Apache 2.0
ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
kraken - Ocropus fork with sane defaults
gocr - OCR engine under the GNU General Public License, led by Joerg Schulenburg.
Ocrad - The GNU OCR. GPL
ocular - Machine-learning OCR for historic documents
SwiftOCR - fast and simple OCR library written in Swift
attention-ocr - OCR engine using visual attention mechanisms
RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
Calamari - OCR Engine based on OCRopy and Kraken

Older and possibly abandoned OCR engines

Clara OCR - Open source OCR in C GPL
Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
Eye - an experimental Java OCR (image-to-text) application
kognition - An omnifont OCR software for KDE
OCRchie - Modular Optical Character Recognition Software
ocre - o.c.r. easy
xplab - A GTK 2 tool for pattern matching
hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article) GPL

OCR file formats

hOCR

hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
hocr-spec - hOCR 1.2 specification
ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
hocr-parser - hOCR Specification Python Parser
hOCRTools - hOCR to ALTO conversion XSLT
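For orientation: hOCR stores layout data in the `title` attribute of ordinary HTML elements, for example `bbox x0 y0 x1 y1` on word-level `ocrx_word` spans. The following is a minimal sketch using only the Python standard library. The sample markup and the names `HocrWordParser`, `parse_bbox` and `extract_words` are illustrative assumptions, not part of any tool listed above, and the sketch assumes that word spans contain no nested spans.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return the four bbox integers from an hOCR title string, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from elements with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # Words without a bbox property are skipped (bbox stays None).
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span has no nested <span> children.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical sample input, written by hand for this sketch.
SAMPLE = """<span class='ocr_line' title='bbox 36 92 580 120'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 109 92 190 116; x_wconf 91'>world</span>
</span>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

For real-world files, the tools listed above are a better starting point than this sketch, which ignores most of the hOCR vocabulary.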

ALTO XML

ALTO XML Schema - XML Schema and development of the ALTO XML format
ALTO XML Documentation - Documentation and use cases for ALTO
alto-tools - Various tools to work with ALTO files, Python
AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML

TEI

TEI-OCR - TEI customization for OCR generated layout and content information
TEI SIG on Libraries - Best Practices for TEI in Libraries
GDZ - METS/TEI-based GDZ document format

PAGE XML

PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
py-pagexml - Python library for handling PAGE XML and OPF files.

OCR CLI

OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
Ocrocis - Project manager interface for Ocropy, see also external project homepage
tesseract-recognize - Tesseract-based tool that outputs result in Page XML format (docker image).

OCR GUI

moz-hocr-editor - Firefox Addon for editing hOCR files Discontinued
qt-box-editor - Qt4 editor of tesseract-ocr box files.
ocr-gt-tools - Client-Server application for editing OCR ground truth.
Paperwork - Using scanners and OCR to grep paper documents the easy way.
Paperless - Scan, index, and archive all of your paper documents.
gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
VietOCR - A Java/.NET GUI frontend for Tesseract OCR engine, including jTessBoxEditor a graphical Tesseract box data editor
PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and server docker-based versions.

OCR Preprocessing

NoiseRemove.java in MathOCR - Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, S.J. Perantonis
binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
binarizewolfjolion - Comparison of binarization algorithms. Blog post
crop_morphology.py in oldnyc - Cropping a page to just the text block
Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
Fred's ImageMagick script textcleaner - Processes a scanned document of text to clean the text background
localcontrast - Fast O(1) local contrast optimization
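Several entries above deal with binarization. As a rough illustration of the idea, Sauvola-style local thresholding computes a per-pixel threshold `T = m * (1 + k * (s / R - 1))` from the mean `m` and standard deviation `s` of a window around the pixel. The sketch below is a naive pure-Python version and is not taken from any of the listed projects. The function name, the list-of-rows image format, and the defaults (`k=0.5`, `r=128`, which are commonly cited values) are assumptions for illustration. Practical implementations usually use integral images instead of rescanning the window for every pixel.

```python
from statistics import fmean, pstdev


def sauvola_binarize(image, window=15, k=0.5, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    Returns rows of booleans, True marking foreground (dark) pixels.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the local window, clipped at the image borders.
            neighbours = [
                image[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = fmean(neighbours)
            std = pstdev(neighbours)
            # Sauvola-style threshold: lower in flat regions, higher near edges.
            threshold = mean * (1 + k * (std / r - 1))
            row.append(image[y][x] <= threshold)
        result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical 5x5 test image: white page with one dark pixel.
    page = [[255] * 5 for _ in range(5)]
    page[2][2] = 0
    for line in sauvola_binarize(page, window=3):
        print("".join("#" if fg else "." for fg in line))
```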

OCR as a Service

Open OCR - Run Tesseract in Docker containers
tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
docker-ocropy - A Docker container for running the ocropy OCR system.
ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
nidaba - An expandable and scalable OCR pipeline
gamera - A meta-framework for building document processing applications, e.g. OCR
ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker - Run the ocrad OCR engine in a docker container
kraken-docker - Run the kraken OCR engine in a docker container
Konfuzio - Free online OCR for up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
ocr.space - Free Online OCR and OCR API by @a9t9 based on Tesseract (code is not open)
OCR4all - Provides OCR services through web applications. Included Projects: LAREX, OCRopus, calamari and nashi.

OCR evaluation

ISRI OCR Evaluation Tools with a User Guide from 1996
- isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)
- ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)
ocrevalUAtion - Cross-format evaluation, CLI and GUI
ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
quack - Quality-Assurance-tool for scans with corresponding ALTO-files
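A common metric in OCR evaluation is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below is a plain-Python illustration under that definition. The function names are hypothetical, and it does not reproduce the exact method of any tool listed above.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalized by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # One substituted character out of seven.
    print(character_error_rate("Fraktur", "Frattur"))
```

Note that with this normalization the CER can exceed 1.0 when the OCR output is much longer than the ground truth.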

OCR libraries by programming language

Go

gosseract - Golang OCR library, wrapping Tesseract-ocr.

Java

Tess4J - Java Native Access bindings to Tesseract.
tess-two - Tools for compiling Tesseract on Android and Java API.

.Net

tesseract for .net - A .Net wrapper for tesseract-ocr.

Object Pascal

TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.

PHP

Tesseract OCR for PHP - Tesseract PHP bindings.

Python

pytesseract - A Python wrapper for Google Tesseract.
pyocr - A Python wrapper for Tesseract and Cuneiform.
ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
tesserocr - A Python wrapper for the tesseract-ocr API

Javascript

ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
gocr.js - Javascript port (emscripten) of gocr
ocrad.js - Javascript port (emscripten) of ocrad
tesseract.js - Javascript port (emscripten) of Tesseract
node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.

Ruby

rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
ocr_space - API wrapper for the free OCR service ocr.space. Includes a CLI.

Rust

tesseract.rs - Rust bindings for tesseract OCR.
leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.

R

tesseract - R bindings for tesseract OCR.

Swift

Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.

OCR training tools

glyph-miner - A system for extracting glyphs from early typeset prints
ocrodeg - Document image degradation for OCR data augmentation

Datasets

Ground Truth

archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via archiscribe CC-BY 4.0
CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for PoCoTo

Rescribe - Transcriptions of Caroline Minuscule Manuscripts PDM 1.0

CLTK - Corpora from Classical Language Toolkit PDM 1.0
DIVA-HisDB - 150 pages (PAGE XML) of three medieval manuscripts CC-BY-NC 3.0
EarlyPrintedBooks - ~8,800 lines from several early printed books CC-BY-NC-SA 4.0
EEBO-TCP - 25,363 EEBO documents transcribed by TCP PDM 1.0
ECCO-TCP - 2,188 ECCO documents transcribed by TCP PDM 1.0
eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by eMOP PDM 1.0
Evans-TCP - 4,977 Evans documents transcribed by TCP
FDHN - Finnish Digitised Historical Newspapers, Paper, (free) registration required, Terms of Use
FROC-MSS - 4 Old French Medieval Manuscripts CC-BY 4.0
GERMANA - 764 Spanish manuscript pages, (free) registration required non-commercial use only
GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin CC-BY 4.0
imagessan - Sanskrit images & ground truth (Devanagari script)
IMPACT-BHL - 2,418 pages (PAGE XML) from the Biodiversity Heritage Library, XML@GitHub CC-BY 3.0
IMPACT-BL - 294 pages (PAGE XML) from the British Library, (free) registration required PDM 1.0
IMPACT-BNE - 215 pages (PAGE XML) from the National Library of Spain, (free) registration required, XML@GitHub CC-BY-NC-SA 4.0
IMPACT-BNF - 151 pages (PAGE XML) from the National Library of France, (free) registration required CC-BY-NC-SA 4.0
IMPACT-KB - 142 pages (PAGE XML) from the National Library of the Netherlands CC-BY 4.0
IMPACT-NKC - 187 pages (PAGE XML) from the Czech National Library, (free) registration required CC-BY-NC-SA 4.0
IMPACT-NLB - 19 pages (PAGE XML) from the National Library of Bulgaria, (free) registration required CC-BY-NC-ND 4.0
IMPACT-NUK - 209 pages (PAGE XML) from the National Library of Slovenia, (free) registration required CC-BY-NC-SA 4.0
IMPACT-PSNC - 478 pages (PAGE XML) from four Polish digital libraries, XML@GitHub CC-BY 3.0
LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
MJSynth - 9 million synthetic images covering 90k English words
OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via Text+Berg digital CC-BY 4.0
OCR-D - 180 pages (PAGE XML) of German historical prints from OCR-D CC-BY-SA 4.0
OCR_GS_Data - Double-checked Arabic Gold Standard from OpenITI
old-books - 322 old books from Project Gutenberg GPL 3.0
PRImA-ENP - 528 pages (PAGE XML) of historic newspapers from Europeana Newspapers, (free) registration required PDM 1.0
RODRIGO - 853 Spanish manuscript pages, (free) registration required non-commercial use only
Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch

Literature

OCR-related publication and link lists

IMPACT: Tools for text digitisation - List of tools and software projects, some related to OCR
OCR-D - List of OCR-related academic articles in the context of the OCR-D project. 🇩🇪
Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR [and Deep Learning] by @handong1587
Ocropus Wiki: Publications

Blog Posts and Tutorials

Tesseract Blends Old and New OCR Technology (2016) @theraysmith
- Tutorial@DAS2016, Updated "What You Always Wanted to Know" slides
What You Always Wanted To Know About Tesseract (2014) @theraysmith
- Tutorial@DAS2014, includes demos
Extracting text from an image using Ocropus (2015)
Training an Ocropus OCR model (2015) @danvk
Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
OCRopus (2016) @jze
- mostly on column separation in ocropus
10 Tips for making your OCR project succeed (2013) @cneud
- general things to consider for OCR projects
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
Extracting Text from PDFs; Doing OCR; all within R @shawngraham
- How to work with OCR from PDFs in the R programming environment
Tutorial: Command-line OCR on a Mac @bmschmidt
- Tutorial on how to run tesseract on Mac OS X
Practical Experience with OCRopus Model Training (2016) @jze
Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps
- Tutorial on applying OCR to medieval manuscripts with OCRopy
Optimizing Binarization for OCRopus (2017) @jze
Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense
How Can I OCR My Dictionary? (2016) @JessedeDoes
"Needlessly complex" blog (2016) @mzucker. Several image processing how-tos (Python based), particularly:
- Page dewarping (code)
- Compressing and enhancing hand-written notes (code)
- Unprojecting text with ellipses (code)
(Open-Source-)OCR-Workflows (2017) @wrznr 🇩🇪 overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the @OCR-D project.
A gentle introduction to OCR (2018) @shgidi
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (2019) @eliaskreyenbuehl 🇩🇪 A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts.

OCR Showcases

abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR - A printed scientific document recognition system, pre-alpha

Academic articles
