Links to awesome OCR projects

Overview

Awesome OCR

This list contains links to great software tools, libraries, and literature related to Optical Character Recognition (OCR).

Contributions are welcome, as is feedback.

Software

OCR engines

  • tesseract - The definitive open-source OCR engine, Apache 2.0
  • EasyOCR - OCR engine built on PyTorch by JaidedAI, Apache 2.0
  • ocropus - OCR engine based on LSTM, Apache 2.0
  • ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
  • kraken - Ocropus fork with sane defaults
  • gocr - OCR engine under the GNU General Public License, led by Joerg Schulenburg.
  • Ocrad - The GNU OCR, GPL
  • ocular - Machine-learning OCR for historic documents
  • SwiftOCR - fast and simple OCR library written in Swift
  • attention-ocr - OCR engine using visual attention mechanisms
  • RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
  • simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
  • Calamari - OCR Engine based on OCRopy and Kraken

Older and possibly abandoned OCR engines

  • Clara OCR - Open-source OCR in C, GPL
  • Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
  • Eye - an experimental Java OCR (image-to-text) application
  • kognition - An omnifont OCR software for KDE
  • OCRchie - Modular Optical Character Recognition Software
  • ocre - o.c.r. easy
  • xplab - A GTK 2 tool for pattern matching
  • hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article), GPL

OCR file formats

hOCR

  • hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
  • hocr-spec - hOCR 1.2 specification
  • ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
  • hocr-parser - hOCR Specification Python Parser
  • hOCRTools - hOCR to ALTO conversion XSLT
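
For readers new to the format, here is a minimal sketch of reading word boxes from hOCR using only the Python standard library. It relies on two conventions from the hOCR specification: words are elements with the class `ocrx_word`, and the bounding box is stored in the `title` attribute as `bbox x0 y0 x1 y1`. The sample fragment, the `WordExtractor` class, and the `extract_words` helper are hypothetical names made up for this illustration; none of them come from the tools listed above, which handle far more of the specification.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a tiny hOCR fragment written for this sketch.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 200 100'>
  <span class='ocr_line' title='bbox 10 10 190 40'>
    <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 100 10 190 40; x_wconf 91'>world</span>
  </span>
</div>
"""

# Matches the "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordExtractor(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            # Reset on every word so a stale box is never reused.
            self._bbox = tuple(int(v) for v in match.groups()) if match else None

    def handle_data(self, data):
        # The first non-blank text after a word's start tag is its content.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    """Return a list of (text, (x0, y0, x1, y1)) tuples."""
    parser = WordExtractor()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

The sketch ignores confidence values, line grouping, and nested markup inside words; for real documents, one of the dedicated hOCR tools in this section is likely the better choice.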

ALTO XML

TEI

  • TEI-OCR - TEI customization for OCR generated layout and content information
  • TEI SIG on Libraries - Best Practices for TEI in Libraries
  • GDZ - METS/TEI-based GDZ document format

PAGE XML

  • PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
  • omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
  • py-pagexml - Python library for handling PAGE XML and OPF files.

OCR CLI

  • OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
  • Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
  • Ocrocis - Project manager interface for Ocropy, see also external project homepage
  • tesseract-recognize - Tesseract-based tool that outputs result in Page XML format (docker image).

OCR GUI

  • moz-hocr-editor - Firefox Addon for editing hOCR files (discontinued)
  • qt-box-editor - QT4 editor of tesseract-ocr box files.
  • ocr-gt-tools - Client-Server application for editing OCR ground truth.
  • Paperwork - Using scanners and OCR to grep paper documents the easy way.
  • Paperless - Scan, index, and archive all of your paper documents.
  • gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
  • VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
  • PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
  • OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
  • PRImA PAGE Viewer - Java-based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
  • LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
  • archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
  • nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and server docker-based versions.

OCR Preprocessing

OCR as a Service

OCR evaluation

OCR libraries by programming language

Go

  • gosseract - Golang OCR library, wrapping Tesseract-ocr.

Java

  • Tess4J - Java Native Access bindings to Tesseract.
  • tess-two - Tools for compiling Tesseract on Android and Java API.

.Net

Object Pascal

PHP

Python

  • pytesseract - A Python wrapper for Google Tesseract.
  • pyocr - A Python wrapper for Tesseract and Cuneiform.
  • ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
  • tesserocr - A Python wrapper for the tesseract-ocr API

JavaScript

  • ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
  • gocr.js - JavaScript port (emscripten) of gocr
  • ocrad.js - JavaScript port (emscripten) of ocrad
  • tesseract.js - JavaScript port (emscripten) of Tesseract
  • node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
  • node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.

Ruby

  • rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
  • ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
  • ocr_space - API wrapper for free ocr service ocr.space. Includes CLI

Rust

  • tesseract.rs - Rust bindings for tesseract OCR.
  • leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.

R

Swift

  • Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
  • SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.

OCR training tools

  • glyph-miner - A system for extracting glyphs from early typeset prints
  • ocrodeg - Document image degradation for OCR data augmentation

Datasets

Ground Truth

  • Rescribe - Transcriptions of Caroline Minuscule Manuscripts, PDM 1.0

Literature

OCR-related publication and link lists

Blog Posts and Tutorials

OCR Showcases

  • abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
  • cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
  • MathOCR - A printed scientific document recognition system, pre-alpha

Academic articles

2011 and before

2012

2013

2014

2015

2016

2017

2018

Comments
  • (Open-Source-)OCR-Workflows

    Kay-Michael Würzner @wrznr (Open-Source-)OCR-Workflows https://edoc.bbaw.de/frontdoor/index/index/docId/2786

    I don't know how awesome it is, since I don't know German :-)

    opened by amitdo 3
  • Image pre-processing scripts

    There are some nice Python scripts for various image pre-processing tasks described at https://mzucker.github.io/, e.g. "Unprojecting text with ellipses", "Compressing and enhancing hand-written notes", and "Page dewarping". The blog posts contain links to the scripts themselves in the author's GitHub repository.

    opened by DavidUnderdown 3
  • Add post in "Blog Posts and Tutorials"

    Title: Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (What can I rely on? Working with digitized sources, part 1: OCR)
    Author: eliaskreyenbuehl / Universitätsbibliothek Basel
    Year: 2019
    Language: German
    URL: https://blog.ub.unibas.ch/2019/06/04/worauf-kann-ich-mich-verlassen-arbeiten-mit-digitalisierten-quellen-teil-1-ocr/
    Description: A reflection on and criticism of OCR quality and OCR pitfalls in Fraktur fonts.

    opened by diegosiqueir4 2
  • ProGanSR [achieving super resolution in 1 second]

    1 second super-resolution, we are in the golden age. http://igl.ethz.ch/projects/prosr/ https://github.com/fperazzi/proSR https://www.youtube.com/watch?v=HvH0b9K_Iro

    opened by ghost 2
  • ocropus - OCR engine based on CLSTM, Apache 2.0

    ocropus - OCR engine based on CLSTM, Apache 2.0

    It's the other way around. 'ocropy' was out before 'clstm' :-)

    Maybe the intention was 'LSTM' and not 'CLSTM'.

    Using 'ocropus' and linking to ocropy might confuse anyone that does not know ocropy's history.

    opened by amitdo 2
  • TeluguOCR + Chamanti OCR

    Banti Telugu OCR: https://github.com/TeluguOCR/banti_telugu_ocr "This framework relies on the ability of a segmentation algorithm to break the text in to glyphs."

    Chamanti OCR: https://github.com/rakeshvar/chamanti_ocr "It will not rely on segmentation algorithms (at the glyph level), making it ideal for highly agglutinative scripts like Arabic, Devanagari etc. We will be starting with Telugu however."

    It is hard for me to judge how well the recognition works, because I don't understand Telugu and I haven't found any results, discussions, or blog posts that I can read. But the project looks very promising, IMO, from a technical point of view. CC @rakeshvar

    opened by zuphilip 2
  • Add Attention-OCR

    Just stumbled upon this. I'm not quite sure if it really fits into the "Engines" section, since it's pretty bare-bones. Judging from the README, it's also targeted at natural scene images, but reading through the architecture I don't see any reason why it shouldn't work for printed text as well. Unfortunately I was unable to find a corresponding paper that could shed more light on the inner workings, but since the repository is pretty fresh (~ 3 weeks at the time of this PR), maybe it's still a work in progress and the publication will follow at some later point.

    opened by jbaiter 2
  • Add OCR4all to OCR as a Service

    OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank in Applied Sciences (2019). 9(22).

    https://doi.org/10.3390/app9224853

    opened by diegosiqueir4 1
  • Add few Ground Truth Repositories

    I have added small repositories for ground truth I have worked on, along with one found while searching for 19th century GT sets. I would like to add https://github.com/gesaretto/paleo_ocr but the license is not clear (I have opened an issue for this). I think the CLTK link is a bit of a stretch, as they do not provide any ground truth for OCR (to my knowledge).

    opened by PonteIneptique 1
  • Update Attention-OCR

    The Attention-OCR library that's linked to in this repo won't run on any of the recent TensorFlow versions or on ML Engine. I maintain a fork that's being kept up-to-date and has some wrappers around the original model to provide better tooling. If you don't mind, it would make more sense to link to the fork instead. The original model is, of course, attributed.

    opened by emedvedev 1
  • Add normcap

    Add normcap.

    OCR powered screen-capture tool to capture information instead of images.

    https://github.com/dynobo/normcap

    (I couldn't figure where best to put it on the list, otherwise this would be a PR.)

    opened by Shayan-To 0
  • Require the IIT-CDIP Test Collection

    Dear kba,

    Hello, I am Junfeng, a student from the University of Tokyo.

    I am wondering whether you have the IIT-CDIP Test Collection dataset. If so, could you please share it via Google Drive for academic use?

    Thank you!

    Best regards, Junfeng

    opened by Coldog2333 1
  • Order the ground truth section by type?

    Hi there! I found out that a GT section was added while I was tempted to create my own awesome list. One thing that I think would be great is categorizing this section a little more (Manuscript / Early print / Modern / Contemporary?). That would probably be a better way to browse these data.

    opened by PonteIneptique 1
Owner
Konstantin Baierer