The CIS OCR PostCorrectionTool

Overview

The CIS OCR Post Correction Tool PoCoTo

This repository contains the source code for the Java-based PoCoTo client, which enables fast, interactive batch correction of complete series of OCR errors in OCR'ed historical documents. For a detailed description see the PoCoTo Manual.

The latest compiled binary can be downloaded here.

References

PoCoTo was originally written by Thorsten Vobl as part of his master's thesis in computational linguistics at CIS during the IMPACT project.

It has been further developed as a CLARIN-D curation project (Kurationsprojekt) by Florian Fink and Uwe Springmann at CIS.

Its underlying technology is described in the following publication:

Vobl, Thorsten, Annette Gotscharek, Uli Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2014. “PoCoTo - an Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts.” In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 57–61. DATeCH ’14. New York, NY, USA: ACM. doi:10.1145/2595188.2595197.

Comments
  • Time busy before ready to edit ocr


    I have a book with ABBYY (v11) OCR which takes approx. 4 minutes on a modern CPU before a page is ready to be edited. The ocrcorrection process is busy during this period. Example on request. Kind regards, Barth @ UB Uni Heidelberg

    opened by jbarth-ubhd 5
  • Java version required for PoCoTo


    PoCoTo is based on a NetBeans platform from approx. 2014.

    In my experience, newer Java releases do not work with PoCoTo.

    • Oracle Java 7u80 does not allow »Check for Updates« (possibly TLS version?)
    • Oracle Java 8u5 seems to work
    • openjdk-8-jre version 8u222-b10-1ubuntu1~18.04.1 seems to work, too [line added 2019-08-09]
    • newer Java versions (Oracle, openjdk) (9 ... 12) do not work
    opened by jbarth-ubhd 1
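The compatibility report above could be turned into a startup warning. The following is a minimal sketch, not code from PoCoTo: the class `JavaVersionCheck` and its methods are hypothetical, and it assumes that only Java 8 counts as working, as the issue reports. It reads the standard `java.specification.version` system property, which is `1.8` on Java 8 and a plain number such as `11` on later releases.

```java
// Hypothetical helper, not part of PoCoTo: checks whether the running
// Java release is one that the issue above reports as working.
final class JavaVersionCheck {

    /** Maps "1.7" -> 7, "1.8" -> 8, "9" -> 9, "11" -> 11. */
    static int parseMajor(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Before Java 9 the scheme was "1.x"; the major version is x.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    /** Assumption taken from the issue: only Java 8 is reported to work. */
    static boolean isReportedWorking(String specVersion) {
        return parseMajor(specVersion) == 8;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.specification.version");
        if (!isReportedWorking(version)) {
            System.err.println("Warning: running on Java " + version
                    + "; only Java 8 is reported to work with PoCoTo.");
        }
    }
}
```

Running the class with the JVM that will launch PoCoTo prints a warning on anything other than Java 8 and stays silent otherwise.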
  • export problem with large files


    exporting a large file (more than 3600 subimages) leads to a timeout:

    [INFO] 2016-10-05 20:33:26,440 - [java.lang.Class] PoCoTo version: 16.01.3
    [INFO] 2016-10-05 20:33:26,451 - [java.lang.Class] Setup logging base dir '/home/uvius/.ocrcorrection/dev'
    [INFO] 2016-10-05 20:33:27,462 - [jav.gui.main.MainController] Programmstart
    [INFO] 2016-10-05 20:58:21,884 - [jav.gui.main.MainController] MainTopComponent # Document changed
    [INFO] 2016-10-05 20:58:23,412 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__000__r1__TextRegion__heading.tif, 0.400000)
    [INFO] 2016-10-05 20:58:30,539 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__001__r2__TextRegion__page-number.tif, 0.400000)
    [INFO] 2016-10-05 20:58:37,466 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__002__r5__TextRegion__paragraph.tif, 0.400000)
    [INFO] 2016-10-05 20:58:43,084 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__003__r8__TextRegion__heading.tif, 0.400000)
    [INFO] 2016-10-05 20:58:46,199 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__004__r10__TextRegion__paragraph.tif, 0.400000)
    [INFO] 2016-10-05 20:58:50,115 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__005__r11__TextRegion__paragraph.tif, 0.400000)
    [INFO] 2016-10-05 20:59:12,314 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0001__R0001r__ro357564684568544579089__006__r12__TextRegion__paragraph.tif, 0.400000)
    [INFO] 2016-10-05 20:59:20,706 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0002__R0001v__ro357564684568544579089__000__r1__TextRegion__heading.tif, 0.400000)
    [INFO] 2016-10-05 20:59:24,953 - [jav.gui.image.CompleteImageTopComponent] loadImage(/home/uvius/data/OCR/Kallimachos/Itf954/ocro-tif/DE-20__I_t_f_954__0002__R0001v__ro357564684568544579089__001__r2__TextRegion__paragraph.tif, 0.400000)
    [INFO] 2016-10-05 21:00:34,646 - [jav.gui.main.MainController] profiler_service_url: 'http://localhost:8080/axis2/services/ProfilerWebService'
    [INFO] 2016-10-05 21:00:42,644 - [jav.gui.main.MainController] profiler_service_url: 'http://localhost:8080/axis2/services/ProfilerWebService'
    [INFO] 2016-10-05 21:00:42,722 - [jav.gui.main.MainController$DocumentProfiler] exporting document ...
    [ERROR] 2016-10-05 21:01:16,643 - Exception:
    java.sql.SQLException: Login timeout
        at org.h2.jdbcx.JdbcConnectionPool.getConnection(JdbcConnectionPool.java:208)
        at jav.correctionBackend.Document.getTokenByIndex(Document.java:1351)
        at jav.correctionBackend.PageIterator.next(Document.java:2322)
        at jav.correctionBackend.PageIterator.next(Document.java:2274)
        at jav.correctionBackend.OcrXmlExporter.export(OcrXmlExporter.java:57)
        at jav.correctionBackend.Document.exportAsDocXML(Document.java:1604)
        at jav.gui.main.MainController$DocumentProfiler.run(MainController.java:1117)
        at jav.gui.main.MainController$DocumentProfiler.run(MainController.java:1066)
        at org.netbeans.modules.progress.ui.RunOffEDTImpl$ProgressBackgroundRunner.runBackground(RunOffEDTImpl.java:486)
        at org.netbeans.modules.progress.ui.AbstractWindowRunner.call(AbstractWindowRunner.java:108)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.openide.util.RequestProcessor$Task.run(RequestProcessor.java:1423)
        at org.openide.util.RequestProcessor$Processor.run(RequestProcessor.java:2033)
    [ERROR] 2016-10-05 21:01:16,646 - [jav.gui.main.MainController$DocumentProfiler] profiling error: null
    [ERROR] 2016-10-05 21:08:57,934 - [jav.correctionBackend.SpreadIndexDocument] SQLException: Login timeout
    
    opened by uvius 1
  • Add CodeQL workflow for GitHub code scanning


    Hi cisocrgroup/PoCoTo!

    This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    Questions? Check out the FAQ below!

    FAQ


    How often will the code scanning analysis run?

    By default, code scanning will trigger a scan with the CodeQL engine on the following events:

    • On every pull request — to flag up potential security problems for you to investigate before merging a PR.
    • On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.
    • Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

    What will this cost?

    Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

    What types of problems does CodeQL find?

    The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

    How do I upgrade my CodeQL engine?

    No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

    The analysis doesn’t seem to be working

    If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

    How do I disable LGTM.com?

    If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

    Which source code hosting platforms does code scanning support?

    GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

    How do I know this PR is legitimate?

    This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

    I have another question / how do I get in touch?

    Please join the discussion here to ask further questions and send us suggestions!

    opened by lgtm-com[bot] 0
  • update project platform to NetBeans 14


    I had to build it, because the 'latest' Linux release binaries did not run on Debian - see #14. Tried to change as little as possible - no refactoring for current NB/Java features.

    opened by nicodex 0
  • Not working, requires Profiler Server


    It should be pointed out more prominently that the program requires a profiler server backend and that the default profiler server is no longer in operation, which renders the application nonfunctional.

    On the OCR-D gitter chat it was suggested that https://hub.docker.com/r/ffink/profiler might provide a replacement.

    It would also be good to add information that the project has been retired in favor of https://github.com/cisocrgroup/pocoweb (if that's the case).

    Thank you.

    opened by cboulanger 0
  • Not starting in linux version


    Downloaded, wanted to try, here is what happened:

    ~/Downloads/ocrcorrection/bin$ ./ocrcorrection
    Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option PermSize; support was removed in 8.0
    Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.netbeans.ProxyURLStreamHandlerFactory (file:/home/me/Downloads/ocrcorrection/platform/lib/boot.jar) to field java.net.URL.handler
    WARNING: Please consider reporting this to the maintainers of org.netbeans.ProxyURLStreamHandlerFactory
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    
    

    Any ideas what could be tried? Thanks in advance. Actually I am not sure yet what this tool can achieve, just wanted to test it, if it would run here, ubuntu 18.04.

    opened by michaelsjackson 1
  • Abbyy XML: empty <charParams> → NullPointerException


    Some Abbyy11 releases (and perhaps others) sometimes generate XML with empty tags, e.g.

    <charParams l="282" t="100"... />
    

    This leads to

    [ERROR] 2019-08-08 09:15:10,351 - Exception:
    java.lang.NullPointerException
            at jav.correctionBackend.parser.AbbyyXmlChar.<init>(AbbyyXmlChar.java:24)
    

    see in repository: AbbyyXmlChar.java
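One way to tolerate such empty elements is to check for a missing text child before reading it. The sketch below is not the code in AbbyyXmlChar.java, whose line 24 is not shown here; it assumes that the exception comes from reading the text of a `<charParams>` element that has no child node. The class `CharParamsReader` and its methods are hypothetical names, and the parsing uses only the JDK's DOM API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not PoCoTo code: reads the characters of all
// <charParams> elements and skips elements that carry no text.
final class CharParamsReader {

    /** Returns the element's text, or "" for an empty <charParams ... />. */
    static String charOf(Element charParams) {
        Node child = charParams.getFirstChild(); // null if the element is empty
        if (child == null || child.getNodeValue() == null) {
            return "";
        }
        return child.getNodeValue();
    }

    /** Collects the non-empty characters in document order. */
    static List<String> readChars(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("charParams");
            List<String> chars = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String c = charOf((Element) nodes.item(i));
                if (!c.isEmpty()) {
                    chars.add(c); // empty elements are skipped, not fatal
                }
            }
            return chars;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }
}
```

Whether an empty element should be skipped, as here, or mapped to a placeholder token is a design decision that depends on how the rest of the parser aligns characters with their coordinates.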

    opened by jbarth-ubhd 0
  • This page has no tokens


    By mistake I reported the problem here: https://github.com/cisocrgroup/Resources as the issue #3 including the Linde133.xml file. The project contains the following files:

    user@pocoto:~$ ls -Rlh Linde4PoCoTo/
    Linde4PoCoTo/:
    total 92K
    drwxrwxrwx 2 user root 4.0K Jun 7 2014 img
    -rw-r--r-- 1 user user 1.1M Jun 11 11:43 Linde133.h2.db
    -rw-r--r-- 1 user user 100 Jun 11 11:43 Linde133.lock.db
    -rw-r--r-- 1 user user 415 Jun 9 09:06 Linde133.ocrproject
    -rw-r--r-- 1 user user 6.6K Jun 9 09:17 Linde133.trace.db
    drwxrwxrwx 2 user root 4.0K Jun 7 2014 xml

    Linde4PoCoTo/img:
    total 30M
    -rwxrwxrwx 1 user root 30M Jun 7 2014 Linde133.tif

    Linde4PoCoTo/xml:
    total 148K
    -rwxrwxrwx 1 user root 148K Jun 7 2014 Linde133.xml
    user@pocoto:~$

    opened by jsbien 0
Owner
CIS OCR Group
Software for OCR of historical documents from CIS, Munich