OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched




OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
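
For scripted use, the same invocation can be assembled programmatically. The following is a minimal Python sketch that uses only the flags shown above; the helper name build_ocrmypdf_command is hypothetical and not part of OCRmyPDF.

```python
def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",),
                           rotate_pages=False, deskew=False,
                           title=None, jobs=None, output_type=None):
    """Assemble an ocrmypdf argument list from the flags shown above."""
    # Multiple languages are joined with "+", as in "-l eng+fra".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    # Input and output files come last.
    return cmd + [input_pdf, output_pdf]


if __name__ == "__main__":
    command = build_ocrmypdf_command(
        "input_scanned.pdf", "output_searchable.pdf",
        languages=("eng", "fra"), deskew=True,
    )
    print(" ".join(command))
    # To run it for real (requires ocrmypdf on PATH):
    #   import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with titles or file names that contain spaces.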

See the release notes for details on the latest changes.

Main features

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Scales properly to handle files with thousands of pages
  • Battle-tested on millions of PDFs

For details: please consult the documentation.


I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:

  • Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
  • Or they did not handle accents and multilingual characters
  • Or they changed the resolution of the embedded images
  • Or they generated ridiculously large PDF files
  • Or they crashed when trying to OCR
  • Or they did not produce valid PDF files
  • On top of that, none of them produced PDF/A files (a format designed for long-term storage)

...so I decided to develop my own tool.


Linux, Windows, macOS and FreeBSD are supported. Docker images are also available.

Operating system              Install command
Debian, Ubuntu                apt install ocrmypdf
Windows Subsystem for Linux   apt install ocrmypdf
Fedora                        dnf install ocrmypdf
macOS                         brew install ocrmypdf
LinuxBrew                     brew install ocrmypdf
FreeBSD                       pkg install py37-ocrmypdf
Conda                         conda install ocrmypdf

For everyone else, see our documentation for installation steps.


OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for a quick response.


In addition to the required Python version (3.6+), OCRmyPDF requires external program installations of Ghostscript, Tesseract OCR, QPDF, and Leptonica. OCRmyPDF is pure Python, but uses CFFI to portably generate library bindings. OCRmyPDF works on pretty much everything: Linux, macOS, Windows and FreeBSD.
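
Before a first run, it can help to confirm that the external programs are reachable. The sketch below is a hypothetical helper, not part of OCRmyPDF. It assumes the executables are named gs, tesseract and qpdf, which is typical on Linux and macOS but may differ elsewhere, and it does not check Leptonica, which is a library rather than a program.

```python
import shutil


def missing_programs(programs=("gs", "tesseract", "qpdf"), which=shutil.which):
    """Return the names in `programs` that cannot be found on PATH.

    The executable names are assumptions; adjust them for your platform.
    """
    return [name for name in programs if which(name) is None]


if __name__ == "__main__":
    missing = missing_programs()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external programs were found.")
```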

Press & Media

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.


The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as noted in those files and the debian/copyright file. Most files in misc/ use the MIT license, and the documentation and test files are generally licensed under Creative Commons ShareAlike 4.0 (CC-BY-SA 4.0).


The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  • Improve user experience for Windows 10


    Describe the issue I've managed to run OCRmyPDF.exe on Windows 10 without WSL.

    To Reproduce I've made a fork and added some quick fixes in this commit: https://github.com/dibu28/OCRmyPDF/commit/543088e79e8649e968d02d8fd268123255607dc1

    Fixes are:

    1. In leptonica.py, the library name is liblept-5 instead of lept.
    2. In ghostscript.py:
       2.1) the executable name is gswin64c.exe instead of gs;
       2.2) NamedTemporaryFile doesn't work properly, and gs could not modify the temp file (access denied error). As a temporary workaround I'm adding "_1" to the temp file name and then removing the file. There could be a better way.
    3. In _pipeline.py and helpers.py, symlinking to the temp folder on Windows requires Admin privileges, so instead of symlinking I'm just copying files.
    4. In _sync.py, os.path.samefile returns an error: "OSError: [WinError 1] Incorrect function: 'nul'"

    So after those changes and installing the dependencies, it started to work from the command line like this: OCRmyPDF.exe input.pdf output.pdf

    Dependencies and binaries I'm using:
    https://www.python.org/ftp/python/3.7.5/python-3.7.5-amd64.exe
    https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe
    https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs950/gs950w64.exe
    https://github.com/qpdf/qpdf/releases/download/release-qpdf-9.0.2/qpdf-9.0.2-bin-msvc64.zip

    Add paths to the PATH variable:
    set PATH=%PATH%;C:\Program Files\Tesseract-OCR;
    set PATH=%PATH%;C:\Program Files\gs\gs9.50\bin;
    set PATH=%PATH%;C:\qpdf\qpdf-9.0.2-bin-msvc64\qpdf-9.0.2\bin;

    python setup.py build
    OCRmyPDF.exe input.pdf output.pdf

    Expected behavior Can we add some workarounds using conditions based on os type?
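
One way such OS-conditional workarounds could look is sketched below in Python. This is only an illustration of the idea, not code from OCRmyPDF or from the linked commit: the function names are hypothetical, and the library and executable names are the ones listed in the fixes above.

```python
import os
import shutil
import sys


def leptonica_library_name(platform=sys.platform):
    """Pick the Leptonica library name reported to work on each platform."""
    return "liblept-5" if platform == "win32" else "lept"


def ghostscript_executable(platform=sys.platform):
    """Pick the Ghostscript console executable name for each platform."""
    return "gswin64c" if platform == "win32" else "gs"


def link_or_copy(src, dst):
    """Symlink src to dst, falling back to a copy when symlinks are refused.

    On Windows, creating a symlink can fail with OSError when the process
    lacks the required privilege, so the copy acts as the workaround.
    """
    try:
        os.symlink(src, dst)
    except (OSError, NotImplementedError):
        shutil.copy2(src, dst)
```

Trying the symlink first and copying only on failure keeps the existing behavior on systems where symlinks work, at the cost of extra disk use where they do not.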


    • OS: Windows 10
    • OCRmyPDF Version: v9.0.5

    Additional context

    opened by dibu28 57
  • OSError: cannot load library 'C:\Program Files\Tesseract-OCR\liblept-5.dll': error 0x7e

    As in #631, I am getting the same error, except that instead of 0x7f I am getting 0x7e.
    I am using Python 3.9.2 64-bit, Windows 10 64-bit, and OCRmyPDF 12.5.0. I can't solve the problem the way #631 was solved, that is, by changing leptonica.py to open zlib.dll before liblept-5.dll.

    When I run ocrmypdf --help or ocrmypdf --version, it displays the same OSError.

    Does anyone know what to do? @jbarlow83
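
For context, the two hexadecimal values are standard Win32 system error codes, and they point to different failures. The Python sketch below is a hypothetical diagnostic helper, not part of OCRmyPDF. It only decodes the code in the message and does not fix the loading problem.

```python
import re

# Win32 system error codes commonly seen when loading a DLL.
LOAD_ERRORS = {
    0x7E: "ERROR_MOD_NOT_FOUND: the DLL, or a DLL it depends on, could not be found",
    0x7F: "ERROR_PROC_NOT_FOUND: a DLL was found but lacks an expected function",
}


def explain_load_error(message):
    """Extract the 'error 0x..' code from a load failure message and describe it."""
    match = re.search(r"error (0x[0-9a-fA-F]+)", message)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return LOAD_ERRORS.get(code, f"unrecognized error code {code:#x}")
```

Read this way, 0x7e suggests that liblept-5.dll or one of its dependencies is not on the DLL search path, whereas 0x7f in #631 suggests a mismatched dependency was loaded. That would be consistent with the load-order fix not helping here, although this is an interpretation rather than a confirmed diagnosis.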

    opened by meet1919 28
  • Add interword space option to HOCR pdf renderer

    This pull request adds a new advanced option --interword-spaces to OCRmyPDF to allow the hocr renderer to produce PDF output compatible with PDF.js and potentially other viewers that have difficulty detecting phrases, lines, and paragraphs in separately placed text layers. This new switch is a workaround for limitations of the PDF.js viewer described in https://github.com/jbarlow83/OCRmyPDF/issues/133.


    OCRmyPDF justifiably prioritizes the accurate placement of words on the text layer as individual glyphs. Most PDF viewers have heuristics that allow them to identify paragraphs, lines, and phrases while searching and to insert the correct inter-word spacing when copying and pasting. PDF.js has over 80 issues flagged with 4-text-selection and there have been a number of pull requests to address the issue that have apparently gotten bogged down with edge cases, performance concerns, and perhaps the inherent challenges of a pure Javascript and HTML approach to PDF rendering.


    The goal of this pull request is to add an unobtrusive option to OCRmyPDF to allow it to produce PDF.js compatible output for those that must support PDF.js as a business requirement. Specifically, this PR follows the code conventions by adding an advanced option --interword-spaces to the options parser and ensures this option is available to the hocrtransform.py renderer. When set to true, the HOCR renderer will add an additional space at the end of each text element before drawing it on the text layer. This option does not apply to other pdf renderers in OCRmyPDF, is turned off by default, and issues a warning if used without the --pdf-renderer hocr option also set.
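
The behavior described above can be sketched in a few lines of Python. This is an illustration only, not the code in this PR or in hocrtransform.py: the function names and the warning text are hypothetical.

```python
def check_interword_option(pdf_renderer, interword_spaces):
    """Return warning messages for option combinations that have no effect."""
    messages = []
    if interword_spaces and pdf_renderer != "hocr":
        # Hypothetical wording; the option only applies to the hocr renderer.
        messages.append(
            "--interword-spaces has no effect without --pdf-renderer hocr"
        )
    return messages


def text_for_layer(words, interword_spaces=False):
    """Return the strings to draw, with a trailing space per word if requested."""
    suffix = " " if interword_spaces else ""
    return [word + suffix for word in words]
```

With the option off, the output is identical to the input words, which matches the goal of leaving the default behavior unchanged.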


    This PR added a new section to the advanced documentation for the new option, a note on the 'hocr' renderer description about the option in the same file, and a note that this is available in the introduction where there is a relevant discussion of PDF as a layout format dependent on the viewer to interpret the structure of the document in terms of words, sentences, and paragraphs.


    We confirmed that existing tests that exercised this code continue to pass. We encountered some seemingly preexisting failures in other tests. We explored the option of adding additional tests to confirm that the warning is provided and the output is as expected, and would welcome guidance as to where that test should be placed and how best to combine it with the RENDERERS tests in test_main.py or the more specific test_hocrtransform.py.

    Sample PDF Output

    The following file was processed with this option set to true. When loaded into the latest PDF.js viewer, multi-word search and copy and paste are improved over the standard HOCR output:

    Input PDF: https://github.com/logikcull/OCRmyPDF/blob/master/tests/resources/linn.pdf

    # original command 
    ocrmypdf --output-type pdf --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.pdf

    Output PDF: https://www.dropbox.com/s/2ugzxldqsvy8q6x/output.linn.hocr.pdf

    Behavior when loaded into latest PDF.js viewer -- note that you have to remove spaces to find multiple words. Selecting and pasting the text also has spaces removed:

    # command with new --interword-spaces option
    ocrmypdf --output-type pdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.interword.pdf

    Output PDF: https://www.dropbox.com/s/tukp6ftpjebe1gh/output.linn.hocr.interword.pdf

    Behavior when loaded into latest PDF.js viewer -- note that you can find multiple words separated by spaces. Copy and paste is also improved.

    Testing in Adobe Reader and Chrome's native PDF viewer showed that files rendered with the new option continued to perform as well or better when searching and copying and pasting. Apple Preview handled neither output file particularly well so we think there has at least been no harm done.

    Alternative Approaches

    If this approach of adding a new option with a warning if used without hocr is too disruptive, we could also consider contributing a new pipeline task for a fifth renderer titled 'hocr-sloppy-text' or something similar that runs a nearly identical version of hocrtransform.py with the space suffix turned on by default. This approach has the serious downside of repeating complex code, but the upside of leaving the existing hocr renderer behavior 100% unchanged and opening the way in the future for other "sloppy-text" fixes required to produce PDFs for simpler viewers like PDF.js.

    Related Issues


    • 133: Some hints that Tesseract upgrades might provide some relief, but underlying conclusion was that PDF.js has a naive implementation of text selection and word boundaries (https://github.com/jbarlow83/OCRmyPDF/issues/133).


    • 1235 December 2017: https://github.com/tesseract-ocr/tesseract/issues/1235 includes good explanation of reason for space detection issues: "Known problem. Root cause is PDF spec which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics."
    • 699 https://github.com/tesseract-ocr/tesseract/issues/699#issuecomment-277486345
    • 382 https://github.com/tesseract-ocr/tesseract/issues/382
    • 337 https://github.com/tesseract-ocr/tesseract/issues/337


    • 7310: Super helpful discussion of HTML divs: https://github.com/mozilla/pdf.js/issues/7310
    • 6657: https://github.com/mozilla/pdf.js/issues/6657
    • Related PR not merged: https://github.com/mozilla/pdf.js/pull/5783
    • Dozens of text selection issues: https://github.com/mozilla/pdf.js/issues?q=is%3Aopen+is%3Aissue+label%3A4-text-selection
    opened by cforcey 28
  • NixOS packaging issues

    Hi there

    I'm currently trying to write a package file for OCRmyPDF for NixOS. I think I'm already pretty far, but now I'm stuck on an error that I have no idea how to fix, as it doesn't seem to give any indication of where the problem actually occurs.

    Anyway, I do get this error when it's trying to build OCRmyPDF:

    building path(s) ‘/nix/store/kdpr7qaz85lrls5mwqyvgrfi5v811i5q-ORCmyPDF-5.4.3’
    unpacking sources
    unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
    source root is source
    setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
    patching sources
    Skipping external program tests because of --force
    Traceback (most recent call last):
      File "nix_run_setup.py", line 8, in <module>
        exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
      File "setup.py", line 245, in <module>
      File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
        _Distribution.__init__(self, attrs)
      File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
      File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
        ep.load()(self, ep.name, value)
      File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
        add_cffi_module(dist, cffi_module)
      File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
        execfile(build_file_name, mod_vars)
      File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
        src = f.read()
      File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
    builder for ‘/nix/store/jsnfzz199dy49viv14l1is2i1d2r3lq9-ORCmyPDF-5.4.3.drv’ failed with exit code 1
    cannot build derivation ‘/nix/store/niq3y1rw30sqx5gp5jwrd273hlv6xhb2-system-path.drv’: 1 dependencies couldn't be built
    cannot build derivation ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
    error: build of ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed

    The current nix expression that I use to try to build it looks like:

    { lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:
    with python3.pkgs;
    let
      ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
      img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};
    in
    buildPythonApplication rec {
      version = "5.4.3";
      name = "ORCmyPDF-${version}";
      src = fetchFromGitHub {
        owner = "jbarlow83";
        repo = "OCRmyPDF";
        rev = version;
        sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
      };
      postPatch = ''
        substituteInPlace requirements.txt \
          --replace "ruffus == 2.6.3" "ruffus" \
          --replace "Pillow == 4.3.0" "Pillow" \
          --replace "reportlab == 3.4.0" "reportlab" \
          --replace "PyPDF2 == 1.26.0" "PyPDF2" \
          --replace "img2pdf == 0.2.4" "img2pdf" \
          --replace "cffi == 1.11.2" "cffi"
        substituteInPlace test_requirements.txt \
          --replace "pytest >= 3.0" "pytest"
        export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
      '';
      buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];
      propagatedBuildInputs = [
        # (list entries are missing from the source)
      ];
      meta = {
        homepage = https://github.com/jbarlow83/OCRmyPDF;
        description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
        license = lib.licenses.mit;
        maintainers = with lib.maintainers; [ hyper_ch ];
      };
    }

    I understand that there seems to be a problem with one of the files but I can't figure out where the problem actually occurs.
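
The last frames of the traceback show cffi's setuptools_ext.execfile reading a build script through the ascii codec, which suggests that the build environment falls back to an ASCII default encoding and that the file being read contains a non-ASCII character near its start. The Python sketch below reproduces that class of error with a made-up file header. The header content and the helper name are assumptions for illustration, and which file actually triggers the failure is not confirmed by the log.

```python
# A hypothetical source header containing a non-ASCII character, as UTF-8 bytes.
source_bytes = "# \u00a9 example header\n".encode("utf-8")


def first_undecodable(data, encoding):
    """Return (position, byte) of the first byte `encoding` cannot decode, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


if __name__ == "__main__":
    # The copyright sign is encoded as the two bytes 0xc2 0xa9 in UTF-8.
    print(first_undecodable(source_bytes, "ascii"))
    print(first_undecodable(source_bytes, "utf-8"))
```

If this reading is right, making the build use a UTF-8 locale would be one thing to try, though that remedy is untested here.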

    opened by sjau 26
  • ocrmypdf 11.4.4 failed to build on apple silicon

    Describe the bug ocrmypdf 11.4.4 failed to build on apple silicon

    build error message (run log url):

    ==> /opt/homebrew/Cellar/ocrmypdf/11.4.4/bin/ocrmypdf -f -q --deskew /opt/homebrew/Library/Homebrew/test/support/fixtures/test.pdf ocr.pdf
    Traceback (most recent call last):
      File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/bin/ocrmypdf", line 5, in <module>
        from ocrmypdf.__main__ import run
      File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/libexec/lib/python3.9/site-packages/ocrmypdf/__init__.py", line 10, in <module>
        from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
      File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/libexec/lib/python3.9/site-packages/ocrmypdf/leptonica.py", line 174, in <module>
        def _stderr_handler(cstr):
    MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks

    To Reproduce pip installation and run on darwin arm64 system?

    Expected behavior Build successfully

    System (please complete the following information):

    • OS: OSX darwin arm64
    • Python version: python 3.9
    • OCRmyPDF version: Ocrmypdf 11.4.4

    Additional context relates to https://github.com/Homebrew/homebrew-core/pull/68159

    opened by chenrui333 25
  • dependency problem reportlab - although installed...

    Issue by andreasotto Tue Nov 4 10:44:25 2014 Originally opened as https://github.com/fritz-hh/OCRmyPDF/issues/99

    # ./OCRmyPDF.sh /home/ao/Leerungstermine189973.PDF /home/ao/test.pdf
    Please install the python library reportlab. Exiting...
    # apt-get install python-reportlab
    python-reportlab is already the newest version.

    .. already installed.

    Debian 6 squeeze

    opened by OCRmyPDF-issuebot 25
  • Using Ubuntu Snap as packaging format

    I took the liberty of creating a snap application recipe "snapcraft.yaml" which enables snapcraft's build platform to build a working snap application for ocrmypdf.

    Take a look here: https://github.com/alexanderlanganke/ocrmypdf-snap

    While building, it pulls in the application using pip so that it always uses the most recent version. This may make it easier for users to access ocrmypdf.

    So far I am getting the application to build and run, but I am running into a missing dependency at runtime. I believe I need to adjust the path for one or two libraries.

    I have also registered this snap (private for now) on snapcraft.

    If you are interested, and I get it working, I would offer to maintain this snap for you or pass it on to you if you wish to do it yourself. Credit for the application will of course go to you! Snapcraft pulls from github so you basically need to get it working once and never touch it again. It will rebuild whenever you push to the linked repository (version bump for example).

    opened by alexanderlanganke 23
  • [13.4.2] lossy compression of pngs into jpegs when it shouldn't

    1. It might be just the older version, but ocrmypdf 12.7.2 seems to compress uncompressed pngs into (lossy) jpegs:
    $ ocrmypdf --version
    $ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
    $ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
    $ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
    $ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
    $ pdfimages -list ./Example-uncompress-compress.pdf
    page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
       1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%

    I believe it should be running the image through pngquant instead at optimize level 1.

    2. Btw, it's probably not even worth mentioning since, looking at the changelog, I'm fairly certain you've already sorted it out in recent ocrmypdf versions, but small pdfs with small pngs grow instead of shrinking / remaining the same:
    $ ocrmypdf --version
    $ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
    $ img2pdf ./Example.png -o ./Example.pdf
    $ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
    $ stat -c "%n,%s" Example*.* | column -t -s,
    Example-compress.pdf  7799
    Example.pdf           3906
    Example.png           2335

    Though this might also be the pdf format changing to the archival specs...

    3. As a side note, if compute time isn't a factor, I personally found 'optipng -o7' to produce smaller pngs than pngquant and 'jpegrescan -i -t -v' to produce the smallest jpeg, even compared to MozJPEG despite the author saying otherwise oddly enough.

    p.s. forgot to mention the png-to-jpeg bug also happens with some compressed pngs but I haven't bothered trying to replicate this since I believe it should never try to convert bitmap images to jpegs to begin with.

    opened by RamKromberg 21
  • Anaconda - Successful Install but not working

    Describe the bug (*update: 2022-04-22): Reorder sentences

    What's the problem? I tried installing ocrmypdf using Conda on Windows, and the installation looks successful. Running tesseract on test.jpg works fine:

    (ocrmypdf) C:\Users\Denz\Downloads>tesseract test.jpg test

    But whenever I run a test pdf, it doesn't output the OCR text. Here is the error log:

    (ocrmypdf) C:\Users\Denz\Downloads>ocrmypdf --force-ocr NeedOCR2.pdf output.pdf
    Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.11page/s]
        1 page already has text! - rasterizing text and running OCR anyway
        1 [tesseract] read_params_file: Can't open pdf
        1 [tesseract] read_params_file: Can't open txt
    OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:07<00:00,  7.41s/page]
    PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.20s/page]
    Recompressing JPEGs: 0image [00:00, ?image/s]
    Deflating JPEGs: 0image [00:00, ?image/s]
    JBIG2: 0item [00:00, ?item/s]
    Optimize ratio: 1.00 savings: -0.0%
    Image optimization did not improve the file - optimizations will not be used
    Output file is a PDF/A-2B (as expected)
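
The two read_params_file lines are worth a closer look. Tesseract treats trailing arguments such as pdf and txt as config file names, and these files normally live in a configs directory inside the tessdata directory, so "Can't open pdf" usually means that directory is missing or not where Tesseract looks. The Python sketch below is a hypothetical check, not part of OCRmyPDF. The tessdata location inside a conda environment is not known from this report and has to be supplied by the user.

```python
from pathlib import Path


def missing_tesseract_configs(tessdata_dir, names=("pdf", "txt")):
    """Return the config names that are absent from <tessdata_dir>/configs."""
    configs = Path(tessdata_dir) / "configs"
    return [name for name in names if not (configs / name).is_file()]
```

An empty result means both config files exist. A non-empty result would point at the Tesseract package or its TESSDATA_PREFIX setting rather than at OCRmyPDF itself.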

    my Environment Packages inside Conda

    (ocrmypdf) C:\Users\Denz\Downloads>conda list
    # packages in environment at C:\Users\Denz\anaconda3\envs\ocrmypdf:
    # Name                    Version                   Build  Channel
    bzip2                     1.0.8                h8ffe710_4    conda-forge
    ca-certificates           2021.10.8            h5b45459_0    conda-forge
    cffi                      1.15.0                   pypi_0    pypi
    chardet                   4.0.0                    pypi_0    pypi
    colorama                  0.4.4                    pypi_0    pypi
    coloredlogs               15.0.1                   pypi_0    pypi
    cryptography              36.0.2                   pypi_0    pypi
    ghostscript               9.54.0               h0e60522_2    conda-forge
    humanfriendly             10.0                     pypi_0    pypi
    img2pdf                   0.4.3                    pypi_0    pypi
    jbig                      2.1               h8d14728_2003    conda-forge
    jpeg                      9e                   h8ffe710_0    conda-forge
    leptonica                 1.78.0               h688788b_4    conda-forge
    lerc                      3.0                  h0e60522_0    conda-forge
    libarchive                3.5.2                habf0b7a_1    conda-forge
    libdeflate                1.10                 h8ffe710_0    conda-forge
    libffi                    3.4.2                h8ffe710_5    conda-forge
    libiconv                  1.16                 he774522_0    conda-forge
    libpng                    1.6.37               h1d00b33_2    conda-forge
    libtiff                   4.3.0                hc4061b1_3    conda-forge
    libwebp                   1.2.2                h57928b3_0    conda-forge
    libwebp-base              1.2.2                h8ffe710_1    conda-forge
    libxml2                   2.9.12               hf5bbc77_2    conda-forge
    libzlib                   1.2.11            h8ffe710_1014    conda-forge
    lxml                      4.8.0                    pypi_0    pypi
    lz4-c                     1.9.3                h8ffe710_1    conda-forge
    lzo                       2.10              he774522_1000    conda-forge
    ocrmypdf                  13.4.1                   pypi_0    pypi
    openjpeg                  2.4.0                hb211442_1    conda-forge
    openssl                   3.0.2                h8ffe710_1    conda-forge
    packaging                 21.3                     pypi_0    pypi
    pdfminer-six              20211012                 pypi_0    pypi
    pikepdf                   5.1.1                    pypi_0    pypi
    pillow                    9.0.1                    pypi_0    pypi
    pip                       22.0.4             pyhd8ed1ab_0    conda-forge
    pluggy                    1.0.0                    pypi_0    pypi
    pngquant                  1.0.7                    pypi_0    pypi
    pycparser                 2.21                     pypi_0    pypi
    pyparsing                 3.0.7                    pypi_0    pypi
    pyreadline3               3.4.1                    pypi_0    pypi
    python                    3.10.4          hcf16a7b_0_cpython    conda-forge
    python_abi                3.10                    2_cp310    conda-forge
    reportlab                 3.6.9                    pypi_0    pypi
    setuptools                61.3.0          py310h5588dad_0    conda-forge
    sqlite                    3.37.1               h8ffe710_0    conda-forge
    tesseract                 5.0.1                h17c68af_0    conda-forge
    tk                        8.6.12               h8ffe710_0    conda-forge
    tqdm                      4.63.1                   pypi_0    pypi
    tzdata                    2022a                h191b570_0    conda-forge
    ucrt                      10.0.20348.0         h57928b3_0    conda-forge
    vc                        14.2                 hb210afc_6    conda-forge
    vs2015_runtime            14.29.30037          h902a5da_6    conda-forge
    wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
    xz                        5.2.5                h62dcd97_1    conda-forge
    zlib                      1.2.11            h8ffe710_1014    conda-forge
    zstd                      1.5.2                h6255e5f_0    conda-forge

    System (please complete the following information):

    • OS: Windows 10
    • Python version: 3.10
    • OCRmyPDF version: 13.4.1

    Installation Installed via Pip

    Additional context I believe this Issue is a similar problem, but the fix there was done on Linux. I don't know how to fix it under conda.

    Here are the before & after files NeedOCR2.pdf output.pdf


    third party issue 
    opened by denzchoe 21
  • Segmentation fault when using pipes

    Describe the bug When running ocrmypdf through podman/docker I sometimes (#864) experience segmentation faults and the container hangs indefinitely. The output file is empty.

    To Reproduce The following command reproduces the failure. Because the behavior of ocrmypdf is non-deterministic, it might take a while, or even several runs of the loop, to reproduce.

    for i in $(seq 0 100); do
        podman run --rm -i ocrmypdf --verbose -rcd  --jbig2-lossy -l deu - - <tmp.pdf >out.pdf; done
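
Because a hung container blocks the shell loop indefinitely, a reproduction harness with a timeout may be more practical. The Python sketch below is a hypothetical helper, not part of OCRmyPDF. It feeds the same input on stdin for each attempt and stops at the first timeout, non-zero exit code, or empty output. The timeout value is an arbitrary assumption.

```python
import subprocess
import sys


def run_until_failure(cmd, input_bytes, attempts=100, timeout=120):
    """Run cmd repeatedly and return (attempt, reason) for the first failure.

    Returns None if every attempt exits with code 0 and writes to stdout.
    """
    for attempt in range(attempts):
        try:
            result = subprocess.run(
                cmd, input=input_bytes, capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return attempt, "timeout"
        if result.returncode != 0:
            return attempt, f"exit code {result.returncode}"
        if not result.stdout:
            return attempt, "empty output"
    return None


if __name__ == "__main__":
    # Stand-in command that echoes stdin; replace with the podman command above.
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
    print(run_until_failure(echo, b"%PDF-", attempts=3, timeout=30))
```

Note that the timeout stops the client process started by this script. Whether the container itself is also stopped depends on the container runtime.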

    All of the options can be omitted and the issue is reproducible. The resulting log is:

    ocrmypdf 12.6.0.post6+g42713b77.d20211012
    Running: ['tesseract', '--list-langs']
    stdout/stderr = List of available languages (7):
    Running: ['unpaper', '--version']
    Found unpaper 6.1
    Running: ['tesseract', '--version']
    Found tesseract 4.1.1
    Running: ['gs', '--version']
    Found gs 9.53.3
    reading file from standard input
    os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/stdin, /tmp/ocrmypdf.io.yzr1_6f6/origin.pdf)
    Using Tesseract OpenMP thread limit 3
        1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
        1 Rotating output by 0
        1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/ocrmypdf.io.yzr1_6f6/000001_rasterize_preview.jpg', 'stdout']
        1 page is facing ⇧, confidence 7.23 - no change
        1 Rasterize with pnggray, rotation 0
        1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
        1 Rotating output by 0
        1 Running: ['unpaper', '-v', '--dpi', '150.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpmqv67lqw/input.pnm', '/tmp/tmpmqv67lqw/output.pgm']
        1 stdout/stderr = [image2 @ 0x55a80053afc0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
    [image2 @ 0x55a80053afc0] Encoder did not produce proper pts, making some up.
    unpaper 6.1
    License GPLv2: GNU GPL version 2.
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Processing sheet #1: /tmp/tmpmqv67lqw/input.pnm -> /tmp/tmpmqv67lqw/output.pgm
    input-file for sheet 1: /tmp/tmpmqv67lqw/input.pnm
    output-file for sheet 1: /tmp/tmpmqv67lqw/output.pgm
    sheet size: 1232x1718
    noise-filter ... deleted 47 clusters.
    blur-filter... deleted 0 pixels.
    writing output.
        1 resolution (150.01239999999999, 150.01239999999999)
        1 convert
        1 PIL format = PNG
        1 imgformat = PNG
        1 input dpi = 150 x 150
        1 rotation = 0°
        1 input colorspace = L
        1 width x height = 1232px x 1718px
        1 read_images() embeds a PNG
        1 convert done
        1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr.png', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr_tess', 'pdf', 'txt']
        1 Emplacement update
        1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
        1 Grafting
        1 Page rotation: (content, auto) -> page = (0, 0) -> 0
    os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/graft_layers.pdf, /tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf)
    Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf', '/tmp/ocrmypdf.io.yzr1_6f6/pdfa.ps']
    GPL Ghostscript 9.53.3 (2020-10-01)
    Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
    This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
    see the file COPYING for details.
    Processing pages 1 through 1.
    Page 1
    Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
    The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
    Treating 18 as an optimization candidate
    XrefExt(xref=18, ext='.png')
    Optimizable images: JPEGs: 0 PNGs: 1
    Treating 18 as an optimization candidate
    Optimizable images: JBIG2 groups: (0,)
    Optimize ratio: 1.00 savings: 0.0%
    os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/optimize.opt.pdf, /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf)
    /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf -> -
    Output sent to stdout

    dmesg yields:

    [21719.464718] conmon[91767]: segfault at 111d000 ip 00007fcf434cf980 sp 00007ffc7f66d4e8 error 4 in libc.so.6[7fcf43380000+176000]
    [21719.464741] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed

    (Always the same location in libc)

    After replacing >out.pdf with tee out.pdf, I could at some point see strange characters being emitted after %%EOF (?); most of the time, however, it hangs before that.

    Example file The example file is attached in encrypted form. tmp.pdf.gpg.zip

    Expected behavior The output file should be correct and the tool should not hang.
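The reproduction loop above can also be driven from a script that enforces a timeout, so that a hung run is recorded instead of blocking the loop. This is only a minimal sketch: the helper name `run_once`, the outcome labels, and the 120-second default are illustrative assumptions, not part of OCRmyPDF or podman.

```python
import subprocess
import sys


def run_once(cmd, pdf_bytes, timeout_s=120):
    """Run one conversion with the PDF on stdin and classify the outcome.

    Returns one of "ok", "empty", "error" or "hang" (labels are illustrative).
    """
    try:
        proc = subprocess.run(
            cmd,
            input=pdf_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising
        return "hang"
    if proc.returncode != 0:
        return "error"
    if not proc.stdout:
        # Matches the symptom in the report: the output file is empty
        return "empty"
    return "ok"


def tally(cmd, pdf_bytes, runs=100, timeout_s=120):
    """Repeat the run and count how often each outcome occurs."""
    counts = {}
    for _ in range(runs):
        outcome = run_once(cmd, pdf_bytes, timeout_s)
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts
```

With the command from the report, a call might look like `tally(["podman", "run", "--rm", "-i", "ocrmypdf", "-", "-"], open("tmp.pdf", "rb").read())`. Killing the podman client on timeout may not stop the container itself, so leftover containers may need to be cleaned up separately.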


    • OS: Fedora 35
    • OCRmyPDF Version: 12.6.0.post6+g42713b77.d20211012, but reproducible just as well with jbarlow83/ocrmypdf:v13.2.0, jbarlow83/ocrmypdf:v13.1.1 and jbarlow83/ocrmypdf:v13.1.0
    • How did you install ocrmypdf? podman pull jbarlow83/ocrmypdf
    third party issue 
    opened by Fulguritus 20
  • White glyphs when selecting ocr-text in Evince


    Problem in the Evince PDF reader:

    It only happens when selecting text. Is this a display failure, or are fonts missing? Otherwise the OCR text is correct. Is this similar to #178?

    opened by robinrosenstock 20
  • [BUG] crash when trying to process a pdf


    Describe the bug I have a pdf which causes a crash.

    To Reproduce

    Just pulled a fresh docker image and ran:

    docker run --rm -i jbarlow83/ocrmypdf -v1 - - <~/fastrak-2022-12-27.pdf >output.pdf

    Here's the stack trace:

    ocrmypdf 14.0.2.dev8+g14a60936.d20221215
    Running: ['tesseract', '--version']
    Found tesseract 5.2.0-80-g4906
    Running: ['tesseract', '--version']
    Running: ['gs', '--version']
    Found gs 9.55.0
    Running: ['gs', '--version']
    Running: ['tesseract', '--list-langs']
    stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (7):
    reading file from standard input
    os.symlink(/tmp/ocrmypdf.io.9dcgi0mu/stdin, /tmp/ocrmypdf.io.9dcgi0mu/origin.pdf)
    An exception occurred while executing the pipeline
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 378, in run_pipeline
        pdfinfo = get_pdfinfo(
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo
        return PdfInfo(
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 934, in __init__
        self._pages = _pdf_pageinfo_concurrent(
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 711, in _pdf_pageinfo_concurrent
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_concurrent.py", line 87, in __call__
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
        result = future.result()
      File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
        return self.__get_result()
      File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
        raise self._exception
      File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 668, in _pdf_pageinfo_sync
        page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 748, in __init__
        self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 794, in _gather_pageinfo
        for info in _process_content_streams(
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 595, in _process_content_streams
        yield from _find_regular_images(container, contentsinfo)
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 501, in _find_regular_images
        for pdfimage, xobj in _image_xobjects(container):
      File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 483, in _image_xobjects
        if '/Subtype' not in candidate:
    TypeError: argument of type 'NoneType' is not iterable
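The last frame fails because a membership test is applied to `None`. The sketch below reproduces that failure mode with plain dictionaries standing in for PDF objects and shows one defensive guard. It is not the actual OCRmyPDF code or its fix; the function name `image_xobjects` and the dict-based data model are assumptions made for illustration.

```python
def image_xobjects(xobjects):
    """Yield (name, obj) pairs for entries that look like image XObjects.

    `xobjects` is a plain dict standing in for a PDF /XObject dictionary.
    A value may be None, for example when a reference cannot be resolved.
    """
    for name, candidate in xobjects.items():
        if candidate is None:
            # Without this guard, `'/Subtype' not in candidate` raises
            # TypeError: argument of type 'NoneType' is not iterable
            continue
        if '/Subtype' not in candidate:
            continue
        if candidate['/Subtype'] == '/Image':
            yield name, candidate
```

Skipping unresolved entries is only one possible policy; a real fix might instead log a warning or reject the file as malformed.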

    Example file example.zip


    • OS: Linux
    • OCRmyPDF Version: 14.0.2.dev8+g14a60936.d20221215
    • How did you install ocrmypdf? Docker
    opened by frrad 0
  • Feature request: Ask user what likely-incorrect words are


    OCRmyPDF is great as it is. It is an excellent tool for OCRing PDFs without any human involvement.

    However, if a human is available, their involvement could be put to good use.

    Problem Tesseract+OCRmyPDF doesn't OCR every word correctly when this is desired. When outputting to PDFs, correcting such PDFs is more difficult than correcting outputted text.

    Proposed solution Tesseract generates a low confidence value for words whose glyphs it has difficulty working out. I understand OCRmyPDF checks all words against a dictionary for the selected language as part of its existing process.

    For a word that is likely wrong, the part of the image containing its sentence could be presented to the user (with the word marked by a red box), the user asked what the word is (like a CAPTCHA), and the OCR results amended. If a word the user provides isn't in the dictionary, they should be asked whether they want to add it.

    This would happen in parallel with the main processing. Sentences containing words identified for checking would cumulatively fill the screen, waiting for human response.

    The proposed functionality would obviously not be default, and should have appropriate user settings for adjustment.
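As a rough illustration of the first step of this proposal, the sketch below lists low-confidence words from hOCR output, where Tesseract records per-word confidence as `x_wconf` in each word's `title` attribute. The class name, the threshold of 60, and the sample markup are illustrative assumptions; presenting image crops and collecting corrections is not covered.

```python
import re
from html.parser import HTMLParser


class LowConfidenceWords(HTMLParser):
    """Collect hOCR words whose x_wconf is below a threshold."""

    def __init__(self, threshold=60):
        super().__init__()
        self.threshold = threshold
        self.flagged = []      # list of (word, confidence, bbox)
        self._current = None   # (confidence, bbox) while inside a word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            conf = re.search(r"x_wconf\s+(\d+)", title)
            bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
            if conf:
                box = tuple(int(v) for v in bbox.groups()) if bbox else None
                self._current = (int(conf.group(1)), box)
                self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            conf, box = self._current
            if conf < self.threshold:
                self.flagged.append(("".join(self._text).strip(), conf, box))
            self._current = None


def low_confidence_words(hocr_text, threshold=60):
    parser = LowConfidenceWords(threshold)
    parser.feed(hocr_text)
    return parser.flagged
```

The bounding box returned with each word is what a review interface would need to crop the relevant region of the page image and draw the red box.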

    Describe alternatives you've considered Get OCRmyPDF to output hOCR and PDF files simultaneously, then go through the pages manually using gImageReader. This would work, but more slowly than the proposed method.

    opened by mattention 0
  • Is it possible to capture Tesseract messages and suggestions either as exceptions or exit codes?


    Is your feature request related to a problem? Please describe. Sometimes when running OCR jobs with redo_ocr, I can see certain suggestions like rescanning the file with force_ocr from OCRmyPDF and similar observations about the quality of text from Tesseract. Is it possible to somehow capture these messages, so that I can programmatically filter those files out and rerun OCR with the recommended parameters?

    Describe the solution you'd like A status code or custom exception to catch and retry the running job.

    Describe alternatives you've considered Filtering stdout and looking for said keywords.
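When OCRmyPDF is called as a library, a variation on the keyword-filtering alternative is to attach a handler to Python's `logging` system instead of parsing stdout. The sketch below assumes the messages are emitted through a logger named `ocrmypdf` (or its children); the logger name, the keyword strings and the `job` callable are assumptions for illustration, and the exact message wording should be checked against real output.

```python
import logging


class KeywordCapture(logging.Handler):
    """Remember log messages that contain any of the given keywords."""

    def __init__(self, keywords):
        super().__init__(level=logging.DEBUG)
        self.keywords = [k.lower() for k in keywords]
        self.matches = []

    def emit(self, record):
        message = record.getMessage()
        if any(k in message.lower() for k in self.keywords):
            self.matches.append(message)


def run_and_capture(job, keywords, logger_name="ocrmypdf"):
    """Run `job()` and return the matching log messages seen meanwhile.

    `logger_name` is an assumption; records from child loggers propagate
    up to this logger's handlers.
    """
    logger = logging.getLogger(logger_name)
    handler = KeywordCapture(keywords)
    logger.addHandler(handler)
    try:
        job()
    finally:
        logger.removeHandler(handler)
    return handler.matches
```

A caller could pass a function that runs the OCR job with `redo_ocr`, then rerun with `force_ocr` if the returned list is non-empty. Messages below the logger's effective level (WARNING by default) are not delivered to the handler unless the logger level is lowered.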

    Example file N/A

    Additional context N/A

    opened by sergeyyurkov1 0
  • [BUG] `--deskew` not compatible with blank pages or with tesseract_timeout = 0


    Describe the bug The --deskew option is not behaving as expected on OCRmyPDF 13.7.0. I am experiencing two issues related to deskew.

    Issue 1: Deskew not working on blank pages

    I'm using the options --output-type=pdf --tesseract-timeout=30 on this blank_image.pdf. When I run OCRmyPDF with these options, I get a SubprocessOutputError. I see that this issue is referenced here: https://github.com/ocrmypdf/OCRmyPDF/issues/868, but I don't think the bug fix covered all scenarios.

    Issue 2: Deskew not working with tesseract_timeout=0

    I want to deskew PDFs without running OCR on them, as mentioned in the docs here. However, when --tesseract-timeout=0, the document is not being deskewed because OCR is not being run. If I change --tesseract-timeout to a different integer, it successfully deskews. Here is a skewed PDF that can be used to reproduce the issue: skewed_text.pdf

    To Reproduce Issue 1: Use blank_image.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=30 blank_image.pdf result.pdf. Issue 2: Use skewed_text.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=0 skewed_text.pdf result_pdf.

    Expected behavior I expect that blank pages do not completely block the ocrmypdf command from running. It should handle the error gracefully and skip deskewing that specific page. I also expect that with --tesseract-timeout=0 the page can be deskewed without having OCR applied.

    Screenshots If applicable, add screenshots to help explain your problem. Deskew with 0 second timeout: skewed_with_0_second_timeout Deskew with 30 second timeout: skewed_with_30_second_timeout

    System (please complete the following information):

    • OS: MacOS Ventura 13.0.1
    • OCRmyPDF version: 13.7.0

    Installation brew install ocrmypdf

    opened by deexpabada 0
  • Spaces in Japanese


    Hi all! I wonder if it is possible to run OCR so that all spaces are completely ignored in the output? Languages like Japanese do not really use spaces (even after commas or periods), but currently OCRmyPDF seems to find spaces between almost every character, which is very problematic when you want to search for sentences or words in the document, or run parts of it through Google Translate. Thank you in advance!
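One possible workaround, sketched below, is to post-process text after it has been extracted from the PDF (for example from a sidecar text file) and drop whitespace that sits between two Japanese characters. This does not change the text layer inside the PDF itself, so in-viewer search is unaffected. The function name and the chosen Unicode ranges are assumptions for illustration.

```python
import re

# Character ranges treated as "Japanese" for this sketch: CJK punctuation
# (without the ideographic space), hiragana, katakana, common kanji, and
# full-width forms.
_CJK = r"\u3001-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Spaces, tabs or ideographic spaces with a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text):
    """Remove horizontal whitespace between two CJK characters.

    Spaces next to Latin letters or digits are kept, and line breaks
    are left untouched.
    """
    return _SPACE_BETWEEN_CJK.sub("", text)
```

For example, `strip_cjk_spaces("日 本 語、 テ ス ト")` gives `"日本語、テスト"`, while the spaces in a mixed string such as `"OCR と PDF"` are preserved.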

    opened by KajiyaOokami 3
  • Ignore Digital Signed Documents



    Is there a way to ignore digitally signed documents? And were there any changes recently? I would swear that a year ago digitally signed documents would just throw an error.


    opened by flaviobrunopereira 0
  • v4.0 (Feb 17, 2016)

    • Automatic page rotation (-r) is now available. It ignores any prior rotation information on PDFs and sets rotation based on the dominant orientation of detectable text. This feature is fairly reliable, but some false positives occur, especially if there is not much text to work with. (#4)
    • Deskewing is now performed using Leptonica instead of unpaper. Leptonica is faster and more reliable at image deskewing than unpaper.
  • v3.2 (Feb 5, 2016)

  • v3.1.1 (Jan 10, 2016)

  • v3.1 (Dec 4, 2015)

    • Default output format is now PDF/A-2b instead of PDF/A-1b
    • Python 3.5 and OS X El Capitan are now supported platforms - no changes were needed to implement support
    • Improved some error messages related to missing input files
    • Fixed issue #20 - uppercase .PDF extension not accepted
    • Fixed an issue where OCRmyPDF failed to detect that certain pages contained previously OCR'ed text, such as OCR text produced by Tesseract 3.04
    • Inserts /Creator tag into PDFs so that errors can be traced back to this project
    • Added new option --pdf-renderer=auto, to let OCRmyPDF pick the best PDF renderer. Currently it always chooses the 'hocrtransform' renderer but that behavior may change.
    • Set up Travis CI automatic integration testing
  • v3.0 (Sep 14, 2015)

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

doc2text doc2text extracts higher quality text by fixing common scan errors Developing text corpora can be a massive pain in the butt. Much of the tex

Joe Sutherland 1.3k Jan 4, 2023
Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.

Deskew by Marek Mauder https://galfar.vevb.net/deskew https://github.com/galfar/deskew v1.30 2019-06-07 Overview Deskew is a command line tool for des

Marek Mauder 127 Dec 3, 2022
This Python script converts a PDF to an image, then uses Tesseract as the OCR engine to convert the image to text

Script_Convertir_PDF_IMG_TXT This Python script converts a PDF to an image, then uses Tesseract as the OCR engine to convert the image to text. p

alebogado 1 Jan 27, 2022
Recognizing the text contents from a scanned visiting card

Recognizing the text contents from a scanned visiting card. The application which is used to recognize the text from scanned images, printed documents, r

Faizan Habib 1 Jan 28, 2022
Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

English | 简体中文 Introduction PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and a

27.5k Jan 8, 2023
It is an image OCR tool using the Tesseract-OCR engine with the pytesseract package and has a GUI.

OCR-Tool It is an image OCR tool made in Python using the Tesseract-OCR engine with the pytesseract package and has a GUI. This is my second ever pytho

Khant Htet Aung 4 Jul 11, 2022
Indonesian ID Card OCR using tesseract OCR

KTP OCR Indonesian ID Card OCR using tesseract OCR KTP OCR is python-flask with tesseract web application to convert Indonesian ID Card to text / JSON

Revan Muhammad Dafa 5 Dec 6, 2021
A program that enables OCR (optical character recognition) of a PDF.

This program is intended to be a PDF file modifier. PDF files can be of 3 kinds: true PDFs - in which it is possible to select the ti

Daniel Soares Saldanha 2 Oct 11, 2021
Convert PDF/Image to TXT using EasyOcr - the best OCR engine available!

PDFImage2TXT - DOWNLOAD INSTALLER HERE What can you do with it? Convert scanned PDFs to TXT. Convert scanned Documents to TXT. No coding required!! In

Hans Alemão 2 Feb 22, 2022
Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.


Microsoft 235 Dec 22, 2022
A post-processing tool for scanned sheets of paper.

unpaper Originally written by Jens Gulden — see AUTHORS for more information. Licensed under GNU GPL v2 — see COPYING for more information. Overview u

27 Dec 7, 2022
Library used to deskew a scanned document

Deskew //Note: Skew is measured in degrees. Deskewing is a process whereby skew is removed by rotating an image by the same amount as its skew but in

Stéphane Brunner 273 Jan 6, 2023
Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

Jainam Shah 243 Dec 30, 2022
Extract tables from scanned image PDFs using Optical Character Recognition.

ocr-table This project aims to extract tables from scanned image PDFs using Optical Character Recognition. Install Requirements Tesseract OCR sudo apt

Abhijeet Singh 209 Dec 6, 2022
Python library to extract tabular data from images and scanned PDFs

Overview ExtractTable - API to extract tabular data from images and scanned PDFs The motivation is to make it easy for developers to extract tabular d

Org. Account 165 Dec 31, 2022
Some bits of javascript to transcribe scanned pages using PageXML

nashi (nasḫī) Some bits of javascript to transcribe scanned pages using PageXML. Both ltr and rtl languages are supported. Try it! But wait, there's m

Andreas Büttner 15 Nov 9, 2022
scantailor - Scan Tailor is an interactive post-processing tool for scanned pages.

Scan Tailor - scantailor.org This project is no longer maintained, and has not been maintained for a while. About Scan Tailor is an interactive post-p

1.5k Dec 28, 2022
Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-tools About About the code Installation System-wide with pip System-wide from source virtualenv Available Programs hocr-check -- check the hOCR f

OCRopus 285 Dec 8, 2022