OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Overview

OCRmyPDF


OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
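
The annotated command above can also be assembled from a script. The following is a minimal sketch using only the Python standard library; the helper name `build_ocrmypdf_command` and its parameters are hypothetical, and only the flags shown above are used.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",),
                           rotate_pages=False, deskew=False, title=None,
                           jobs=None, output_type=None):
    """Return the argument list for an ocrmypdf invocation (hypothetical helper)."""
    # Multiple languages are joined with "+", as in "-l eng+fra"
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    # Input and output files come last
    cmd += [input_pdf, output_pdf]
    return cmd


command = build_ocrmypdf_command(
    "input_scanned.pdf", "output_searchable.pdf",
    languages=("eng", "fra"), rotate_pages=True, deskew=True,
    title="My PDF", jobs=4, output_type="pdfa",
)
# Print the command as a single shell-quoted line
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run it, assuming ocrmypdf is installed and on the PATH.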

See the release notes for details on the latest changes.

Main features

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Scales properly to handle files with thousands of pages
  • Battle-tested on millions of PDFs

For details: please consult the documentation.

Motivation

I searched the web for a free command line tool to OCR PDF files. I found many, but none of them were really satisfying:

  • Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
  • Or they did not handle accents and multilingual characters
  • Or they changed the resolution of the embedded images
  • Or they generated ridiculously large PDF files
  • Or they crashed when trying to OCR
  • Or they did not produce valid PDF files
  • On top of that, none of them produced PDF/A files (a format dedicated to long-term storage)

...so I decided to develop my own tool.

Installation

Linux, Windows, macOS and FreeBSD are supported. Docker images are also available, for both x64 and ARM.

Operating system             Install command
Debian, Ubuntu               apt install ocrmypdf
Windows Subsystem for Linux  apt install ocrmypdf
Fedora                       dnf install ocrmypdf
macOS                        brew install ocrmypdf
LinuxBrew                    brew install ocrmypdf
FreeBSD                      pkg install py37-ocrmypdf
Conda                        conda install ocrmypdf

For everyone else, see our documentation for installation steps.

Languages

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.
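
Before passing -l, it can help to check that every requested language pack is installed. The sketch below parses text in the format printed by `tesseract --list-langs` (a header line followed by one language code per line) and reports which codes of an `eng+fra`-style argument are missing. The function names are hypothetical, and the sample text stands in for real command output.

```python
def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` style output."""
    lines = [line.strip() for line in output.splitlines()]
    # Assumption: the first line is a header such as
    # "List of available languages (7):" and is skipped
    return [line for line in lines[1:] if line]


def missing_languages(requested, available):
    """Return the codes in an "eng+fra"-style string that are not installed."""
    return [code for code in requested.split("+") if code not in available]


# Sample output used in place of a real tesseract call
sample = """List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa
"""

available = parse_list_langs(sample)
print(missing_languages("eng+fra", available))
print(missing_languages("eng+jpn", available))
```

In real use the sample string would be replaced by the captured output of `tesseract --list-langs`.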

OCRmyPDF supports Tesseract 4.0 and the beta versions of Tesseract 5.0. It will automatically use whichever version it finds first on the PATH environment variable. On Windows, if PATH does not provide a Tesseract binary, we use the highest version number that is installed according to the Windows Registry.

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for a quick response.

Requirements

In addition to the required Python version (3.7+), OCRmyPDF requires external program installations of Ghostscript and Tesseract OCR. OCRmyPDF is pure Python, and runs on pretty much everything: Linux, macOS, Windows and FreeBSD.
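
A quick way to verify these requirements from a script is sketched below. It is an illustration, not part of OCRmyPDF: the function name is hypothetical, and the executable names (`gs`, `gswin64c`, `tesseract`) are assumptions about how the programs are installed on a given system.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 7)
# Assumed executable names; Ghostscript on Windows is often gswin64c rather than gs
REQUIRED_PROGRAMS = {
    "Ghostscript": ("gs", "gswin64c"),
    "Tesseract OCR": ("tesseract",),
}


def check_requirements(which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    if tuple(version[:2]) < REQUIRED_PYTHON:
        problems.append("Python 3.7 or newer is required")
    for label, names in REQUIRED_PROGRAMS.items():
        # shutil.which returns the first match on PATH, or None
        if not any(which(name) for name in names):
            problems.append(f"{label} not found on PATH (tried: {', '.join(names)})")
    return problems


for problem in check_requirements():
    print(problem)
```

The `which` and `version` parameters exist only so the check can be exercised without changing the real environment.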

Press & Media

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.

License

The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as noted in those files and the debian/copyright file. Most files in misc/ use the MIT license, and the documentation and test files are generally licensed under Creative Commons ShareAlike 4.0 (CC-BY-SA 4.0).

Disclaimer

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Comments
  • Improve user experience for Windows 10


    Hi

    Describe the issue I've managed to run OCRmyPDF.exe on Windows 10 without wsl.

    To Reproduce I've made fork and added some quick fixes in this commit: https://github.com/dibu28/OCRmyPDF/commit/543088e79e8649e968d02d8fd268123255607dc1

    Fixes are:

    1. in leptonica.py the library name is liblept-5 instead of lept
    2. in ghostscript.py
       2.1) the executable name is gswin64c.exe instead of gs
       2.2) NamedTemporaryFile doesn't work properly, and gs could not modify the temp file (access denied error). As a temporary workaround I'm adding "_1" to the temp file name and then removing the file. There could be a better way.
    3. in _pipeline.py and helpers.py, symlinking into the temp folder on Windows requires Admin privileges, so instead of symlinking I'm just copying files.
    4. in _sync.py, os.path.samefile returns an error: "OSError: [WinError 1] Incorrect function: 'nul'"

    So after those changes and installing the dependencies it started to work from the command line like this: OCRmyPDF.exe input.pdf output.pdf

    Dependencies and binaries I'm using:
    https://www.python.org/ftp/python/3.7.5/python-3.7.5-amd64.exe
    https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe
    https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs950/gs950w64.exe
    https://github.com/qpdf/qpdf/releases/download/release-qpdf-9.0.2/qpdf-9.0.2-bin-msvc64.zip

    Add paths to PATH variable:
    set PATH=%PATH%;C:\Program Files\Tesseract-OCR;
    set PATH=%PATH%;C:\Program Files\gs\gs9.50\bin;
    set PATH=%PATH%;C:\qpdf\qpdf-9.0.2-bin-msvc64\qpdf-9.0.2\bin;

    python setup.py build
    OCRmyPDF.exe input.pdf output.pdf
    

    Expected behavior Can we add some workarounds using conditions based on os type?
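
One way such OS-based conditions could look is sketched below. This is an illustration of the idea only, not the code from the fork or from OCRmyPDF itself; the function names are hypothetical, and the executable names are the ones mentioned in the fixes above.

```python
import os
import shutil
import tempfile


def ghostscript_executable():
    """Pick the Ghostscript binary name for the current OS."""
    return "gswin64c" if os.name == "nt" else "gs"


def link_or_copy(src, dst):
    """Symlink src to dst; fall back to copying when symlinks are not permitted."""
    try:
        os.symlink(src, dst)
    except (OSError, NotImplementedError):
        # For example, Windows without the privilege to create symlinks
        shutil.copy2(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    origin = os.path.join(tmp, "stdin")
    with open(origin, "wb") as f:
        f.write(b"%PDF-1.4\n")
    target = os.path.join(tmp, "origin.pdf")
    link_or_copy(origin, target)
    # Whether linked or copied, the target reads back the same bytes
    with open(target, "rb") as f:
        copied = f.read()

print(ghostscript_executable(), copied)
```

Trying the symlink first and falling back to a copy keeps the existing behavior on systems where symlinks work, and avoids a separate code path keyed on the OS name.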

    System:

    • OS: Windows 10
    • OCRmyPDF Version: v9.0.5

    Additional context

    enhancement 
    opened by dibu28 57
  • OSError: cannot load library 'C:\Program Files\Tesseract-OCR\liblept-5.dll': error 0x7e


    As in #631, I am getting the same error, except with 0x7e instead of 0x7f.
    I am using Python 3.9.2 64-bit, Windows 10 64-bit, and OCRmyPDF 12.5.0. I can't solve the problem the way #631 was solved, that is, by changing leptonica.py to open zlib.dll before liblept-5.dll.

    When I run ocrmypdf --help or ocrmypdf --version, it displays the same OSError.

    Does anyone know what to do? @jbarlow83

    opened by meet1919 28
  • Add interword space option to HOCR pdf renderer


    This pull request adds a new advanced option --interword-spaces to OCRmyPDF to allow the hocr renderer to produce PDF output compatible with PDF.js and potentially other viewers that have difficulty detecting phrases, lines, and paragraphs in separately placed text layers. This new switch is a workaround for limitations of the PDF.js viewer described in https://github.com/jbarlow83/OCRmyPDF/issues/133.

    Background

    OCRmyPDF justifiably prioritizes the accurate placement of words on the text layer as individual glyphs. Most PDF viewers have heuristics that allow them to identify paragraphs, lines, and phrases while searching and to insert the correct inter-word spacing when copying and pasting. PDF.js has over 80 issues flagged with 4-text-selection and there have been a number of pull requests to address the issue that have apparently gotten bogged down with edge cases, performance concerns, and perhaps the inherent challenges of a pure Javascript and HTML approach to PDF rendering.

    Strategy

    The goal of this pull request is to add an unobtrusive option to OCRmyPDF to allow it to produce PDF.js compatible output for those that must support PDF.js as a business requirement. Specifically, this PR follows the code conventions by adding an advanced option --interword-spaces to the options parser and ensures this option is available to the hocrtransform.py renderer. When set to true, the HOCR renderer will add an additional space at the end of each text element before drawing it on the text layer. This option does not apply to other pdf renderers in OCRmyPDF, is turned off by default, and issues a warning if used without the --pdf-renderer hocr option also set.
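
The effect of the option can be pictured with a simplified model. The sketch below is not the hocrtransform.py code; it assumes a viewer that simply concatenates the text of separately placed elements, and the function name `text_elements` is hypothetical.

```python
def text_elements(words, interword_spaces=False):
    """Return the strings that would be drawn, one per recognized word."""
    # With the option on, each element gets one trailing space
    suffix = " " if interword_spaces else ""
    return [word + suffix for word in words]


words = ["Some", "recognized", "words"]

# A naive viewer that joins element text without inferring word gaps
print("".join(text_elements(words)))
print("".join(text_elements(words, interword_spaces=True)))
```

Under this model, the default output runs the words together, while the output with the option enabled keeps them separated, which is the behavior the pull request describes for PDF.js.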

    Documentation

    This PR added a new section to the advanced documentation for the new option, a note on the 'hocr' renderer description about the option in the same file, and a note that this is available in the introduction where there is a relevant discussion of PDF as a layout format dependent on the viewer to interpret the structure of the document in terms of words, sentences, and paragraphs.

    Testing

    We confirmed that existing tests that exercised this code continue to pass. We encountered some seemingly preexisting failures in other tests. We explored the option of adding additional tests to confirm the warning is provided and the output is as expected, and would welcome guidance as to where that test should be placed and how best to combine it with RENDERERS tests in the test_main.py or the more specific test_hocrtransform.py.

    Sample PDF Output

    The following file was processed with this option set to true. When loaded into the latest PDF.js viewer, multi-word search and copy and paste are improved over the standard HOCR output:

    Input PDF: https://github.com/logikcull/OCRmyPDF/blob/master/tests/resources/linn.pdf

    # original command 
    ocrmypdf --output-type pdf --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.pdf
    

    Output PDF: https://www.dropbox.com/s/2ugzxldqsvy8q6x/output.linn.hocr.pdf

    Behavior when loaded into latest PDF.js viewer -- note that you have to remove spaces to find multiple words. Selecting and pasting the text also has spaces removed:

    # command with new --interword-spaces option
    ocrmypdf --output-type pdf --interword-spaces --pdf-renderer hocr ./tests/resources/linn.pdf output.linn.hocr.interword.pdf
    

    Output PDF: https://www.dropbox.com/s/tukp6ftpjebe1gh/output.linn.hocr.interword.pdf

    Behavior when loaded into latest PDF.js viewer -- note that you can find multiple words separated by spaces. Copy and paste is also improved.

    Testing in Adobe Reader and Chrome's native PDF viewer showed that files rendered with the new option continued to perform as well or better when searching and copying and pasting. Apple Preview handled neither output file particularly well so we think there has at least been no harm done.

    Alternative Approaches

    If this approach of adding a new option with a warning if used without hocr is too disruptive, we could also consider contributing a new pipeline task for a fifth renderer titled 'hocr-sloppy-text' or something similar that runs a nearly identical version of hocrtransform.py with the space suffix turned on by default. This approach has the serious downside of repeating complex code, but the upside of leaving the existing hocr renderer behavior 100% unchanged and opening the way in the future for other "sloppy-text" fixes required to produce PDFs for simpler viewers like PDF.js.

    Related Issues

    OCRMyPDF:

    • 133: Some hints that Tesseract upgrades might provide some relief, but underlying conclusion was that PDF.js has a naive implementation of text selection and word boundaries (https://github.com/jbarlow83/OCRmyPDF/issues/133).

    Tesseract:

    • 1235 December 2017: https://github.com/tesseract-ocr/tesseract/issues/1235 includes good explanation of reason for space detection issues: "Known problem. Root cause is PDF spec which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics."
    • 699 https://github.com/tesseract-ocr/tesseract/issues/699#issuecomment-277486345
    • 382 https://github.com/tesseract-ocr/tesseract/issues/382
    • 337 https://github.com/tesseract-ocr/tesseract/issues/337

    PDF.js:

    • 7310: Super helpful discussion of HTML divs: https://github.com/mozilla/pdf.js/issues/7310
    • 6657: https://github.com/mozilla/pdf.js/issues/6657
    • Related PR not merged: https://github.com/mozilla/pdf.js/pull/5783
    • Dozens of text selection issues: https://github.com/mozilla/pdf.js/issues?q=is%3Aopen+is%3Aissue+label%3A4-text-selection
    opened by cforcey 28
  • NixOS packaging issues


    Hi there

    I'm currently trying to write a package file for OCRmyPDF for NixOS. I think I'm already pretty far, but now I'm stuck on an error that I have no idea how to fix, as it doesn't seem to give any indication of where the problem actually occurs.

    Anyway, I do get this error when it's trying to build OCRmyPDF:

    building path(s) ‘/nix/store/kdpr7qaz85lrls5mwqyvgrfi5v811i5q-ORCmyPDF-5.4.3’
    unpacking sources
    unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
    source root is source
    setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
    patching sources
    configuring
    building
    Skipping external program tests because of --force
    Traceback (most recent call last):
      File "nix_run_setup.py", line 8, in <module>
        exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
      File "setup.py", line 245, in <module>
        zip_safe=False)
      File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
        _Distribution.__init__(self, attrs)
      File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
        self.finalize_options()
      File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
        ep.load()(self, ep.name, value)
      File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
        add_cffi_module(dist, cffi_module)
      File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
        execfile(build_file_name, mod_vars)
      File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
        src = f.read()
      File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
    builder for ‘/nix/store/jsnfzz199dy49viv14l1is2i1d2r3lq9-ORCmyPDF-5.4.3.drv’ failed with exit code 1
    cannot build derivation ‘/nix/store/niq3y1rw30sqx5gp5jwrd273hlv6xhb2-system-path.drv’: 1 dependencies couldn't be built
    cannot build derivation ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
    error: build of ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed
    

    The current nix expression that I use to try to build it looks like:

    { lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:
    
    with python3.pkgs;
    
    let
    
      ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
      img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};
    
    in
    
    buildPythonApplication rec {
      version = "5.4.3";
      name = "ORCmyPDF-${version}";
    
      src = fetchFromGitHub {
        owner = "jbarlow83";
        repo = "OCRmyPDF";
        rev = version;
        sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
      };
    
      postPatch = ''
        substituteInPlace requirements.txt \
          --replace "ruffus == 2.6.3" "ruffus" \
          --replace "Pillow == 4.3.0" "Pillow" \
          --replace "reportlab == 3.4.0" "reportlab" \
          --replace "PyPDF2 == 1.26.0" "PyPDF2" \
          --replace "img2pdf == 0.2.4" "img2pdf" \
          --replace "cffi == 1.11.2" "cffi"
        substituteInPlace test_requirements.txt \
          --replace "pytest >= 3.0" "pytest"
        export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
      '';
    
      buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];
    
      propagatedBuildInputs = [
        ruffus
        pillow
        reportlab
        pypdf2
        img2pdf
        cffi
        unpaper
        ghostscript
        tesseract
        qpdf
      ];
    
      meta = {
        homepage = https://github.com/jbarlow83/OCRmyPDF;
        description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
        license = lib.licenses.mit;
        maintainers = with lib.maintainers; [ hyper_ch ];
      };
    }
    

    I understand that there seems to be a problem with one of the files but I can't figure out where the problem actually occurs.
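
The traceback ends in encodings/ascii.py, which suggests a file was read with ASCII as the default encoding. The sketch below reproduces the same class of error with made-up data: the byte string is a hypothetical file header, not the actual file from the build. Setting a UTF-8 locale in the build environment is one commonly used remedy, but that is an assumption and was not verified against this build.

```python
# Hypothetical file header: "# " followed by the UTF-8 bytes of a non-ASCII character
data = b"# \xc2\xa9 example header\n"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that ASCII cannot decode
    failure = (exc.start, hex(data[exc.start]))
    print("ascii decoding failed at", failure)

# The same bytes decode cleanly when UTF-8 is used
text = data.decode("utf-8")
print(text)
```

A file opened without an explicit encoding uses the locale's preferred encoding, so the same source file can build on one system and fail on another.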

    opened by sjau 26
  • ocrmypdf 11.4.4 failed to build on apple silicon


    Describe the bug ocrmypdf 11.4.4 failed to build on apple silicon

    build error message (run log url):

    ==> /opt/homebrew/Cellar/ocrmypdf/11.4.4/bin/ocrmypdf -f -q --deskew /opt/homebrew/Library/Homebrew/test/support/fixtures/test.pdf ocr.pdf
    Traceback (most recent call last):
      File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/bin/ocrmypdf", line 5, in <module>
        from ocrmypdf.__main__ import run
      File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/libexec/lib/python3.9/site-packages/ocrmypdf/__init__.py", line 10, in <module>
        from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
      File "/opt/homebrew/Cellar/ocrmypdf/11.4.4/libexec/lib/python3.9/site-packages/ocrmypdf/leptonica.py", line 174, in <module>
        def _stderr_handler(cstr):
    MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks
    

    To Reproduce pip installation and run on darwin arm64 system?

    Expected behavior build successfully

    System (please complete the following information):

    • OS: OSX darwin arm64
    • Python version: python 3.9
    • OCRmyPDF version: Ocrmypdf 11.4.4

    Additional context relates to https://github.com/Homebrew/homebrew-core/pull/68159

    bug 
    opened by chenrui333 25
  • dependency problem reportlab - although installed...


    Issue by andreasotto Tue Nov 4 10:44:25 2014 Originally opened as https://github.com/fritz-hh/OCRmyPDF/issues/99


    # ./OCRmyPDF.sh /home/ao/Leerungstermine189973.PDF /home/ao/test.pdf
    Please install the python library reportlab. Exiting...
    
    # apt-get install python-reportlab
    python-reportlab ist schon die neueste Version.
    

    .. already installed.

    Debian 6 squeeze

    opened by OCRmyPDF-issuebot 25
  • Using Ubuntu Snap as packaging format


    I took the liberty of creating a snap application recipe "snapcraft.yaml" which enables snapcraft's build platform to build a working snap application for ocrmypdf.

    Take a look here: https://github.com/alexanderlanganke/ocrmypdf-snap

    While building it pulls in the application using PIP so that it always uses the most recent version. This may make it easier for users to access ocrmypdf.

    So far I am getting the application to build and run but am running into a missing dependency during runtime. I believe I need to adjust the path for one or two libraries.

    I have also registered this snap (private for now) on snapcraft.

    If you are interested, and I get it working, I would offer to maintain this snap for you or pass it on to you if you wish to do it yourself. Credit for the application will of course go to you! Snapcraft pulls from github so you basically need to get it working once and never touch it again. It will rebuild whenever you push to the linked repository (version bump for example).

    opened by alexanderlanganke 23
  • [13.4.2] lossy compression of pngs into jpegs when it shouldn't


    1. It might be just the older version, but ocrmypdf 12.7.2 seems to compress uncompressed pngs into (lossy) jpegs:
    $ ocrmypdf --version
    12.7.2
    $ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
    $ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
    $ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
    $ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
    $ pdfimages -list ./Example-uncompress-compress.pdf
    page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
    --------------------------------------------------------------------------------------------
       1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%
    

    I believe it should be running the image through pngquant instead at optimize level 1.

    2. Btw, it's probably not even worth mentioning since, looking at the changelog, I'm fairly certain you've already sorted it out in recent ocrmypdf versions, but small pdfs with small pngs grow instead of shrinking / remaining the same:
    $ ocrmypdf --version
    12.7.2
    $ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
    $ img2pdf ./Example.png -o ./Example.pdf
    $ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
    $ stat -c "%n,%s" Example*.* | column -t -s,
    Example-compress.pdf  7799
    Example.pdf           3906
    Example.png           2335
    

    Though this might also be the pdf format changing to the archival specs...

    3. As a side note, if compute time isn't a factor, I personally found 'optipng -o7' to produce smaller pngs than pngquant and 'jpegrescan -i -t -v' to produce the smallest jpeg, even compared to MozJPEG despite the author saying otherwise oddly enough.

    p.s. forgot to mention the png-to-jpeg bug also happens with some compressed pngs but I haven't bothered trying to replicate this since I believe it should never try to convert bitmap images to jpegs to begin with.

    opened by RamKromberg 21
  • Anaconda - Successful Install but not working


    Describe the bug (*update: 2022-04-22): Reorder sentences

    What's the problem? I tried installing ocrmypdf using Conda on Windows; it looks successful. I tried to run tesseract tests.jpg, and it works fine. (ocrmypdf) C:\Users\Denz\Downloads>tesseract test.jpg test

    But whenever I run a test pdf, it doesn't output the OCR text. Here is the error log:

    (ocrmypdf) C:\Users\Denz\Downloads>ocrmypdf --force-ocr NeedOCR2.pdf output.pdf
    Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.11page/s]
        1 page already has text! - rasterizing text and running OCR anyway
        1 [tesseract] read_params_file: Can't open pdf
        1 [tesseract] read_params_file: Can't open txt
    OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:07<00:00,  7.41s/page]
    Postprocessing...
    PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.20s/page]
    Recompressing JPEGs: 0image [00:00, ?image/s]
    Deflating JPEGs: 0image [00:00, ?image/s]
    JBIG2: 0item [00:00, ?item/s]
    Optimize ratio: 1.00 savings: -0.0%
    Image optimization did not improve the file - optimizations will not be used
    Output file is a PDF/A-2B (as expected)
    

    my Environment Packages inside Conda

    (ocrmypdf) C:\Users\Denz\Downloads>conda list
    # packages in environment at C:\Users\Denz\anaconda3\envs\ocrmypdf:
    #
    # Name                    Version                   Build  Channel
    bzip2                     1.0.8                h8ffe710_4    conda-forge
    ca-certificates           2021.10.8            h5b45459_0    conda-forge
    cffi                      1.15.0                   pypi_0    pypi
    chardet                   4.0.0                    pypi_0    pypi
    colorama                  0.4.4                    pypi_0    pypi
    coloredlogs               15.0.1                   pypi_0    pypi
    cryptography              36.0.2                   pypi_0    pypi
    ghostscript               9.54.0               h0e60522_2    conda-forge
    humanfriendly             10.0                     pypi_0    pypi
    img2pdf                   0.4.3                    pypi_0    pypi
    jbig                      2.1               h8d14728_2003    conda-forge
    jpeg                      9e                   h8ffe710_0    conda-forge
    leptonica                 1.78.0               h688788b_4    conda-forge
    lerc                      3.0                  h0e60522_0    conda-forge
    libarchive                3.5.2                habf0b7a_1    conda-forge
    libdeflate                1.10                 h8ffe710_0    conda-forge
    libffi                    3.4.2                h8ffe710_5    conda-forge
    libiconv                  1.16                 he774522_0    conda-forge
    libpng                    1.6.37               h1d00b33_2    conda-forge
    libtiff                   4.3.0                hc4061b1_3    conda-forge
    libwebp                   1.2.2                h57928b3_0    conda-forge
    libwebp-base              1.2.2                h8ffe710_1    conda-forge
    libxml2                   2.9.12               hf5bbc77_2    conda-forge
    libzlib                   1.2.11            h8ffe710_1014    conda-forge
    lxml                      4.8.0                    pypi_0    pypi
    lz4-c                     1.9.3                h8ffe710_1    conda-forge
    lzo                       2.10              he774522_1000    conda-forge
    ocrmypdf                  13.4.1                   pypi_0    pypi
    openjpeg                  2.4.0                hb211442_1    conda-forge
    openssl                   3.0.2                h8ffe710_1    conda-forge
    packaging                 21.3                     pypi_0    pypi
    pdfminer-six              20211012                 pypi_0    pypi
    pikepdf                   5.1.1                    pypi_0    pypi
    pillow                    9.0.1                    pypi_0    pypi
    pip                       22.0.4             pyhd8ed1ab_0    conda-forge
    pluggy                    1.0.0                    pypi_0    pypi
    pngquant                  1.0.7                    pypi_0    pypi
    pycparser                 2.21                     pypi_0    pypi
    pyparsing                 3.0.7                    pypi_0    pypi
    pyreadline3               3.4.1                    pypi_0    pypi
    python                    3.10.4          hcf16a7b_0_cpython    conda-forge
    python_abi                3.10                    2_cp310    conda-forge
    reportlab                 3.6.9                    pypi_0    pypi
    setuptools                61.3.0          py310h5588dad_0    conda-forge
    sqlite                    3.37.1               h8ffe710_0    conda-forge
    tesseract                 5.0.1                h17c68af_0    conda-forge
    tk                        8.6.12               h8ffe710_0    conda-forge
    tqdm                      4.63.1                   pypi_0    pypi
    tzdata                    2022a                h191b570_0    conda-forge
    ucrt                      10.0.20348.0         h57928b3_0    conda-forge
    vc                        14.2                 hb210afc_6    conda-forge
    vs2015_runtime            14.29.30037          h902a5da_6    conda-forge
    wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
    xz                        5.2.5                h62dcd97_1    conda-forge
    zlib                      1.2.11            h8ffe710_1014    conda-forge
    zstd                      1.5.2                h6255e5f_0    conda-forge
    

    System (please complete the following information):

    • OS: Windows 10
    • Python version: 3.10
    • OCRmyPDF version: 13.4.1

    Installation Installed via Pip

    Additional context I believe this issue is a similar problem, but the fix there was done on Linux. I don't know how to fix it under conda.


    Here are the before & after files NeedOCR2.pdf output.pdf

    https://github.com/ocrmypdf/OCRmyPDF/issues/773

    third party issue 
    opened by denzchoe 21
  • Segmentation fault when using pipes


    Describe the bug When running ocrmypdf through podman/docker I sometimes (#864) experience segmentation faults and the container hangs indefinitely. The output file is empty.

    To Reproduce The following command reproduces the failure. Because the behavior of ocrmypdf is non-deterministic, it might take a while or even multiple loops to reproduce.

    for i in $(seq 0 100); do
        podman run --rm -i ocrmypdf --verbose -rcd --jbig2-lossy -l deu - - <tmp.pdf >out.pdf
    done
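
A loop like the one above can also be scripted so that failing attempts are recorded instead of being spotted by eye. The sketch below is a hypothetical harness, not part of OCRmyPDF: it pipes input through a command several times and notes non-zero exit codes, empty output, and timeouts. A stand-in command that copies stdin to stdout is used so the sketch runs without podman.

```python
import subprocess
import sys


def run_repeatedly(command, input_bytes, attempts, timeout=60):
    """Run command `attempts` times with input on stdin; return failing attempts."""
    failures = []
    for attempt in range(attempts):
        try:
            result = subprocess.run(
                command, input=input_bytes,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                timeout=timeout, check=False,
            )
        except subprocess.TimeoutExpired:
            # The process hung; record it with no exit code
            failures.append((attempt, None, 0))
            continue
        # On POSIX a negative return code means the process was killed by a
        # signal. Empty output is treated as a failure as well.
        if result.returncode != 0 or not result.stdout:
            failures.append((attempt, result.returncode, len(result.stdout)))
    return failures


# Stand-in command: copies stdin to stdout
stand_in = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
print(run_repeatedly(stand_in, b"%PDF-1.4\n", attempts=3))
```

To apply it to this report, `stand_in` would be replaced by the podman command as a list, for example `["podman", "run", "--rm", "-i", "ocrmypdf", "-", "-"]`, with the contents of tmp.pdf as `input_bytes`.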
    

    All of the options can be omitted and the issue is reproducible. The resulting log is:

    ocrmypdf 12.6.0.post6+g42713b77.d20211012
    Running: ['tesseract', '--list-langs']
    stdout/stderr = List of available languages (7):
    chi_sim
    deu
    eng
    fra
    osd
    por
    spa
    
    Running: ['unpaper', '--version']
    Found unpaper 6.1
    Running: ['tesseract', '--version']
    Found tesseract 4.1.1
    Running: ['gs', '--version']
    Found gs 9.53.3
    reading file from standard input
    os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/stdin, /tmp/ocrmypdf.io.yzr1_6f6/origin.pdf)
    Using Tesseract OpenMP thread limit 3
        1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
        1 Rotating output by 0
        1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/ocrmypdf.io.yzr1_6f6/000001_rasterize_preview.jpg', 'stdout']
        1 page is facing ⇧, confidence 7.23 - no change
        1 Rasterize with pnggray, rotation 0
        1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
        1 Rotating output by 0
        1 Running: ['unpaper', '-v', '--dpi', '150.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpmqv67lqw/input.pnm', '/tmp/tmpmqv67lqw/output.pgm']
        1 stdout/stderr = [image2 @ 0x55a80053afc0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
    [image2 @ 0x55a80053afc0] Encoder did not produce proper pts, making some up.
    unpaper 6.1
    License GPLv2: GNU GPL version 2.
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    
    -------------------------------------------------------------------------------
    Processing sheet #1: /tmp/tmpmqv67lqw/input.pnm -> /tmp/tmpmqv67lqw/output.pgm
    input-file for sheet 1: /tmp/tmpmqv67lqw/input.pnm
    output-file for sheet 1: /tmp/tmpmqv67lqw/output.pgm
    sheet size: 1232x1718
    ...
    noise-filter ... deleted 47 clusters.
    blur-filter... deleted 0 pixels.
    writing output.
    
        1 resolution (150.01239999999999, 150.01239999999999)
        1 convert
        1 PIL format = PNG
        1 imgformat = PNG
        1 input dpi = 150 x 150
        1 rotation = 0°
        1 input colorspace = L
        1 width x height = 1232px x 1718px
        1 read_images() embeds a PNG
        1 convert done
        1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr.png', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr_tess', 'pdf', 'txt']
        1 Emplacement update
        1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
        1 Grafting
        1 Page rotation: (content, auto) -> page = (0, 0) -> 0
    Postprocessing...
    os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/graft_layers.pdf, /tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf)
    Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf', '/tmp/ocrmypdf.io.yzr1_6f6/pdfa.ps']
    GPL Ghostscript 9.53.3 (2020-10-01)
    Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
    This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
    see the file COPYING for details.
    Processing pages 1 through 1.
    Page 1
    Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
    The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
    Treating 18 as an optimization candidate
    XrefExt(xref=18, ext='.png')
    Optimizable images: JPEGs: 0 PNGs: 1
    Treating 18 as an optimization candidate
    Optimizable images: JBIG2 groups: (0,)
    Optimize ratio: 1.00 savings: 0.0%
    os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/optimize.opt.pdf, /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf)
    /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf -> -
    Output sent to stdout
    

    dmesg yields:

    [21719.464718] conmon[91767]: segfault at 111d000 ip 00007fcf434cf980 sp 00007ffc7f66d4e8 error 4 in libc.so.6[7fcf43380000+176000]
    [21719.464741] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed
    

    (Always the same location in libc)

    After replacing >out.pdf with tee out.pdf, I could at some point see strange characters being emitted after %%EOF (?). Most of the time, however, it hangs before that.

    Example file The example file is attached in encrypted form. tmp.pdf.gpg.zip

    Expected behavior The output file should be correct and the tool should not hang.
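One way to catch both symptoms described above (a hang and an empty output file) while looping is to run each attempt under a timeout and classify the result. The following is only a sketch: the helper name `run_with_timeout` and its status labels are hypothetical, and a small Python one-liner stands in for the real `podman run ... - -` command. Killing the client process on timeout may not stop a container that is still running.

```python
import subprocess
import sys


def run_with_timeout(cmd, stdin_bytes, timeout_s):
    """Run cmd with stdin_bytes piped in and return (status, stdout_bytes).

    status is one of "ok", "empty", "error" or "hang" (hypothetical labels).
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hang", b""
    if proc.returncode != 0:
        return "error", proc.stdout
    if not proc.stdout:
        return "empty", b""
    return "ok", proc.stdout


# Stand-in for the real command: copies stdin to stdout unchanged.
echo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
status, data = run_with_timeout(echo_cmd, b"%PDF-1.6 stand-in", 10)
print(status, len(data))
```

In a real run, `cmd` would be the podman command line from the report and `stdin_bytes` the contents of tmp.pdf; any attempt that comes back as "hang" or "empty" matches the failure described here.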

    System

    • OS: Fedora 35
    • OCRmyPDF Version: 12.6.0.post6+g42713b77.d20211012, but reproducible just as well with jbarlow83/ocrmypdf:v13.2.0, jbarlow83/ocrmypdf:v13.1.1 and jbarlow83/ocrmypdf:v13.1.0
    • How did you install ocrmypdf? podman pull jbarlow83/ocrmypdf
    third party issue 
    opened by Fulguritus 20
  • White glyphs when selecting ocr-text in Evince

    White glyphs when selecting ocr-text in Evince

    Problem in evince pdf reader:

    screenshot_20180405_101240

    It only happens when selecting. Is this a display failure? Missing fonts? Otherwise the OCR text is correct. Similar to #178?

    opened by robinrosenstock 20
  • Feature request: Ask user what likely-incorrect words are

    Feature request: Ask user what likely-incorrect words are

    OCRmyPDF is great as it is. It is an excellent tool for OCRing PDFs without any human involvement.

    However, if a human is available, their involvement could be put to good use.

    Problem Tesseract+OCRmyPDF doesn't OCR every word correctly when this is desired. When outputting to PDFs, correcting such PDFs is more difficult than correcting outputted text.

    Proposed solution Tesseract generates a low confidence value for words it has difficulty working out the glyphs for. I understand OCRmyPDF checks all words against a dictionary for the selected language as part of its existing process.

    A word that is likely wrong could have the part of the image containing its sentence presented to the user (with the word identified with a red box), the user asked what the word is (like a CAPTCHA), and the OCR results amended. If a word the user provides isn't in the dictionary, they should be asked if they want to add it or not.

    This would happen in parallel with the main processing. Sentences containing words identified for checking would cumulatively fill the screen, waiting for human response.

    The proposed functionality would obviously not be default, and should have appropriate user settings for adjustment.

    Describe alternatives you've considered Get OCRmyPDF to output hOCR and PDF files simultaneously, then go through pages manually using gImageReader. It would work, but more slowly than the proposed method.
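As a rough illustration of the first step of this idea, the sketch below picks out low-confidence words from hOCR text. It assumes Tesseract-style hOCR, where each word is a `span` of class `ocrx_word` whose `title` attribute carries `bbox` and `x_wconf` values, and that word spans contain no nested spans. The class name `LowConfidenceWords`, the function `find_low_confidence`, the threshold of 60 and the sample markup are all made up for the example; none of this is part of OCRmyPDF.

```python
import re
from html.parser import HTMLParser


class LowConfidenceWords(HTMLParser):
    """Collect (word, confidence, bbox) for words below a confidence threshold."""

    def __init__(self, threshold):
        super().__init__()
        self.threshold = threshold
        self.flagged = []
        self._pending = None  # (confidence, bbox) of the currently open word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            conf = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)", title)
            bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
            if conf:
                box = tuple(int(v) for v in bbox.groups()) if bbox else None
                self._pending = (float(conf.group(1)), box)
                self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._pending is not None:
            conf, box = self._pending
            if conf < self.threshold:
                self.flagged.append(("".join(self._text).strip(), conf, box))
            self._pending = None


def find_low_confidence(hocr, threshold=60.0):
    parser = LowConfidenceWords(threshold)
    parser.feed(hocr)
    parser.close()
    return parser.flagged


# Made-up sample markup in the assumed hOCR shape.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 400 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 100 10 190 40; x_wconf 41'>w0rld</span>
</span>
"""
print(find_low_confidence(SAMPLE))
```

The bounding box returned for each flagged word is what an interactive tool would need in order to crop the page image and draw the red box around the word in question.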

    opened by mattention 0
  • Is it possible to capture Tesseract messages and suggestions either as exceptions or exit codes?

    Is it possible to capture Tesseract messages and suggestions either as exceptions or exit codes?

    Is your feature request related to a problem? Please describe. Sometimes when running OCR jobs with redo_ocr, I can see certain suggestions like rescanning the file with force_ocr from OCRmyPDF and similar observations about the quality of text from Tesseract. Is it possible to somehow capture these messages, so that I can programmatically filter those files out and rerun OCR with the recommended parameters?

    Describe the solution you'd like A status code or custom exception to catch and retry the running job.

    Describe alternatives you've considered Filtering stdout and looking for said keywords.
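The keyword-filtering alternative could look roughly like the sketch below: run the job as a subprocess, keep its exit code, and scan whatever it printed for marker strings. The keyword tuple, the helper names `find_hints` and `run_and_collect_hints`, and the sample log lines are assumptions for illustration only; the actual wording of the messages would have to be taken from real output.

```python
import subprocess

# Hypothetical marker strings; replace with text seen in real log output.
HINT_KEYWORDS = ("force-ocr", "force_ocr")


def find_hints(log_text, keywords=HINT_KEYWORDS):
    """Return the log lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(keyword.lower() in lowered for keyword in keywords):
            hits.append(line.strip())
    return hits


def run_and_collect_hints(cmd, keywords=HINT_KEYWORDS):
    """Run cmd and return (exit_code, matching_lines) from stdout and stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, find_hints(proc.stdout + "\n" + proc.stderr, keywords)


# Made-up log text, not real OCRmyPDF output.
sample_log = "page 1 done\nhint: consider rerunning with force_ocr\npage 2 done\n"
print(find_hints(sample_log))
```

Files whose run produces a non-empty hint list could then be queued for a second pass with different parameters.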

    Example file N/A

    Additional context N/A

    opened by sergeyyurkov1 0
  • [BUG] `--deskew` not compatible with blank pages or with tesseract_timeout = 0

    [BUG] `--deskew` not compatible with blank pages or with tesseract_timeout = 0

    Describe the bug The --deskew option is not behaving as expected on Ocrmypdf 13.7.0. I am experiencing two issues related to deskew.

    Issue 1: Deskew not working on blank pages

    I'm using the following options --output-type=pdf --tesseract-timeout=30 on this blank_image.pdf. When I run the Ocrmypdf command above, I get a SubprocessOutputError. I see that issue is referenced here: https://github.com/ocrmypdf/OCRmyPDF/issues/868, but I don't think the bug fix covered all scenarios.

    Issue 2: Deskew not working with tesseract_timeout=0

    I want to deskew PDFs without running OCR on them, as mentioned in the docs here. However, when --tesseract-timeout=0, the document is not being deskewed because OCR is not being run. If I change --tesseract-timeout to a different integer, it successfully deskews. Here is a skewed PDF that can be used to reproduce the issue: skewed_text.pdf

    To Reproduce Issue 1: Use blank_image.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=30 blank_image.pdf result.pdf. Issue 2: Use skewed_text.pdf and run ocrmypdf --deskew --output-type=pdf --tesseract-timeout=0 skewed_text.pdf result_pdf.

    Expected behavior I expect that blank pages do not completely block the ocrmypdf command from running. It should be able to gracefully handle the error and skip deskewing that specific page. I also expect that with --tesseract-timeout=0 the page can be deskewed without having OCR applied.

    Screenshots Deskew with 0 second timeout: skewed_with_0_second_timeout. Deskew with 30 second timeout: skewed_with_30_second_timeout.

    System (please complete the following information):

    • OS: MacOS Ventura 13.0.1
    • OCRmyPDF version: 13.7.0

    Installation brew install ocrmypdf

    opened by deexpabada 0
  • Spaces in Japanese

    Spaces in Japanese

    Hi all! I wonder if it is possible to do OCR having all spaces completely ignored in the outcome? Languages like Japanese do not really use any spaces (even after commas or periods), but currently OCRmyPDF seems to find spaces between almost every character, which is very problematic when you want to search for sentences/words in the document, or google translate parts of it... Thank you in advance!
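Until the OCR output itself can be controlled, one possible workaround for searching or translating is to post-process text extracted from the PDF and drop whitespace that sits between two CJK characters. This is only a sketch: the function name `strip_cjk_spaces` is made up, the character ranges are an approximation (CJK punctuation, hiragana, katakana, common kanji, full-width forms), and it cleans extracted text only, not the text layer inside the PDF.

```python
import re

# Approximate CJK ranges: punctuation, hiragana, katakana, kanji, full-width forms.
_CJK = r"\u3001-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Spaces, tabs or ideographic spaces with a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text):
    """Remove whitespace between CJK characters; leave other spacing alone."""
    return _SPACE_BETWEEN_CJK.sub("", text)


print(strip_cjk_spaces("日 本 語 の テ キ ス ト"))
print(strip_cjk_spaces("OCR my PDF 日 本"))
```

Spaces between Latin words, and between a Latin word and a CJK character, are kept, so mixed-language documents are not collapsed.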

    opened by KajiyaOokami 3
  • Ignore Digital Signed Documents

    Ignore Digital Signed Documents

    Hi,

    Is there a way to ignore digitally signed documents? And were there any changes recently? I would swear that a year ago digitally signed documents would just throw an error.

    Thanks.

    opened by flaviobrunopereira 0
  • Draw/Blanking on wrong spot

    Draw/Blanking on wrong spot

    tesadasgfdgdf.pdf 000001_ocr

    Settings: {"redo_ocr":true,"language":"deu+eng","clean":true}

    The rectangle is always too low, and that's why the output comes out completely wrong.

    Can you please look into it? I tried everything, but the result is still the same. If I change the font to something else and switch back, everything is right.

    opened by emre1e 0
Releases(v4.0)
  • v4.0(Feb 17, 2016)

    • Automatic page rotation (-r) is now available. It ignores any prior rotation information on PDFs and sets rotation based on the dominant orientation of detectable text. This feature is fairly reliable, but some false positives occur, especially if there is not much text to work with. (#4)
    • Deskewing is now performed using Leptonica instead of unpaper. Leptonica is faster and more reliable at image deskewing than unpaper.
  • v3.2(Feb 5, 2016)

  • v3.1.1(Jan 10, 2016)

  • v3.1(Dec 4, 2015)

    • Default output format is now PDF/A-2b instead of PDF/A-1b
    • Python 3.5 and OS X El Capitan are now supported platforms - no changes were needed to implement support
    • Improved some error messages related to missing input files
    • Fixed issue #20 - uppercase .PDF extension not accepted
    • Fixed an issue where OCRmyPDF failed to detect that certain pages contained previously OCR'ed text, such as OCR text produced by Tesseract 3.04
    • Inserts /Creator tag into PDFs so that errors can be traced back to this project
    • Added new option --pdf-renderer=auto, to let OCRmyPDF pick the best PDF renderer. Currently it always chooses the 'hocrtransform' renderer but that behavior may change.
    • Set up Travis CI automatic integration testing
  • v3.0(Sep 14, 2015)

Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
JoplinPdf2Images - Converts a PDF to images in Joplin and adds it to the specified note as a printout

joplinPdf2Images Converts a PDF to images in Joplin and adds it to the specified

Morten Haahr Kristensen 2 Apr 20, 2022
Processes a PDF to make it compatible with PDF/X and with grayscale printers.

tratapdf Processes a PDF to make it compatible with PDF/X and with grayscale printers. dependencies icc-profiles ghostscript PDF viewer

null 1 Nov 30, 2021
Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox module of the Tkinter library.

RISHABH MISHRA 1 Feb 13, 2022
Split given PDF document into 4 page groups and convert them to booklet format

PUTO: PDF to Booklet converter Split given PDF document into 4 page groups and convert them to booklet format. It creates a PDF like shown below: Fir

null 3 Mar 12, 2022
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

Nemo 387 Dec 10, 2022
Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

Jonas Lejon 1.9k Jan 1, 2023
Merge multiple PDF files into one.

PDF Merger Merge multiple PDF files into one. Usage % python pdf_merger.py -h usage: pdf_merger.py [-h] [-o OUTPUT] [-f [FILES ...]] optional argumen

Duo Apps 6 Oct 3, 2022
Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 2, 2022
borb is a library for reading, creating and manipulating PDF files in python.

borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 1, 2023
Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

Hemachandran P 1 Nov 9, 2021
Program that locks/unlocks pdf files🐍

PDFtools Program that locks/unlocks PDF files Requirements • How to use • Screenshots Warning Change the paths referring

João Victor Vilela dos Santos 1 Nov 4, 2021
pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

null 1.6k Jan 3, 2023
Telegram bot that can do a lot of things related to PDF files.

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

null 130 Dec 26, 2022
Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

Will Fantom 1 Feb 9, 2022
A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 7, 2022
Python lib for Simple PDF text extraction

Python lib for Simple PDF text extraction

Jason Alan Palmer 651 Jan 1, 2023
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

Yusuke Shinyama 4.9k Jan 4, 2023