pdf_sprinkles: sprinkles text in your PDFs

Will Angley

Last update: Dec 17, 2021

Overview

pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI, and returns the result as a PDF with searchable text.

It runs on the command line or as a web server. The server version can be deployed to App Engine easily.

pdf_sprinkles has only been tested with English-language text, but it should work for most European languages that the Document AI API supports today. It is currently known not to work with RTL languages or CJK scripts.

Installation

pdf_sprinkles is experimental, so it's not packaged yet. To install:

1. Set up Google Cloud Document AI, following the quickstart.
2. Clone this repository and `cd` to it.
3. Create a virtualenv: `pdf_sprinkles$ virtualenv env`.
4. Install requirements: `pdf_sprinkles$ pip install -r requirements.txt`.

Save your location, processor_id and project_id in a flagfile:

pdf_sprinkles$ cat >flagfile
--location='your-location' # 'us' or 'eu'
--processor_id='your-processor-id'
--project_id='your-project-id'
pdf_sprinkles$
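
These three values are what Document AI needs to address a processor. A minimal sketch of how they fit together, assuming the standard regional endpoint and processor resource-name formats; the helper names `documentai_endpoint` and `processor_name` are hypothetical and not taken from pdf_sprinkles:

```python
def documentai_endpoint(location: str) -> str:
    """Regional Document AI endpoint for a location such as 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Full resource name of a Document AI processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


if __name__ == "__main__":
    # Placeholder values, matching the flagfile example.
    print(documentai_endpoint("eu"))
    print(processor_name("your-project-id", "eu", "your-processor-id"))
```

If the location in the flagfile does not match the region the processor was created in, requests are likely to fail, so it is worth checking both against the Cloud Console.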

Quickstart

Activate the virtualenv:

pdf_sprinkles$ . env/bin/activate

and invoke pdf_sprinkles_cli.py with your input and output:

(env) pdf_sprinkles$ ./pdf_sprinkles_cli.py --flagfile=flagfile --input=scan.pdf --output=scan-ocr.pdf

or invoke pdf_sprinkles_web.py and visit it at http://localhost:8888/ :

(env) pdf_sprinkles$ ./pdf_sprinkles_web.py --flagfile=flagfile

Usage

pdf_sprinkles_web.py

USAGE: ./pdf_sprinkles_web.py [flags]

./pdf_sprinkles_web.py:

--address: Address to bind to. (default: '127.0.0.1')
--[no]cloud_logging: Use cloud logging. (default: 'false')
--cookie_secret_id: ID of a cookie secret in Secrets Manager
--[no]debug: Starts Tornado in debugging mode. (default: 'false')
--port: Port to bind to (default: '8888') (an integer)
--self_link: If set, displays a self link in the header.

uimodules:

--faq_link: If set, displays an FAQ link in the footer.
--mailing_list_link: If set, displays a mailing list link in the footer.

pdf_sprinkles_cli.py

USAGE: ./pdf_sprinkles_cli.py [flags]

./pdf_sprinkles_cli.py:

--input: Path to input file
--output: Path to output file

Shared Flags

These flags can be set for both the CLI and Web frontends.

document_ai_ocr:

--location: Location of document processor (default: 'us')
--processor_id: ID of document processor
--project_id: Google Cloud project ID

third_party.hocr_tools.hocr_pdf:

--min_confidence: Minimum confidence of lines to include in output. (default: '0.9') (a number)
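
To illustrate what `--min_confidence` does, here is a minimal sketch of a confidence filter. The `OcrLine` type and `filter_lines` function are hypothetical, and treating the threshold as inclusive is an assumption; the real filtering lives in `third_party.hocr_tools.hocr_pdf` and may differ in detail:

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical stand-in for one recognized line of text."""
    text: str
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def filter_lines(lines, min_confidence=0.9):
    """Keep lines whose confidence is at least min_confidence (assumed inclusive)."""
    return [line for line in lines if line.confidence >= min_confidence]


if __name__ == "__main__":
    lines = [OcrLine("clear text", 0.98), OcrLine("smudged text", 0.42)]
    for line in filter_lines(lines):
        print(line.text)
```

Lowering the threshold includes more text in the output at the cost of more recognition errors in the searchable layer.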

pdf_sprinkles uses Abseil Flags, so you can define rarely changing flags in a file and import it with --flagfile=FILENAME.

Running on App Engine

IMPORTANT: this is only meant to be used in a trusted environment; Document AI requests are much costlier than normal web requests, and this can rapidly turn into a denial-of-wallet attack if running on the public Internet.

pdf_sprinkles ships with configs to run on a Python 3 Standard Environment runtime. It uses supervisord, with listening port and number of workers controlled by environment variables.
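
As a rough sketch of how environment-driven settings like these can be read, assuming `PORT` (the variable App Engine conventionally sets for the listening port) and `WORKERS` (the name used in the example `app.yaml`). The function name and the default values are assumptions for illustration, not pdf_sprinkles code:

```python
import os


def server_settings(environ=None):
    """Return (port, workers) read from environment variables.

    Defaults are assumptions: 8888 mirrors the --port default, and a
    single worker is used when WORKERS is unset.
    """
    environ = os.environ if environ is None else environ
    port = int(environ.get("PORT", "8888"))
    workers = int(environ.get("WORKERS", "1"))
    if workers < 1:
        raise ValueError("WORKERS must be at least 1")
    return port, workers


if __name__ == "__main__":
    print(server_settings())
```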

Set up config files

Copy app.yaml.example to app.yaml.
Adjust instance size / workers / scaling to taste. For instance, if you have a busy environment and don't mind a few hundred dollars a month in costs, set:
```
env_variables:
  WORKERS: 4
instance_class: F4_1G

automatic_scaling:
  min_idle_instances: 1
```
Copy supervisord.conf.example to supervisord.conf.
Update the flags in supervisord.conf to match your flagfile.

Cookie Secret

The app can use a cookie secret for XSRF protection. Since checking secrets in to Git is a bad idea, we use Secret Manager instead.

You'll need to set this up on first use.

Generate a 32-byte symmetric key:

$ head -c 32 /dev/urandom | base64
BNUV6qSX0YOjatf4kfYBHUKVlD3kw+89hLia5M1Pduw=
$

and store it in Secret Manager.

Grant the app service account access to the secret and its versions (see IAM Roles, below).
Set --cookie_secret_id in supervisord.conf to match.
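
If you would rather generate the key from Python than from the shell, the following sketch produces an equivalent value: 32 random bytes from the operating system, base64-encoded. The function name is hypothetical:

```python
import base64
import secrets


def generate_cookie_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of OS randomness, base64-encoded as ASCII text."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Prints a fresh key each run; store it in Secret Manager, not in Git.
    print(generate_cookie_secret())
```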

IAM Roles

The service account for the app needs project-level IAM roles:

roles/documentai.apiUser, Document AI > Cloud DocumentAI API User
roles/logging.logWriter, Logging > Logs Writer

and needs access to its cookie secret, granted with:

roles/secretmanager.secretAccessor, Secret Manager Secret Accessor
roles/secretmanager.viewer, Secret Manager Viewer
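
One way to apply these grants is with `gcloud`. The sketch below only builds and prints the commands so you can review them before running anything. The function name is hypothetical, and the service account shown in the usage example assumes the App Engine default service account (`PROJECT_ID@appspot.gserviceaccount.com`); substitute your own if the app runs as a different account:

```python
import shlex

# Roles listed above: two at project level, two on the cookie secret.
PROJECT_ROLES = ["roles/documentai.apiUser", "roles/logging.logWriter"]
SECRET_ROLES = ["roles/secretmanager.secretAccessor", "roles/secretmanager.viewer"]


def iam_commands(project_id, secret_id, service_account):
    """Return gcloud command lines that grant the roles above."""
    member = f"serviceAccount:{service_account}"
    commands = []
    for role in PROJECT_ROLES:
        commands.append([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}", f"--role={role}",
        ])
    for role in SECRET_ROLES:
        commands.append([
            "gcloud", "secrets", "add-iam-policy-binding", secret_id,
            f"--project={project_id}", f"--member={member}", f"--role={role}",
        ])
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # Placeholder identifiers; replace with your own before use.
    for line in iam_commands(
        "your-project-id",
        "your-cookie-secret-id",
        "your-project-id@appspot.gserviceaccount.com",
    ):
        print(line)
```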

Deploy

Run `pdf_sprinkles$ gcloud app deploy`.

License

pdf_sprinkles is licensed under the Apache License, Version 2.0.

Comments

pdf_info crashes when reading a PDF with non-integer mediaboxes
Steps to reproduce:

1. Visit home page.
2. Choose a PDF with non-integer mediaboxes, like: Valedictory Speech 2004-06-17.pdf
3. Click Submit

What should happen:

The PDF is recognized

What actually happens:

The app displays:

An error occurred

Stream is closed
opened by willangley 2
Add support for Cloud IAP.

Add support for Cloud Identity-Aware Proxy. Turning it on will make it safer to run PDF Sprinkles for yourself.

This isn't ready to merge quite yet; I need to document the --expected_audience flag used to configure this.

opened by willangley 1
Make PDF Sprinkles portable to Google's monorepo
Support hermetic Python interpreters

Relax sandboxing slightly to account for interpreter differences

Work with dependencies at monorepo's LTS versions
opened by willangley 0
Wrap code in a module

This is needed for Copybara, and is also a good idea in general. Note that this module is not yet ready for direct use; you should continue to use requirements.txt instead.

opened by willangley 0
Check in a version of libseccomp for aarch64.

This is needed when working in a Multipass VM on a macOS Apple Silicon machine. It's not a complete solution – pikepdf doesn't build binary wheels for linux_aarch64 either – but it's progress.

I've filed pikepdf/pikepdf#272 to request support upstream; will see how that goes before deciding if I want to check my own wheel in here.

opened by willangley 0
Handle fractional-sized pages

Handle fractional-sized pages "correctly" as floats.

Both ReportLab and img2pdf expect floats, rather than decimal.Decimal, so there's no value to passing Decimal out from pdf_info in practice. Fixes #1.
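
A minimal sketch of the conversion described above, assuming the mediabox arrives as four `decimal.Decimal` values in the usual `[x0, y0, x1, y1]` order. The `page_size` function is hypothetical and is not the actual `pdf_info` code:

```python
from decimal import Decimal


def page_size(mediabox):
    """Return (width, height) in points as floats from a 4-item mediabox."""
    x0, y0, x1, y1 = (float(value) for value in mediabox)
    return abs(x1 - x0), abs(y1 - y0)


if __name__ == "__main__":
    box = [Decimal("0"), Decimal("0"), Decimal("612.5"), Decimal("792.25")]
    print(page_size(box))
```

Converting once at the boundary means downstream callers receive plain floats and never have to handle `Decimal` themselves.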

opened by willangley 0