x-ray is a Python library for finding bad redactions in PDF documents.

Why?

At Free Law Project, we collect millions of PDFs. An ongoing problem is that people fail to properly redact things. Instead of doing it the right way, they just draw a black rectangle or a black highlight on top of black text and call it a day. Well, when that happens you just select the text under the rectangle, and you can read it again. Not great.

After witnessing this problem for years, we decided it would be good to figure out how common it is, so, with some help, we built this simple tool. You give the tool the path to a PDF. It tells you if it has worthless redactions in it.

What next?

Right now, x-ray works pretty well and we are using it to analyze documents in our collections. It could be better though. Bad redactions take many forms. See the issues tab for other examples we don't yet support. We'd love your help solving some of the tougher cases.

Installation

With poetry, do:

poetry add x-ray

With pip, that'd be:

pip install x-ray

Usage

You can easily use this on the command line. Once installed, just:

% python -m xray path/to/your/file.pdf
{
  "1": [
    {
      "bbox": [
        58.550079345703125,
        72.19873046875,
        75.65007781982422,
        739.3987426757812
      ],
      "text": "The Ring travels by way of Cirith Ungol"
    }
  ]
}

That'll give you JSON, so you can use it with tools like jq. The format is as follows:

  • It's a dict.
  • The keys are page numbers.
  • Each page number maps to a list of dicts.
  • Each of those dicts has two keys.
  • The first key is bbox. This is a four-tuple that gives the x,y positions of the upper-left and then the lower-right corners of the bad redaction.
  • The second key is text. This is the text under the bad rectangle.

Simple enough.
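
As a rough illustration of that format, the sketch below parses JSON shaped like the command-line output and flattens it into one row per bad redaction. The function name flatten_redactions is made up for this example and is not part of x-ray; it assumes only the structure described above, and the sample values are shortened from the output shown earlier.

```python
import json


def flatten_redactions(json_text):
    """Turn x-ray style JSON into a list of (page, bbox, text) tuples.

    Hypothetical helper, not part of x-ray. It relies only on the
    documented format: {page: [{"bbox": [...], "text": "..."}]}.
    """
    results = json.loads(json_text)
    rows = []
    # JSON object keys are always strings, so compare pages as ints.
    for page in sorted(results, key=int):
        for redaction in results[page]:
            x0, y0, x1, y1 = redaction["bbox"]
            rows.append((int(page), (x0, y0, x1, y1), redaction["text"]))
    return rows


# Sample input in the documented shape (coordinates rounded for brevity).
sample = """
{
  "1": [
    {
      "bbox": [58.55, 72.2, 75.65, 739.4],
      "text": "The Ring travels by way of Cirith Ungol"
    }
  ]
}
"""

for page, bbox, text in flatten_redactions(sample):
    print(f"page {page}: {text!r} at {bbox}")
```

A flat list like this is easier to write to CSV or to count than the nested dict, which is the only reason for the reshaping here.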

If you want a bit more, you can use x-ray in Python:

from pprint import pprint
import xray
bad_redactions = xray.inspect("some/path/to/your/file.pdf")
pprint(bad_redactions)
{1: [{'bbox': (58.550079345703125,
               72.19873046875,
               75.65007781982422,
               739.3987426757812),
      'text': 'Aragorn is the one true king.'}]}

The output has the same structure as above, except it's a Python object, not a JSON object, so the page numbers are ints and each bbox is a tuple.

If you already have the file contents as a bytes object, that'll work too:

import requests
some_bytes = requests.get("https://lotr-secrets.com/some-doc.pdf").content
bad_redactions = xray.inspect(some_bytes)

Note that because inspect takes the same single argument in both cases, the type of the object you give it is essential: a str is treated as a file path, and bytes are treated as file contents. So if you do this, xray will assume your file name (provided as bytes) is file contents and it won't work:

xray.inspect(b"some-file-path.pdf")
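
One way to catch that mistake early is to check the argument before handing it to xray.inspect. The sketch below is not part of x-ray: the helper name check_inspect_argument is hypothetical, and it assumes that real PDF contents carry the "%PDF-" marker near the start of the data, which is true of well-formed PDF files.

```python
def check_inspect_argument(pdf):
    """Return pdf unchanged, or raise if bytes look like a file name.

    Hypothetical guard, not part of x-ray. It assumes well-formed PDF
    contents have the "%PDF-" marker within the first 1024 bytes.
    """
    if isinstance(pdf, str):
        return pdf  # a str is treated as a file path
    if isinstance(pdf, bytes):
        if b"%PDF-" in pdf[:1024]:
            return pdf  # looks like file contents
        raise ValueError(
            "bytes do not look like PDF contents; pass file names as str"
        )
    raise TypeError(f"expected str or bytes, got {type(pdf).__name__}")


# A path given as str and real-looking contents both pass through.
check_inspect_argument("some/path/to/your/file.pdf")
check_inspect_argument(b"%PDF-1.7\n...")

# A file name given as bytes is rejected before it reaches xray.
try:
    check_inspect_argument(b"some-file-path.pdf")
except ValueError as exc:
    print(exc)
```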

That's pretty much it. There are no configuration files or other variables to learn. You give it a file name. If there is a bad redaction in it, you'll soon find out.

How it works

Under the covers, x-ray uses the high-performance PyMuPDF project to parse PDFs.

You can read the source to see how it works, but the general idea is to:

  1. Find rectangles in the PDF.

  2. Find letters that are under those rectangles.

Things get tricky in a couple places:

  • letters without ascenders have bounding boxes that are taller than they seem, so they might not be entirely under the rectangle
  • drawings in PDFs can contain multiple rectangles
  • text under redactions can be on purpose (like if it says "XXX" or "privileged", etc)

And so forth. We do our best.
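
The two steps, and the first tricky case, can be sketched in plain Python. This is a simplified illustration and not x-ray's actual code: it skips PDF parsing entirely, the function names contains and letters_under are invented for the example, and the rectangles and letter boxes are made-up data. Boxes use the same (x0, y0, x1, y1) layout as the bbox values in the output.

```python
def contains(outer, inner, tolerance=0.0):
    """True if box `inner` lies inside box `outer`, give or take `tolerance`.

    Boxes are (x0, y0, x1, y1): upper-left corner, then lower-right.
    """
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return (
        ix0 >= ox0 - tolerance
        and iy0 >= oy0 - tolerance
        and ix1 <= ox1 + tolerance
        and iy1 <= oy1 + tolerance
    )


def letters_under(rectangles, letters, tolerance=0.0):
    """Map each rectangle to the text formed by the letters it covers.

    `letters` is a list of (char, bbox) pairs in reading order.
    Rectangles that cover no letters are left out of the result.
    """
    found = {}
    for rect in rectangles:
        covered = "".join(
            char for char, bbox in letters if contains(rect, bbox, tolerance)
        )
        if covered:
            found[rect] = covered
    return found


# Made-up page: one box drawn over the word "king" but not over "The".
rectangles = [(50.0, 10.0, 90.0, 24.0)]
letters = [
    ("T", (10.0, 12.0, 18.0, 22.0)),
    ("h", (18.0, 12.0, 26.0, 22.0)),
    ("e", (26.0, 12.0, 34.0, 22.0)),
    ("k", (52.0, 12.0, 60.0, 22.0)),
    ("i", (60.0, 12.0, 64.0, 22.0)),
    ("n", (64.0, 12.0, 72.0, 22.0)),
    ("g", (72.0, 12.0, 80.0, 25.0)),  # box extends 1pt below the rectangle
]

print(letters_under(rectangles, letters))                 # "kin": the g is missed
print(letters_under(rectangles, letters, tolerance=1.5))  # "king"
```

The strict check drops the "g" because its box pokes out below the rectangle, which is the kind of edge case the list above describes; a small tolerance recovers it at the cost of possibly catching neighboring text.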

Contributions

Please see the issues list for things we need, or start a conversation if you have questions. Before you make your first contribution, we'll need a signed contributor license agreement. See the template in the repo.

Deployment

Releases happen automatically via GitHub Actions on any commit that is tagged with something like "v0.0.0".

If you wish to create a new version manually, the process is:

  1. Update version info in pyproject.toml

  2. Configure your PyPI credentials with Poetry

  3. Build and publish the version:

poetry publish --build

License

This repository is available under the permissive BSD license, making it easy and safe to incorporate in your own libraries.

Pull and feature requests welcome. Online editing in GitHub is possible (and easy!).

Comments
  • Consider handling bad rectangles

    This is easily the most common error, and probably an easy one to find. Just look for black rectangle objects on top of text. As a first pass, you could probably just look for black rectangles, but I've heard rumors that's how underlines are stored in PDFs. Here's a good example:

    https://www.courtlistener.com/recap/gov.uscourts.dcd.190597/gov.uscourts.dcd.190597.471.0_6.pdf

    opened by mlissner
  • build(deps-dev): Bump pre-commit from 2.19.0 to 2.20.0

    build(deps-dev): Bump pre-commit from 2.19.0 to 2.20.0

    Bumps pre-commit from 2.19.0 to 2.20.0.

    Release notes

    Sourced from pre-commit's releases.

    pre-commit v2.20.0

    Features

    • Expose source and object-name (positional args) of prepare-commit-msg hook as PRE_COMMIT_COMIT_MSG_SOURCE and PRE_COMMIT_COMMIT_OBJECT_NAME.

    Fixes

    Changelog

    Sourced from pre-commit's changelog.

    2.20.0 - 2022-07-10

    Features

    • Expose source and object-name (positional args) of prepare-commit-msg hook as PRE_COMMIT_COMIT_MSG_SOURCE and PRE_COMMIT_COMMIT_OBJECT_NAME.

    Fixes

    Commits
    • 78a2d86 v2.20.0
    • e3dc5b7 Merge pull request #2454 from pre-commit/asottile-patch-1
    • ebce88c remove warnings checks
    • d6cc8a1 Merge pull request #2453 from hroncok/python3.11
    • 901e831 Tests: Adjust traceback regexes to allow Python 3.11+ ^^^^^^^
    • 98bb7e6 Merge pull request #2440 from pre-commit/pre-commit-ci-update-config
    • 706d1e9 Merge pull request #2439 from pre-commit/all-repos_autofix_type-checking
    • 3ebd101 [pre-commit.ci] pre-commit autoupdate
    • d8b5930 remove imports from TYPE_CHECKING (py37+)
    • 170335c Merge pull request #2429 from pre-commit/remove-config-option-when-unused
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • build(deps): Bump types-requests from 2.27.5 to 2.28.0

    build(deps): Bump types-requests from 2.27.5 to 2.28.0

    ⚠️ Dependabot is rebasing this PR ⚠️

    Rebasing might not happen immediately, so don't worry if this takes some time.

    Note: if you make any changes to this PR yourself, they will take precedence over the rebase.


    Bumps types-requests from 2.27.5 to 2.28.0.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once it's up-to-date and CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • build(deps-dev): Bump ipython from 8.1.1 to 8.4.0

    build(deps-dev): Bump ipython from 8.1.1 to 8.4.0

    Bumps ipython from 8.1.1 to 8.4.0.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • build(deps-dev): Bump black from 22.1.0 to 22.3.0

    build(deps-dev): Bump black from 22.1.0 to 22.3.0

    Bumps black from 22.1.0 to 22.3.0.

    Release notes

    Sourced from black's releases.

    22.3.0

    Preview style

    • Code cell separators #%% are now standardised to # %% (#2919)
    • Remove unnecessary parentheses from except statements (#2939)
    • Remove unnecessary parentheses from tuple unpacking in for loops (#2945)
    • Avoid magic-trailing-comma in single-element subscripts (#2942)

    Configuration

    • Do not format __pypackages__ directories by default (#2836)
    • Add support for specifying stable version with --required-version (#2832).
    • Avoid crashing when the user has no homedir (#2814)
    • Avoid crashing when md5 is not available (#2905)
    • Fix handling of directory junctions on Windows (#2904)

    Documentation

    • Update pylint config documentation (#2931)

    Integrations

    • Move test to disable plugin in Vim/Neovim, which speeds up loading (#2896)

    Output

    • In verbose, mode, log when Black is using user-level config (#2861)

    Packaging

    • Fix Black to work with Click 8.1.0 (#2966)
    • On Python 3.11 and newer, use the standard library's tomllib instead of tomli (#2903)
    • black-primer, the deprecated internal devtool, has been removed and copied to a separate repository (#2924)

    Parser

    • Black can now parse starred expressions in the target of for and async for statements, e.g for item in *items_1, *items_2: pass (#2879).
    Changelog

    Sourced from black's changelog.

    22.3.0

    Preview style

    • Code cell separators #%% are now standardised to # %% (#2919)
    • Remove unnecessary parentheses from except statements (#2939)
    • Remove unnecessary parentheses from tuple unpacking in for loops (#2945)
    • Avoid magic-trailing-comma in single-element subscripts (#2942)

    Configuration

    • Do not format __pypackages__ directories by default (#2836)
    • Add support for specifying stable version with --required-version (#2832).
    • Avoid crashing when the user has no homedir (#2814)
    • Avoid crashing when md5 is not available (#2905)
    • Fix handling of directory junctions on Windows (#2904)

    Documentation

    • Update pylint config documentation (#2931)

    Integrations

    • Move test to disable plugin in Vim/Neovim, which speeds up loading (#2896)

    Output

    • In verbose, mode, log when Black is using user-level config (#2861)

    Packaging

    • Fix Black to work with Click 8.1.0 (#2966)
    • On Python 3.11 and newer, use the standard library's tomllib instead of tomli (#2903)
    • black-primer, the deprecated internal devtool, has been removed and copied to a separate repository (#2924)

    Parser

    • Black can now parse starred expressions in the target of for and async for statements, e.g for item in *items_1, *items_2: pass (#2879).
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • build(deps): Bump requests from 2.26.0 to 2.27.1

    build(deps): Bump requests from 2.26.0 to 2.27.1

    Bumps requests from 2.26.0 to 2.27.1.

    Changelog

    Sourced from requests's changelog.

    2.27.1 (2022-01-05)

    Bugfixes

    • Fixed parsing issue that resulted in the auth component being dropped from proxy URLs. (#6028)

    2.27.0 (2022-01-03)

    Improvements

    • Officially added support for Python 3.10. (#5928)

    • Added a requests.exceptions.JSONDecodeError to unify JSON exceptions between Python 2 and 3. This gets raised in the response.json() method, and is backwards compatible as it inherits from previously thrown exceptions. Can be caught from requests.exceptions.RequestException as well. (#5856)

    • Improved error text for misnamed InvalidSchema and MissingSchema exceptions. This is a temporary fix until exceptions can be renamed (Schema->Scheme). (#6017)

    • Improved proxy parsing for proxy URLs missing a scheme. This will address recent changes to urlparse in Python 3.9+. (#5917)

    Bugfixes

    • Fixed defect in extract_zipped_paths which could result in an infinite loop for some paths. (#5851)

    • Fixed handling for AttributeError when calculating length of files obtained by Tarfile.extractfile(). (#5239)

    • Fixed urllib3 exception leak, wrapping urllib3.exceptions.InvalidHeader with requests.exceptions.InvalidHeader. (#5914)

    • Fixed bug where two Host headers were sent for chunked requests. (#5391)

    • Fixed regression in Requests 2.26.0 where Proxy-Authorization was incorrectly stripped from all requests sent with Session.send. (#5924)

    • Fixed performance regression in 2.26.0 for hosts with a large number of proxies available in the environment. (#5924)

    • Fixed idna exception leak, wrapping UnicodeError with requests.exceptions.InvalidURL for URLs with a leading dot (.) in the domain. (#5414)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • build(deps-dev): Bump ipython from 7.27.0 to 7.28.0

    build(deps-dev): Bump ipython from 7.27.0 to 7.28.0

    ⚠️ Dependabot is rebasing this PR ⚠️

    Rebasing might not happen immediately, so don't worry if this takes some time.

    Note: if you make any changes to this PR yourself, they will take precedence over the rebase.


    Bumps ipython from 7.27.0 to 7.28.0.

    Commits
    • e76fa00 release 7.28.0
    • bdf3df8 Merge pull request #13159 from meeseeksmachine/auto-backport-of-pr-13158-on-7.x
    • a486c4b Backport PR #13158: What's new 7.28
    • 6c9f7cd Merge pull request #13121 from meeseeksmachine/auto-backport-of-pr-13091-on-7.x
    • 6b007c1 Merge pull request #13154 from meeseeksmachine/auto-backport-of-pr-13153-on-7.x
    • b8c3dd9 Backport PR #13153: Adapt to all sorts of drive names for cygwin
    • e60034b Merge pull request #13150 from meeseeksmachine/auto-backport-of-pr-13140-on-7.x
    • b9ca351 Backport PR #13140: Use pathlib parent relationships to compare virtualenv di...
    • fca5d2c Merge pull request #13149 from meeseeksmachine/auto-backport-of-pr-13094-on-7.x
    • 4ef07a8 Backport PR #13094: Fix virtual environment user warning for lower case pathes
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 3
  • build(deps-dev): Bump wheel from 0.38.2 to 0.38.3

    build(deps-dev): Bump wheel from 0.38.2 to 0.38.3

    Bumps wheel from 0.38.2 to 0.38.3.

    Changelog

    Sourced from wheel's changelog.

    Release Notes

    0.38.3 (2022-11-08)

    • Fixed install failure when used with --no-binary, reported on Ubuntu 20.04, by removing setup_requires from setup.cfg

    0.38.2 (2022-11-05)

    • Fixed regression introduced in v0.38.1 which broke parsing of wheel file names with multiple platform tags

    0.38.1 (2022-11-04)

    • Removed install dependency on setuptools
    • The future-proof fix in 0.36.0 for converting PyPy's SOABI into a abi tag was faulty. Fixed so that future changes in the SOABI will not change the tag.

    0.38.0 (2022-10-21)

    • Dropped support for Python < 3.7
    • Updated vendored packaging to 21.3
    • Replaced all uses of distutils with setuptools
    • The handling of license_files (including glob patterns and default values) is now delegated to setuptools>=57.0.0 (#466). The package dependencies were updated to reflect this change.
    • Fixed potential DoS attack via the WHEEL_INFO_RE regular expression
    • Fixed ValueError: ZIP does not support timestamps before 1980 when using SOURCE_DATE_EPOCH=0 or when on-disk timestamps are earlier than 1980-01-01. Such timestamps are now changed to the minimum value before packaging.

    0.37.1 (2021-12-22)

    • Fixed wheel pack duplicating the WHEEL contents when the build number has changed (#415)
    • Fixed parsing of file names containing commas in RECORD (PR by Hood Chatham)

    0.37.0 (2021-08-09)

    • Added official Python 3.10 support
    • Updated vendored packaging library to v20.9

    0.36.2 (2020-12-13)

    • Updated vendored packaging library to v20.8
    • Fixed wheel sdist missing LICENSE.txt
    • Don't use default macos/arm64 deployment target in calculating the platform tag for fat binaries (PR by Ronald Oussoren)

    0.36.1 (2020-12-04)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 2
  • build(deps-dev): Bump ipython from 8.5.0 to 8.6.0

    build(deps-dev): Bump ipython from 8.5.0 to 8.6.0

    Bumps ipython from 8.5.0 to 8.6.0.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 2
  • build(deps): Bump pymupdf from 1.20.1 to 1.20.2

    build(deps): Bump pymupdf from 1.20.1 to 1.20.2

    Bumps pymupdf from 1.20.1 to 1.20.2.

    Release notes

    Sourced from pymupdf's releases.

    PyMuPDF-1.20.2

    • Built with MuPDF-1.20.3.
    • Fix #1787.
    • Fix #1824.
    • Improvements to documentation:
      • Moved old docs/faq.rst into separate docs/recipes-* files.
      • Improved information about building from source in docs/installation.rst.
      • Clarified memory allocation setting JM_MEMORY in docs/tools.rst.
      • Fixed link to PDF Reference manual in docs/app3.rst.
      • Fixed building of html documentation on OpenBSD.

    Wheels for Windows, Linux and MacOS, and the sdist, are available on pypi.org and can be installed in the usual way, for example:

    pip install --upgrade pymupdf
    
    Changelog

    Sourced from pymupdf's changelog.

    Change Log

    Changes in Version 1.20.2

    • This release uses MuPDF-1.20.3.

    • Fixed [#1787](https://github.com/pymupdf/pymupdf/issues/1787) <https://github.com/pymupdf/PyMuPDF/issues/1787>_. Fix linking issues on Unix systems.

    • Fixed [#1824](https://github.com/pymupdf/pymupdf/issues/1824) <https://github.com/pymupdf/PyMuPDF/issues/1824>_. SegFault when applying redactions overlapping a transparent image. (Fixed in MuPDF-1.20.3.)

    • Improvements to documentation:

      • Improved information about building from source in docs/installation.rst.
      • Clarified memory allocation setting JM_MEMORY` in docs/tools.rst``.
      • Fixed link to PDF Reference manual in docs/app3.rst.
      • Fixed building of html documentation on OpenBSD.
      • Moved old docs/faq.rst into separate docs/recipes-* files.
    • Removed some unused files and directories:

      • installation/
      • docs/wheelnames.txt

    Changes in Version 1.20.1

    • Fixed [#1724](https://github.com/pymupdf/pymupdf/issues/1724) <https://github.com/pymupdf/PyMuPDF/issues/1724>_. Fix for building on FreeBSD.

    • Fixed [#1771](https://github.com/pymupdf/pymupdf/issues/1771) <https://github.com/pymupdf/PyMuPDF/issues/1771>_. linkDest() had a broken call to re.match(), introduced in 1.20.0.

    • Fixed [#1751](https://github.com/pymupdf/pymupdf/issues/1751) <https://github.com/pymupdf/PyMuPDF/issues/1751>_. get_drawings() and get_cdrawings() previously always returned with closePath=False.

    • Fixed [#1645](https://github.com/pymupdf/pymupdf/issues/1645) <https://github.com/pymupdf/PyMuPDF/issues/1645>_. Default FreeText annotation text color is now black.

    • Improvements to sphinx-generated documentation:

      • Use readthedocs theme with enhancements.
      • Renamed the .txt files to have .rst suffixes.

    ... (truncated)

    Commits
    • 309223a Update changelogs, version strings and release date for release of 1.20.2.
    • 6efca61 setup.py: build with mupdf-1.20.3.
    • 8db9e71 tests/: added test_1824() for #1824.
    • 721e944 docs/recipes-drawing-and-graphics.rst: fixed external link to shapes_and_symb...
    • 28d0b26 docs/app3.rst: fixed url for PDF Reference manual PDF32000_2008.pdf
    • 4a2d774 .github/workflows/build_wheels.yml: don't ignore pytest failures.
    • 6f324f4 .github/workflows/build_wheels.yml: allow control of what to build.
    • 494c43f docs/pixmap.rst: fixed a typo.
    • 706ef94 Clarify memory allocation setting JM_MEMORY.
    • e0028a9 docs/installation.rst: improved info about building from source.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

    Dependabot will merge this PR once CI passes on it, as requested by @mlissner.


    dependencies 
    opened by dependabot[bot] 2
  • build(deps): Bump types-requests from 2.28.3 to 2.28.7

    Bumps types-requests from 2.28.3 to 2.28.7.

    dependencies 
    opened by dependabot[bot] 2
  • Add more examples

    I'm not working on x-ray at the moment, but examples keep coming my way. Next time I work on this, I'll want to consider the attached items to see if they reveal anything useful.

    gov.uscourts.innd.96620.141.0_3.pdf

    opened by mlissner 0
  • Absent a rectangle, white text on the default background is not caught as a bad redaction

    If you have white text on a white rectangle, we'll catch that because of the rectangle.

    If you just have white text, it'll be invisible to the human eye, but we won't catch that because it'll be on the white background of the standard page.

    Interesting corner case.
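
One way to approach this corner case, as a rough sketch rather than how x-ray works today, is to skip the rectangle check and look directly at each text span's fill color. The function below assumes its input has the shape of the dictionary that PyMuPDF returns from `page.get_text("dict")` (blocks, then lines, then spans, with each span carrying an integer sRGB `color`). The name `find_background_colored_text` and the assumption that the page background is plain white are hypothetical.

```python
WHITE = 0xFFFFFF  # sRGB integer for white; assumed to be the page background


def find_background_colored_text(page_dict, background=WHITE):
    """Return spans whose text color equals the assumed background color.

    `page_dict` is expected to look like PyMuPDF's page.get_text("dict")
    output. The return value mirrors x-ray's per-page format: a list of
    dicts with "bbox" and "text" keys.
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "")
                # Ignore whitespace-only spans; they hide nothing.
                if span.get("color") == background and text.strip():
                    hits.append({"bbox": tuple(span["bbox"]), "text": text})
    return hits
```

With PyMuPDF installed, this could be called once per page as `find_background_colored_text(page.get_text("dict"))`. It would also flag white text drawn on top of a dark image or shape, so the hits would need a follow-up check before counting as bad redactions.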

    enhancement 
    opened by mlissner 0
  • Use X-Ray to analyze complete RECAP corpus

    So far, I've run an AWS batch job on this. It ran overnight and produced about a gig of results for the entire RECAP corpus. Some of the results are timeouts that we could do a better job of catching and retrying, and many are false positives, but there are true positives too.

    I'll use this issue as a meta-issue to group together things that need fixing.

    opened by mlissner 1
  • Ignore uniform dates under redactions

    It seems to be common to put dates under the redaction boxes, as you can see in the highlighted screenshot below:

    Screenshot from 2021-09-11 11-21-13

    Note that the date isn't actually relevant semantically to the sentence. Looking throughout the redactions of this document:

    {2: [{'bbox': (390.3498229980469,
                   536.0278930664062,
                   415.180419921875,
                   552.8250122070312),
          'text': '03/23/2019'}],
     20: [{'bbox': (434.0060119628906,
                    293.506103515625,
                    446.1649169921875,
                    307.0159912109375),
           'text': '03/23/2'}],
     29: [{'bbox': (197.58200073242188,
                    75.3205795288086,
                    224.60189819335938,
                    89.5059814453125),
           'text': '03/23/2019'},
          {'bbox': (232.70700073242188,
                    75.31907653808594,
                    269.1838073730469,
                    88.8289794921875),
           'text': '03/23/2019'},
          {'bbox': (278.6400146484375,
                    75.99359130859375,
                    319.1697998046875,
                    87.47698974609375),
           'text': '03/23/2019'},
          {'bbox': (348.2170104980469,
                    75.3205795288086,
                    421.17059326171875,
                    89.5059814453125),
           'text': '03/23/2019'},
    

    You can see a pattern: the text is always the same date. When this is the case, we should nuke all such redactions from our list as false positives.
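
A minimal sketch of that filter, assuming the results are in the dict format x-ray returns (page numbers mapping to lists of `bbox`/`text` dicts). The function name `drop_uniform_dates`, the `MM/DD/YYYY` pattern, and the `min_count` and six-character fragment thresholds are all assumptions made for illustration, not part of x-ray.

```python
import re
from collections import Counter

# Assumed date shape, based on the "03/23/2019" values in the output above.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def drop_uniform_dates(redactions, min_count=3):
    """Return `redactions` without hits that all repeat one date.

    If the same full date appears at least `min_count` times, every hit
    whose text is that date (or a clipped prefix of it, like "03/23/2")
    is treated as a false positive and dropped.
    """
    texts = [hit["text"].strip() for hits in redactions.values() for hit in hits]
    dates = Counter(t for t in texts if DATE_RE.fullmatch(t))
    if not dates:
        return redactions
    common, count = dates.most_common(1)[0]
    if count < min_count:
        return redactions

    def is_uniform_date(text):
        text = text.strip()
        # Require at least "MM/DD/" so short numeric text is not dropped.
        return len(text) >= 6 and common.startswith(text)

    cleaned = {}
    for page, hits in redactions.items():
        kept = [hit for hit in hits if not is_uniform_date(hit["text"])]
        if kept:  # Pages left with no hits are omitted entirely.
            cleaned[page] = kept
    return cleaned
```

Applied to output like the example above, this would remove the `03/23/2019` hits and the clipped `03/23/2` fragment while leaving any other redacted text in place. Only the single most common date is considered, so a document with several repeated dates would need a looser rule.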

    gov.uscourts.cacd.45170.569.9_2.pdf

    examples-needed 
    opened by mlissner 1
  • crappy Sharpie redactions

    Another category of crappy redactions (that I didn't see noted elsewhere) is see-thru Sharpie, e.g. https://storage.courtlistener.com/recap/gov.uscourts.txnd.338502/gov.uscourts.txnd.338502.242.1_1.pdf where maybe 33% of the redactions are see-thru.

    opened by jeremybmerrill 1
  • Do something if the index might give away the game

    Slate has a great article about how they used the index of a document to figure out a bunch of redactions. Basically, a word would be redacted in some places, but not others, so you could look up the word in the index and figure it out:

    https://slate.com/news-and-politics/2020/10/ghislaine-maxwell-deposition-redactions-epstein-how-to-crack.html

    I suppose this is beyond what computers can do, BUT it'd be nice if we could flag when a document has an index that could be used for this purpose.

    opened by mlissner 3
Owner
Free Law Project
We provide free access to primary legal materials, develop legal research tools, and support academic research on legal corpora.