PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

Matthew Stamy

Last update: Jan 4, 2023

Related tags

PDF Files Processing PyPDF2

Overview

PyPDF2

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

Homepage
http://mstamy2.github.io/PyPDF2/

Examples

Please see the Sample_Code folder.
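
As a rough illustration of the merging use case described in the overview, the sketch below concatenates several PDF files into one. It assumes the package is installed under its newer name, pypdf (with PyPDF2 2.x the same class names would be imported from PyPDF2 instead). The function name and the file names in the usage comment are hypothetical.

```python
def merge_pdfs(paths, output_path):
    """Append every page of each input PDF to a single output file."""
    # Imported lazily so that this module can be loaded even when the
    # library is not installed (assumption: the package name is pypdf).
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)

    with open(output_path, "wb") as fh:
        writer.write(fh)


# Hypothetical usage:
# merge_pdfs(["part1.pdf", "part2.pdf"], "combined.pdf")
```

Splitting works the same way in reverse: create one writer per page (or page range) and add only the pages that belong in that output file.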

Documentation

Documentation is available at
https://pythonhosted.org/PyPDF2/

FAQ

Please see
http://mstamy2.github.io/PyPDF2/FAQ.html

Tests

PyPDF2 includes a test suite built on the unittest framework. All tests are located in the "Tests" folder. Tests can be run from the command line by:

python -m unittest Tests.tests

Comments

PyCryptodome is required for some PDFs, but is not installed automatically as a dependency

When pycryptodome is not installed, pypdf fails to read some PDFs, and gives this error:

pypdf.errors.DependencyError: PyCryptodome is required for AES algorithm

Because I wasn't familiar with pycryptodome, I wasn't sure what I needed to do to get it working. Eventually I figured out that pycryptodome was a Python library, and all I had to do was run pip3 install pycryptodome to fix the error.

If possible, it would be nice if pypdf could either 1) install pycryptodome as a dependency as part of the installation process for pypdf, or 2) provide more information in the error, letting the user know that pycryptodome is a Python library that can be installed via pip.
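
A minimal sketch of the second suggestion, assuming nothing about how pypdf itself builds its error messages: the helper below checks whether a module can be imported and, if not, returns an install hint. The helper name is hypothetical and is not part of pypdf; the only library fact relied on is that pycryptodome is imported under the module name Crypto.

```python
import importlib.util


def missing_dependency_hint(module_name="Crypto", package_name="pycryptodome"):
    """Return an install hint if `module_name` cannot be imported, else None."""
    if importlib.util.find_spec(module_name) is not None:
        return None
    return (
        f"The Python package '{package_name}' is not installed. "
        f"Install it with: pip install {package_name}"
    )


# Check before opening an AES-encrypted PDF.
hint = missing_dependency_hint()
if hint:
    print(hint)
```

Calling the helper before constructing a PdfReader lets a script fail early with an actionable message instead of a DependencyError deep inside the decryption code.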

Environment

Which environment were you using when you encountered the problem?

$ python3 -m platform
macOS-13.1-arm64-arm-64bit

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

Install pypdf (pip3 install pypdf).
Make sure pycryptodome is not installed (pip3 uninstall pycryptodome).
Run the following Python script:

from pypdf import PdfReader
from urllib.request import urlopen
from io import BytesIO

# Get the PDF and convert it into a byte stream
pdf_url = 'https://web.archive.org/web/30000101000000if_/http://www.latterdaytruth.org/pdf/100846.pdf'
pdf_file = urlopen(pdf_url).read()
pdf_bytes_stream = BytesIO(pdf_file)

# Load the file with pypdf
pdf_reader = PdfReader(pdf_bytes_stream)

# Print the number of pages
pages_count = len(pdf_reader.pages)
print('Number of pages: {0}'.format(pages_count))

This is the PDF I'm attempting to read: https://web.archive.org/web/30000101000000if_/http://www.latterdaytruth.org/pdf/100846.pdf

Traceback

Traceback (most recent call last):
  File "/Users/sbradshaw/Desktop/test-pypdf-pages.py", line 14, in <module>
    pages_count = len(pdf_reader.pages)
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_page.py", line 2063, in __len__
    return self.length_function()
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 445, in _get_num_pages
    return self.trailer[TK.ROOT]["/Pages"]["/Count"]  # type: ignore
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/generic/_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 1205, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 1136, in _get_object_from_stream
    obj_stm: EncodedStreamObject = IndirectObject(stmnum, 0, self).get_object()  # type: ignore
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/generic/_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 1269, in get_object
    retval = self._encryption.decrypt_object(
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_encryption.py", line 761, in decrypt_object
    return cf.decrypt_object(obj)
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_encryption.py", line 185, in decrypt_object
    obj._data = self.stmCrypt.decrypt(obj._data)
  File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_encryption.py", line 147, in decrypt
    raise DependencyError("PyCryptodome is required for AES algorithm")
pypdf.errors.DependencyError: PyCryptodome is required for AES algorithm

opened by samuelbradshaw 3

PERF: Use __slots__

More information

An awesome explanation on StackOverflow

How much is it worth?

It's about a 10x improvement in performance

[Before and after benchmark screenshots]

Text extraction speed according to the benchmark is not affected at all :smiling_face_with_tear:
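
For readers unfamiliar with the feature, here is a generic illustration of what __slots__ does. It does not reproduce the benchmark above, and the class names are hypothetical, not classes from the library. A class that declares __slots__ stores its attributes in fixed slots instead of a per-instance dictionary, which reduces per-instance overhead.

```python
class PlainPoint:
    """Regular class: each instance carries its own __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlottedPoint:
    """Same data, but attributes live in fixed slots."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


plain = PlainPoint(1, 2)
slotted = SlottedPoint(1, 2)

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False

# Slotted instances reject attributes that were not declared.
try:
    slotted.z = 3
except AttributeError as exc:
    print("AttributeError:", exc)
```

The trade-off is flexibility: attributes cannot be added to slotted instances at runtime unless they are listed in __slots__.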

opened by MartinThoma 1
ROB: ignore_eof everywhere for read_until_regex
This was initially motivated by NumberObject.read_from_stream, which was calling read_until_regex with the default value of ignore_eof=False and thus raising exceptions like:

PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

https://github.com/py-pdf/PyPDF2/commit/431ba7092037af7d1c296f8f280aca167859ce61 demonstrates a similar fix for NameObject.read_from_stream.

From discussion in https://github.com/py-pdf/pypdf/pull/1505, it was realized that the change to NumberObject.read_from_stream had now made ALL callers of read_until_regex pass ignore_eof=True. It's cleaner to remove the parameter entirely and change the default behaviour.
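
The behaviour described above can be sketched as follows. This is a simplified stand-in written for illustration, not the library's implementation: the function name and chunk size are assumptions, and it only handles patterns that match a single byte (such as a delimiter character class). When the stream ends before the pattern matches, it returns what was read instead of raising.

```python
import re
from io import BytesIO


def read_until_regex_sketch(stream, regex, chunk_size=16):
    """Read bytes until `regex` matches; at end of stream, return what was read."""
    collected = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream: tolerate it rather than raising an error.
            return collected
        match = regex.search(chunk)
        if match is not None:
            collected += chunk[: match.start()]
            # Rewind so the matched delimiter is the next byte to be read.
            stream.seek(match.start() - len(chunk), 1)
            return collected
        collected += chunk


non_digit = re.compile(rb"[^0-9]")
print(read_until_regex_sketch(BytesIO(b"42 0 R"), non_digit))
print(read_until_regex_sketch(BytesIO(b"12345"), non_digit))
```

The second call is the case that motivated the change: a number that is the last thing in the stream is returned as-is rather than causing a "Stream has ended unexpectedly" error.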
opened by rraval 1

Extracting text doesn't work for cropped boxes

I am trying to read a few boxes from a nested PDF. My approach is to crop just these three boxes, save them to a temporary file, and read that file. But when I try to read the boxes, I get either nothing or the text of the whole PDF (as if I hadn't cropped it). I can see from the temp file that the cropping itself worked.

Environment

Python (miniconda) 3.9.15, PyPDF2 2.11.1

Code + PDF

file_path = "example.pdf"
temp_path = "temp.pdf"

from PyPDF2 import PdfReader, PdfWriter
import os
from copy import copy

def crop(PAGE, LEFT, TOP, RIGHT, BOTTOM):
    # PyPDF2 coordinates start from the bottom left
    page_x, page_y = PAGE.cropbox.upper_left

    # convert pyPDF2.FloatObjects into floats
    upper_left = [page_x.as_numeric(), page_y.as_numeric()]

    # find new margins
    new_upper_left  = (upper_left[0] + LEFT, upper_left[1] - TOP)
    new_lower_right = (upper_left[0] + RIGHT, upper_left[1] - BOTTOM)

    #crop
    PAGE.cropbox.upper_left = new_upper_left
    PAGE.cropbox.lower_right = new_lower_right

def read_pdf(FILE):
    input = PdfReader(FILE)
    output = []

    month_page_num = 0
    employee_page_num = 1
    salary_page_num = 2

    tot_pages = len(input.pages)
    #last_page = int(tot_pages/2) # do not mind about this
    last_page = 1
    print('Pages in PDF: ' + str(tot_pages))

    for page in range(last_page):
        temp = PdfWriter()

        # Remove temp file
        if os.path.exists(temp_path):
            os.remove(temp_path)

        print('Working on page: ' + str(page+1))

        # Month
        month_page = copy(input.pages[page])
        crop(month_page, 360, 60, 450, 75)
        temp.add_page(month_page)

        # Employee
        employee_page = copy(input.pages[page])
        crop(employee_page, 365, 108, 490, 124)
        temp.add_page(employee_page)

        # Salary
        salary_page = copy(input.pages[page])
        crop(salary_page, 378, 698, 448, 707)
        temp.add_page(salary_page)

        with open(temp_path, "wb") as pdf:
            temp.write(pdf)

        # extracting text from pages
        raw = PdfReader(temp_path)
        output.append({
            'month': raw.pages[month_page_num].extract_text(),
            'employee': raw.pages[employee_page_num].extract_text(),
            'salary': raw.pages[salary_page_num].extract_text()
        })

    return output

data = read_pdf(file_path)
print(data)

example.pdf temp.pdf
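
One likely explanation (an assumption, not something confirmed in this report) is that changing the cropbox only changes which region of the page is displayed. The page's content stream still contains all of the text, so text extraction still sees the whole page. A possible workaround is to filter text by position using the visitor functions that extract_text() accepts since PyPDF2 2.11.0. The sketch below builds such a visitor; the helper name and the coordinates are hypothetical, and it assumes the text matrix already gives positions in page coordinates, which may not hold for every PDF.

```python
def make_region_visitor(left, bottom, right, top):
    """Build a text visitor that keeps only text positioned inside a rectangle."""
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; its last two entries are the x/y position.
        x, y = tm[4], tm[5]
        if left <= x <= right and bottom <= y <= top:
            parts.append(text)

    return visitor, parts


# Stand-alone demonstration with made-up matrices:
visitor, parts = make_region_visitor(360, 700, 450, 760)
identity = [1, 0, 0, 1, 0, 0]
visitor("March", identity, [1, 0, 0, 1, 400, 720], None, 12)
visitor("elsewhere", identity, [1, 0, 0, 1, 10, 10], None, 12)
print("".join(parts))

# With a real page the visitor would be passed to the library, for example:
# page.extract_text(visitor_text=visitor)
```

After extract_text() returns, joining parts gives the text whose position fell inside the rectangle, without writing a temporary cropped file.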

workflow-text-extraction

opened by aster94 2

Random whitespaces are inserted when using page.extract_text()

I am trying to extract text from various PDF documents to use in an NLP project. When I use page.extract_text(), random whitespace appears inside the extracted words where the PDF document has no spaces.

Environment

Using VS code and running via command prompt.

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1

Code + PDF

This is a minimal, complete example that shows the issue:

test_doc.pdf (the PDF was generated using the default settings in Microsoft Word).

The code is:

import os

from PyPDF2 import PdfReader, __version__

pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))

print(f"PyPDF2=={__version__}")

text = ""
for page in pdf.pages:
    page_content = page.extract_text()
    text = text + page_content
print(text)

Output

PyPDF2==2.12.1
This is a test document by Ethan Nelson.  
 
Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for 
testing purposes : 341 Maple st Paytonville Maine 45681.  
Anyway, there are random whitespaces here .
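
As a stopgap for NLP preprocessing, some of the stray spaces in output like this can be cleaned up after extraction. The sketch below is a hypothetical post-processing helper, not a library feature. It only fixes spacing around punctuation and brackets; it cannot repair words that were split in the middle (such as "ph one"), which needs a fix in the extraction itself.

```python
import re


def tidy_extracted_text(text):
    """Normalize spacing around punctuation in extracted text."""
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\( ", "(", text)             # no space after "("
    text = re.sub(r" ([.,:;!?)])", r"\1", text)  # no space before punctuation
    return "\n".join(line.strip() for line in text.splitlines())


sample = "Tuesday was a good time to call ( 000) 000-0000 . Testing purposes : done ."
print(tidy_extracted_text(sample))
```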

workflow-text-extraction

opened by einelson 11

ROB: Ignore EOF in NumberObject.read_from_stream
Use ignore_eof=True just like NameObject does, which is the only other caller to read_until_regex in this module.

This helps prevent arcane exceptions when trying to parse a number:

PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

The motivation is essentially identical to the change that introduced ignore_eof=True on NameObjects: https://github.com/py-pdf/PyPDF2/commit/431ba7092037af7d1c296f8f280aca167859ce61
opened by rraval 4

Releases(3.2.0)

3.2.0(Dec 31, 2022)
What's Changed

Performance Improvement (PI)

Help the specializing adaptive interpreter (#1522)

New Features (ENH)

Add support for page labels (#1519)

Bug Fixes (BUG)

upgrade clone_document_root (#1520) by @pubpub-zz

Miscellaneous

DOC: Fix migration guide link by @abyesilyurt in https://github.com/py-pdf/pypdf/pull/1516

MAINT: Minor Improvements by @robbiebusinessacc in https://github.com/py-pdf/pypdf/pull/1523

New Contributors

@abyesilyurt made their first contribution in https://github.com/py-pdf/pypdf/pull/1516

@robbiebusinessacc made their first contribution in https://github.com/py-pdf/pypdf/pull/1523

Full Changelog: https://github.com/py-pdf/pypdf/compare/3.1.0...3.2.0
3.1.0(Dec 23, 2022)

What's Changed

Move PyPDF2 to pypdf (#1513). The name is now all lowercase with no number, both for installation and for import. PyPDF2 will no longer receive updates. The community should move back to its roots (pypdf).

Full Changelog: https://github.com/py-pdf/pypdf/compare/3.0.0...3.1.0
3.0.0(Dec 22, 2022)
What's Changed

BREAKING CHANGES

Deprecate features with PyPDF2==3.0.0 (#1489)

Refactor Fit / Zoom parameters (#1437)

New Features (ENH)

Add Cloning (#1371) by @pubpub-zz

Allow int for indirect_reference in PdfWriter.get_object (#1490)

Documentation (DOC)

How to read PDFs from S3 (#1509)

Make MyST parse all links as simple hyperlinks (#1506) by @mbromet

Changed 'latest' for 'stable' generated docs (#1495) by @olsonperrensen

Adjust deprecation procedure (#1487)

Maintenance (MAINT)

Use typing.IO for file streams (#1498) by @thehale

New Contributors

@olsonperrensen made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1495

@thehale made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1498

@mbromet made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1506

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.12.1...3.0.0
2.12.1(Dec 10, 2022)
What's Changed

Documentation (DOC)

Deduplicate extract_text docstring (#1485)

How to cite PyPDF2 (#1476)

Maintenance (MAINT)

Consistency changes:

indirect_ref/ido ➔ indirect_reference, dest➔ page_destination (#1467) by @kygoben

owner_pwd/user_pwd ➔ owner_password/user_password (#1483)

position ➜ page_number in Merger.merge (#1482) by @Infus3d

indirect_ref ➜ indirect_reference (#1484)

New Contributors

@kygoben made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1467

@Infus3d made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1482

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.12.0...2.12.1
2.12.0(Dec 10, 2022)
What's Changed

Version 2.12.0, 2022-12-10

New Features (ENH)

Add support to extract gray scale images (#1460) by @joeywang4

Make PdfReader.get_object accept integer arguments (#1459) by @pubpub-zz

Add 'threads' property to PdfWriter (#1458) by @pubpub-zz

Add 'open_destination' property to PdfWriter (#1431) by @pubpub-zz

Bug Fixes (BUG)

Scale PDF annotations (#1479) by @joshhendo

Robustness (ROB)

Padding issue with AES encryption (#1469)

Accept empty object as null objects (#1477) by @pubpub-zz

Documentation (DOC)

Add module documentation the PaperSize class (#1447) by @MagnumBarrage

Maintenance (MAINT)

Use 'page_number' instead of 'pagenum' (#1365)

Add List of pages to PageRangeSpec (#1456) by @pubpub-zz

Testing (TST)

Cleanup temporary files (#1454) by @pubpub-zz

Mark test_tounicode_is_identity as external (#1449) by @heirecka

Use Ubuntu 20.04 for running CI test suite (#1452) by @MasterOdin

New Contributors

@heirecka made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1449

@MagnumBarrage made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1447

@joeywang4 made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1460

@joshhendo made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1479

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.11.2...2.12.0
2.11.2(Nov 20, 2022)
What's Changed

New Features (ENH)

Add remove_from_tree (#1432) by @pubpub-zz

Add AnnotationBuilder.rectangle (#1388)

Bug Fixes (BUG)

JavaScript executed twice (#1439) by @pubpub-zz

ToUnicode stores /Identity-H instead of stream (#1433) by @pubpub-zz

Declare Pillow as optional dependency (#1392)

Developer Experience (DEV)

Link 'Full Changelog' automatically

Modify read_string_from_stream to a benchmark (#1415)

Improve error reporting of read_object (#1412) by @pubpub-zz

Test Python 3.11 (#1404)

Extend Flake8 ignore list (#1410)

Use correct pytest markers (#1407)

Move project configuration to pyproject.toml (#1382) by @singingwolfboy

Documentation (DOC)

Fix typos in installation.md by @amyreyespdx in https://github.com/py-pdf/PyPDF2/pull/1419

Typos in PDF format documentation by @pavlidvg in https://github.com/py-pdf/PyPDF2/pull/1438

New Contributors

@singingwolfboy made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1391

@amyreyespdx made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1419

@pavlidvg made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1438

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.11.1...2.11.2
2.11.1(Oct 9, 2022)
What's Changed

Bug Fixes (BUG)

td matrix (#1373) by @srogmann

Cope with cmap from #1322 (#1372) by @pubpub-zz

Robustness (ROB)

Cope with str returned from get_data in cmap (#1380) by @pubpub-zz

Documentation (DOC)

Remove watermark PageObject declaration as it is already present inside for-loop (#1384) by @cs2sandeep

New Contributors

@cs2sandeep made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1384

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.11.0...2.11.1
2.11.0(Sep 25, 2022)
What's Changed

New Features (ENH):

Addition of optional visitor-functions in extract_text() (#1252) by @srogmann

Add PageObject.images attribute (#1330) by @MartinThoma

Add metadata.creation_date and modification_date (#1364) by @MartinThoma

Bug Fixes (BUG):

Lookup index in _xobj_to_image can be ByteStringObject (#1366)

'IndexError: index out of range' when using extract_text (#1361)

Errors in transfer_rotation_to_content() (#1356) by @pubpub-zz

Robustness (ROB):

Ensure update_page_form_field_values does not fail if no fields (#1346) by @pubpub-zz

Testing (TST):

read_string_from_stream performance (#1355) by @mergezalot

New Contributors

@srogmann made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1252

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.9...2.11.0
2.10.9(Sep 18, 2022)
What's Changed

New Features (ENH)

Add rotation property and transfer_rotate_to_content (#1348) by @pubpub-zz

Performance Improvements (PI)

Avoid string concatenation with large embedded base64-encoded images (#1350) by @mergezalot

Bug Fixes (BUG)

Format floats using their intrinsic decimal precision (#1267) by @programmarchy

Robustness (ROB)

Fix merge_page for pages without resources (#1349) by @pubpub-zz

New Contributors

@mergezalot made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1350

@programmarchy made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1267

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.8...2.10.9
2.10.8(Sep 14, 2022)
What's Changed

ROB: Improve NameObject reading/writing by @pubpub-zz in https://github.com/py-pdf/PyPDF2/pull/1345

ENH: Add PageObject.user_unit property by @MartinThoma in https://github.com/py-pdf/PyPDF2/pull/1336

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.7...2.10.8
2.10.7(Sep 11, 2022)
What's Changed

Bug Fixes (BUG)

Fix Error in transformations (#1341) by @pubpub-zz

Decode #23 in NameObject (#1342) by @pubpub-zz

Testing (TST)

Use pytest.warns() for warnings, and .raises() for exceptions (#1325) by @mgorny

New Contributors

@mgorny made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1325

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.6...2.10.7
2.10.6(Sep 9, 2022)
What's Changed

Two robustness issues were fixed by @pubpub-zz - thank you :pray: The infinite loop issue might also be a security concern, depending on how you use PyPDF2.

Robustness (ROB):

Fix infinite loop due to Invalid object (#1331)

Fix image extraction issue with superfluous whitespaces (#1327)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.5...2.10.6
1.28.6(Sep 8, 2022)
This is a bugfix for the old 1.x branch of PyPDF2 that still supports Python 2. Please try to update to the latest PyPDF2 > 2.0.0 version to get way better text extraction, support for modern encryption, and much more.

What's Changed

BUG: Adjust 'super' calls for Python 2 by @omit66 in https://github.com/py-pdf/PyPDF2/pull/1335

New Contributors

@omit66 made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1335

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.5...1.28.6
2.10.5(Sep 4, 2022)
What's Changed

New Features (ENH)

Process XRefStm (#1297)

Auto-detect RTL for text extraction (#1309) by @pubpub-zz

Bug Fixes (BUG)

Avoid scaling cropbox twice (#1314) by @yegorLitvinov

Robustness (ROB)

Fix offset correction in revised PDF (#1318) by @pubpub-zz

Crop data of /U and /O in encryption dictionary to 48 bytes (#1317) by @exiledkingcc

MultiLine bfrange in cmap (#1299) by @pubpub-zz

Cope with 2 digit codes in bfchar (#1310) by @pubpub-zz

Accept '/annn' charset as ASCII code (#1316) by @pubpub-zz

Log errors during Float / NumberObject initialization (#1315) by @pubpub-zz

Cope with corrupted entries in xref table (#1300) by @pubpub-zz

Documentation (DOC)

Migration guide (PyPDF2 1.x ➔ 2.x) (#1324)

Creating a coverage report (#1319)

Fix AnnotationBuilder.free_text example (#1311)

Fix usage of page.scale by replacing it with page.scale_by (#1313) by @yegorLitvinov

Developer Experience (DEV)

Only run coverage for PyPDF2

Maintenance (MAINT)

PdfReaderProtocol (#1303)

Throw PdfReadError if Trailer can't be read (#1298) by @ediamondscience

Remove catching OverflowException (#1302)

Testing (TST)

Catch Exception for sample-files repo (#1307)

New Contributors

@ediamondscience made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1298

@yegorLitvinov made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1313

@markdlevy made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1311

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.4...2.10.5
2.10.4(Aug 28, 2022)
What's Changed

Robustness (ROB)

Fix errors/warnings on no /Resources within extract_text (#1276) by @pubpub-zz

Add required line separators in ContentStream ArrayObjects (#1281) by @pubpub-zz

Maintenance (MAINT)

Use NameObject idempotency (#1290)

Testing (TST)

Rectangle deletion (#1289)

Add workflow tests (#1287)

Remove files after tests ran (#1286)

Packaging (PKG)

Add minimum version for typing_extensions requirement (#1277) by @Shortfinga

New Contributors

@Shortfinga made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1277

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.3...2.10.4
2.10.3(Aug 21, 2022)
What's Changed

Robustness (ROB)

Decrypt returns empty bytestring (#1258) by @pubpub-zz

Developer Experience (DEV)

Modify CI to better verify built package contents (#1244) by @MasterOdin

Maintenance (MAINT)

Let PdfMerger._create_stream raise NotImplemented (#1251) and remove 'mine' as PdfMerger always creates the stream (#1261)

password param of _security._alg32(...) is only a string, not bytes (#1259)

Remove unreachable code in read_block_backwards (#1250) and _extract_text (#1262)

Testing (TST)

Delete annotations (#1263)

Close PdfMerger in tests (#1260)

PdfReader.xmp_metadata workflow (#1257)

Various PdfWriter (Layout, Bookmark deprecation) (#1249)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.2...2.10.3
2.10.2(Aug 15, 2022)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.1...2.10.2
2.10.1(Aug 15, 2022)
What's Changed

Bug Fixes (BUG)

TreeObject.remove_child had a non-PdfObject assignment for Count (#1233, #1234)

Fix stream truncated prematurely (#1223) by @pubpub-zz

Documentation (DOC)

Fix docstring formatting (#1228)

Maintenance (MAINT)

Split generic.py (#1229)

Testing (TST)

Decrypt AlgV4 with owner password (#1239)

AlgV5.generate_values (#1238)

TreeObject.remove_child / empty_tree (#1235, #1236)

create_string_object (#1232)

Free-Text annotations (#1231)

generic._base (#1230)

Strict get fonts (#1226)

Increase PdfReader coverage (#1219, #1225)

Increase PdfWriter coverage (#1237)

100% coverage for utils.py (#1217)

PdfWriter exception non-binary stream (#1218)

Don't check coverage for deprecated code (#1216)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.0...2.10.1
2.10.0(Aug 7, 2022)
What's Changed

New Features (ENH):

"with" support for PdfMerger and PdfWriter (#1193) by @JianzhengLuo

Add AnnotationBuilder.text(...) to build text annotations (#1202)

Bug Fixes (BUG):

Allow IndirectObjects as stream filters (#1211)

Documentation (DOC):

Font scrambling

Page vs Content scaling (#1208)

Example for orientation parameter of extract_text (#1206) by @pubpub-zz

Fix AnnotationBuilder parameter formatting (#1204)

Developer Experience (DEV):

Add flake8-print (#1203)

Maintenance (MAINT):

Introduce WrongPasswordError / FileNotDecryptedError / EmptyFileError (#1201) by @chilledgeek

New Contributors 🎉

@JianzhengLuo made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1193

@chilledgeek made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1201

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.9.0...2.10.0
2.9.0(Jul 31, 2022)
What's Changed

New Features (ENH)

Add ability to add hex encoded colors to outline items (#1186) by @mtd91429

Add support for pathlib.Path in PdfMerger.merge (#1190) by @MartinThoma

Add link annotation (#1189) by @MartinThoma

Add capability to filter text extraction by orientation (#1175) by @pubpub-zz

Bug Fixes (BUG)

Named Dest in PDF1.1 (#1174) by @pubpub-zz

Incomplete Graphic State save/restore (#1172) by @pubpub-zz

Documentation (DOC)

Update changelog url in package metadata (#1180) by @mkniewallner

Mention camelot for table extraction (#1179) by @MartinThoma

Mention pyHanko for signing PDF documents (#1178) by @MartinThoma

We have CMAP support since a while (#1177) by @MartinThoma

Maintenance (MAINT)

Consistent usage of warnings / log messages (#1164) by @MartinThoma

Consistent terminology for outline items (#1156) by @mtd91429

New Contributors

@mkniewallner made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1180 :tada:

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.8.1...2.9.0
2.8.1(Jul 25, 2022)
What's Changed

Bug Fixes (BUG)

u_hash in AlgV4.compute_key (#1170) by @exiledkingcc

Robustness (ROB)

Fix loading of file from #134 (#1167)

Cope with empty DecodeParams (#1165) by @pubpub-zz

Documentation (DOC)

Typo in merger deprecation warning message (#1166) by @pubpub-zz

Maintenance (MAINT)

Package updates; solve mypy strict remarks (#1163)

Testing (TST)

Add test from #325 (#1169) by @pubpub-zz

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.8.0...2.8.1
2.8.0(Jul 24, 2022)
What's Changed

Thank you @pubpub-zz and @exiledkingcc for your contributions :heart:

New Features (ENH)

Add writer.add_annotation, page.annotations, and generic.AnnotationBuilder (#1120)

Bug Fixes (BUG)

Set /AS for /Btn form fields in writer (#1161)

Ignore if /Perms verify failed (#1157)

Robustness (ROB)

Cope with utf16 character for space calculation (#1155)

Cope with null params for FitH / FitV destination (#1152)

Handle outlines without valid destination (#1076)

Developer Experience (DEV)

Introduce _utils.logger_warning (#1148)

Maintenance (MAINT)

Break up parse_to_unicode (#1162)

Add diagnostic output to exception in read_from_stream (#1159)

Reduce PdfReader.read complexity (#1151)

Testing (TST)

Add workflow tests found by arc testing (#1154)

Decrypt file which is not encrypted (#1149)

Test CryptRC4 encryption class; test image extraction filters (#1147)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.7.0...2.8.0
2.7.0(Jul 21, 2022)
What's Changed

New Features (ENH)

Add outline_count property (#1129)

Bug Fixes (BUG)

Make reader.get_fields also return dropdowns with options (#1114)

Add deprecated EncodedStreamObject functions back until PyPDF2==3.0.0 (#1139)

Robustness (ROB)

Cope with missing /W entry (#1136)

Cope with invalid parent xref (#1133)

Documentation (DOC)

Contributors file (#1132)

Fix type in signature of PdfWriter.add_uri (#1131)

Developer Experience (DEV)

Add .git-blame-ignore-revs (#1141)

Code Style (STY)

Fixing typos (#1137)

Re-use code via get_outlines_property in tests (#1130)

New Contributors

@KourFrost made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1114

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.6.0...2.7.0
1.28.5(Jul 21, 2022)
What's Changed

BUG: Add missing deprecated EncodedStreamObject functions by @MasterOdin in https://github.com/py-pdf/PyPDF2/pull/1140

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.4...1.28.5
2.6.0(Jul 17, 2022)
What's Changed

New Features (ENH)

Add color and font_format to PdfReader.outlines[i] (#1104)

Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG)

Use build_destination for named destination outlines (#1128)

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)

Prevent deduplication of PageObject (#1105)

None-check in DictionaryObject.read_from_stream (#1113)

Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC)

Explanation for git submodule

Watermark and stamp (#1095)

Maintenance (MAINT)

Text extraction improvements (#1126)

Destination.color returns ArrayObject instead of tuple as fallback (#1119)

Use add_bookmark_destination in add_bookmark (#1100)

Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST)

Add test for arab text (#1127)

Add xfail for decryption fail (#1125)

Add xfail test for IndexError when extracting text (#1124)

Add MCVE showing outline title issue (#1123)

Code Style (STY)

Use IntFlag for permissions_flag / update_page_form_field_values (#1094)

Simplify code (#1101)

New Contributors

@mtd91429 made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1104

@dkg made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1110

@jlshin made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1113

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.5.0...2.6.0
2.5.0(Jul 10, 2022)
What's Changed

New Features (ENH)

Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Add PageObject._get_fonts (#1083)

Performance Improvements (PI)

Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG)

Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)

Column default for CCITTFaxDecode (#1079)

Robustness (ROB)

Guard against None-value in _get_outlines (#1060)

Documentation (DOC)

Stamps and watermarks (#1082)

OCR vs PDF text extraction (#1081)

Python Version support

Formatting of CHANGELOG

Developer Experience (DEV)

Cache downloaded files (#1070)

Speed-up for CI (#1069)

Maintenance (MAINT)

Set page.rotate(angle: int) (#1092)

Issue #416 was fixed by #1015 (#1078)

Testing (TST)

Image extraction (#1080)

Image extraction (#1077)

Code Style (STY)

Apply black

Typo in Changelog

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.4.2...2.5.0
2.4.2(Jul 5, 2022)
What's Changed

New Features (ENH)

Add PdfReader.xfa attribute (#1026)

Bug Fixes (BUG)

Wrong page inserted when PdfMerger.merge is done (#1063)

Resolve IndirectObject when it refers to a free entry (#1054)

Developer Experience (DEV)

Added {posargs} to tox.ini (#1055)

Maintenance (MAINT)

Remove PyPDF2._utils.bytes_type (#1053)

Testing (TST)

Scale page (indirect rect object) (#1057)

Simplify pathlib PdfReader test (#1056)

IndexError of VirtualList (#1052)

Invalid XML in xmp information (#1051)

No pycryptodome (#1050)

Increase test coverage (#1045)

Code Style (STY)

DOC of compress_content_streams (#1061)

Minimize diff for #879 (#1049)

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.4.1...2.4.2
2.4.1(Jun 30, 2022)
What's Changed

New Features (ENH)

Add writer.pdf_header property (getter and setter) (#1038)

Performance Improvements (PI)

Remove b_ call in FloatObject.write_to_stream (#1044)

Check duplicate objects in writer._sweep_indirect_references (#207)

Documentation (DOC)

How to suppress exceptions/warnings/log messages (#1037)

Remove hyphen from lossless (#1041)

Compression of content streams (#1040)

Fix inconsistent variable names in add-watermark.md (#1039)

File size reduction

Add CHANGELOG to the rendered docs (#1023)

Maintenance (MAINT)

Handle XML error when reading XmpInformation (#1030)

Deduplicate Code / add mutmut config (#1022)

Code Style (STY)

Use unnecessary one-line function / class attribute (#1043)

Docstring formatting (#1033)

New Contributors

@Hatell made their first contribution in https://github.com/py-pdf/PyPDF2/pull/207

@behzadfhm made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1039

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.4.0...2.4.1
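
The 2.4.1 and 2.4.0 notes add a pdf_header property to the writer and the reader. As background, a PDF file starts with a short header such as `%PDF-1.7`. The sketch below reads that header from raw bytes using only the standard library; it is an illustration, not the library's implementation, and `read_pdf_header` and `is_pdf_header` are hypothetical helper names.

```python
import io
from typing import BinaryIO


def read_pdf_header(stream: BinaryIO) -> str:
    """Return the first 8 bytes of a stream as text, restoring the stream position."""
    position = stream.tell()
    stream.seek(0)
    # latin-1 maps every byte to a character, so decoding cannot fail.
    header = stream.read(8).decode("latin-1")
    stream.seek(position)
    return header


def is_pdf_header(header: str) -> bool:
    """Check that the header has the '%PDF-' prefix expected at the start of a PDF."""
    return header.startswith("%PDF-")


# Hypothetical in-memory data standing in for the start of a real file.
sample = io.BytesIO(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")
print(read_pdf_header(sample))
```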
2.4.0(Jun 26, 2022)
What's Changed

Thanks to @exiledkingcc PyPDF2 now also supports R6 decryption 🎉 Thank you 🤗

New Features (ENH)

Support R6 decrypting (#1015)

Add PdfReader.pdf_header (#1013)

Performance Improvements (PI)

Remove ord_ calls (#1014)

Bug Fixes (BUG)

Fix missing page for bookmark (#1016)

Robustness (ROB)

Deal with invalid Destinations (#1028)

Documentation (DOC)

get_form_text_fields does not extract dropdown data (#1029)

Adjust PdfWriter.add_uri docstring

Mention crypto extra_requires for installation (#1017)

Developer Experience (DEV)

Use \n line endings everywhere (#1027)

Adjust string formatting to be able to use mutmut (#1020)

Update Bug report template

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.3.1...2.4.0
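
The 2.4.0 notes mention a crypto extra for installation, and the issue quoted earlier shows a DependencyError when PyCryptodome is missing. One way to check for the optional dependency before opening encrypted files is sketched below. It uses only the standard library; `has_module` is a hypothetical helper name, and the sketch assumes PyCryptodome is importable under the name `Crypto`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: the pycryptodome package installs the top-level module "Crypto".
# (The legacy pycrypto package used the same name, so this is only a rough check.)
if has_module("Crypto"):
    print("PyCryptodome appears to be installed")
else:
    print("PyCryptodome not found; try: pip install pycryptodome")
```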
2.3.1(Jun 19, 2022)
What's Changed

Bug Fixes (BUG)

Forgot to add the internal _codecs subpackage.

Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.3.0...2.3.1

Owner

Matthew Stamy

GitHub https://pythonhosted.org/PyPDF2/

PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

9 Jan 30, 2022

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

1.8k Dec 29, 2022

Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

3 Mar 13, 2022

borb is a library for reading, creating and manipulating PDF files in python.

2.9k Jan 1, 2023

pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

1.6k Jan 3, 2023

Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox module of the Tkinter library.

1 Feb 13, 2022

Processes a PDF to make it compatible with PDF/X and with grayscale printers.

tratapdf Processes a PDF to make it compatible with PDF/X and with grayscale printers. Dependencies: icc-profiles, ghostscript, PDF viewer

1 Nov 30, 2021

Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local HTML files and online HTML resources into PDF. It is a Python script which u

1 Nov 9, 2021

Python script that splits PDF files.

Automatic PDF Splitter This script can create new single-page PDF files from multi-page PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

5 Apr 2, 2022

Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

1.9k Jan 1, 2023

Merge multiple PDF files into one.

PDF Merger Merge multiple PDF files into one. Usage % python pdf_merger.py -h usage: pdf_merger.py [-h] [-o OUTPUT] [-f [FILES ...]] optional argumen

6 Oct 3, 2022

Program that locks/unlocks pdf files🐍

PDFtools A program that locks/unlocks PDF files. Requirements • How to use • Screenshots. Warning: change the paths referring

1 Nov 4, 2021

Telegram bot that can do a lot of things related to PDF files.

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

130 Dec 26, 2022

Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

1 Feb 9, 2022

This book will take you on an exploratory journey through the PDF format, and the borb Python library.

281 Jan 1, 2023

A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

49 Nov 7, 2022

x-ray is a Python library for finding bad redactions in PDF documents.

A tool to detect whether a PDF has a bad redaction

73 Dec 19, 2022

Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

76 Dec 12, 2022

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

4.9k Jan 4, 2023
