A modern pure-Python library for reading PDF files.
The goal is a modern interface for handling PDF files that is internally consistent and follows typical Python conventions.
The library should be pure Python (no C extensions) but allow the backend to be swapped, similar in concept to matplotlib backends and Keras backends.
The default backend could be PyPDF2.
Possible other backends could be PyMuPDF (using MuPDF) and PikePDF (using QPDF).
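A minimal sketch of how a swappable backend could be structured. None of these names come from the library: `Backend`, `InMemoryBackend`, and `Document` are hypothetical and exist only for illustration. The toy backend reads no real PDF files.

```python
from typing import Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical interface that every backend would implement."""

    def page_count(self, path: str) -> int: ...

    def page_text(self, path: str, index: int) -> str: ...


class InMemoryBackend:
    """Toy backend for illustration only; it serves canned page texts."""

    def __init__(self, documents: Dict[str, List[str]]) -> None:
        self._documents = documents

    def page_count(self, path: str) -> int:
        return len(self._documents[path])

    def page_text(self, path: str, index: int) -> str:
        return self._documents[path][index]


class Document:
    """Front end that delegates all work to whichever backend it is given."""

    def __init__(self, path: str, backend: Backend) -> None:
        self._path = path
        self._backend = backend

    def __len__(self) -> int:
        return self._backend.page_count(self._path)

    def __getitem__(self, index: int) -> str:
        return self._backend.page_text(self._path, index)


backend = InMemoryBackend({"example.pdf": ["first page", "second page"]})
doc = Document("example.pdf", backend)
print(len(doc), doc[1])
```

Because the front end only depends on the interface, a backend wrapping PyPDF2, PyMuPDF, or PikePDF could be substituted without changing calling code.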
WARNING: This library is UNSTABLE at the moment! Expect many changes!
Installation
pip install pdffile
Usage
Retrieve Metadata
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1
>>> doc.metadata
Metadata(
title=None,
producer='pdfTeX-1.40.23',
creator='TeX',
creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
other={
'/CreationDate': "D:20220403180542+02'00'",
'/ModDate': "D:20220403180542+02'00'",
'/Trapped': '/False',
'/PTEX.Fullbanner': 'This is pdfTeX, V...'})
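The raw '/CreationDate' and '/ModDate' values use the PDF date string format (D:YYYYMMDDHHmmSS followed by an optional UTC offset). The sketch below shows one way such a string could be turned into a datetime. `parse_pdf_date` is a hypothetical helper, not part of this library, and it only handles the full 14-digit form seen in the output above. Unlike the `creation_date` shown there, it keeps the UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Full 14-digit timestamp with an optional +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(?P<stamp>\d{14})"
    r"(?:(?P<sign>[+-])(?P<hours>\d{2})'(?P<minutes>\d{2})'?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string such as D:20220403180542+02'00'."""
    match = _PDF_DATE.match(raw)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {raw!r}")
    parsed = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
    if match.group("sign") is None:
        # No offset given: return a naive datetime.
        return parsed
    offset = timedelta(
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
    )
    if match.group("sign") == "-":
        offset = -offset
    return parsed.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20220403180542+02'00'")
print(created.isoformat())
```

Dropping the offset with `created.replace(tzinfo=None)` gives the same naive value as the `creation_date` field in the example output.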
Encrypted PDFs
If you have an encrypted PDF, provide the password when opening it:
doc = pdf.PdfFile(pdf_path, password=password)
All subsequent operations work the same way as for an unencrypted file.
Get Outline
>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
Links(page=5, text='1 Header'),
Links(page=5, text='1.1 A section'),
Links(page=9, text='2 Foobar'),
Links(page=108, text='References')
]
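The outline entries can be turned into a plain-text table of contents. The sketch below uses a namedtuple as a stand-in for the `Links` objects shown above, so it runs without the library. Guessing the nesting depth from the section numbering is an assumption of this example, not something the library is known to provide.

```python
from collections import namedtuple

# Stand-in for the Links objects returned by doc.outline (assumption).
Links = namedtuple("Links", ["page", "text"])

outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]


def depth(text: str) -> int:
    """Guess nesting depth from a leading number such as '1.1'."""
    first = text.split(" ", 1)[0]
    if first.replace(".", "").isdigit():
        return first.count(".")
    return 0


def format_outline(entries) -> str:
    """Render one indented line per outline entry."""
    lines = []
    for entry in entries:
        indent = "  " * depth(entry.text)
        lines.append(f"{indent}{entry.text} (page {entry.page})")
    return "\n".join(lines)


print(format_outline(outline))
```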
Extract Text
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'
Alternatively, you can use doc.text to get the text of all pages.
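Since a document supports len() and indexing, and each page exposes .text, per-page text can be searched with ordinary Python. The sketch below uses a list of `FakePage` objects in place of a real `PdfFile` so it runs on its own. Both `FakePage` and `pages_containing` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakePage:
    """Stand-in for a page object; only the .text attribute is modelled."""

    text: str


def pages_containing(doc, needle: str) -> List[int]:
    """Return zero-based indices of pages whose text contains needle."""
    needle = needle.lower()
    return [i for i in range(len(doc)) if needle in doc[i].text.lower()]


fake_doc = [FakePage("Lorem ipsum\n1\n"), FakePage("dolor sit amet\n2\n")]
print(pages_containing(fake_doc, "DOLOR"))
```

Note that extracted text may lack spaces, as in the output above, so search terms may need to be normalized the same way.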