pikepdf is a Python library for reading and writing PDF files.

Last update: Jan 3, 2023

Overview

pikepdf

pikepdf is based on QPDF, a powerful PDF manipulation and repair library.

Python + QPDF = "py" + "qpdf" = "pyqpdf", which looks like a dyslexia test. Say it out loud, and it sounds like "pikepdf".

# Elegant, Pythonic API
with pikepdf.open('input.pdf') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save('output.pdf')

To install:

pip install pikepdf

For users who want to build from source, see installation.

pikepdf is documented and actively maintained. Commercial support is available. We support just about everything x86-64, including PyPy, and Apple Silicon on a best effort basis.

Features

This library is similar to PyPDF2 and pdfrw - it provides low level access to PDF features and allows editing and content transformation of existing PDFs. Some knowledge of the PDF specification may be helpful. It does not have the capability to render a PDF to image.

Feature	pikepdf	PyPDF2	pdfrw
Editing, manipulation and transformation of existing PDFs	✔	✔	✔
Based on an existing, mature PDF library	QPDF	✘	✘
Implementation	C++ and Python	Python	Python
PDF versions supported	1.1 to 1.7	1.3?	1.7
Python versions supported	3.7-3.10 ¹	2.6-3.6	2.6-3.6
Save and load password protected (encrypted) PDFs	✔ (except public key)	✘ (Only obsolete RC4)	✘ (not at all)
Save and load PDF compressed object streams (PDF 1.5)	✔	✘	✘
Creates linearized ("fast web view") PDFs	✔	✘	✘
Actively maintained
Test suite coverage		very low	unknown
Creates PDFs that pass PDF validation tests	✔	✘	?
Modifies PDF/A without breaking PDF/A compliance	✔	✘	?
Automatically repairs PDFs with internal errors	✔	✘	✘
PDF XMP metadata editing	✔	read-only	✘
Documentation	✔	basic	✔
Integrates with Jupyter and IPython notebooks for rapid development	✔	✘	✘

Testimonials

I decided to try writing a quick Python program with pikepdf to automate [something] and it "just worked". –Jay Berkenbilt, creator of QPDF

"Thanks for creating a great pdf library, I tested out several and this is the one that was best able to work with whatever I threw at it." –@cfcurtis

In Production

OCRmyPDF uses pikepdf to graft OCR text layers onto existing PDFs, to examine the contents of input PDFs, and to optimize PDFs.
pdfarranger is a small Python application that provides a graphical user interface to rotate, crop and rearrange PDFs.
PDFStitcher is a utility for stitching PDF pages into a single document (i.e. N-up or page imposition).

License

pikepdf is provided under the Mozilla Public License 2.0 license (MPL) that can be found in the LICENSE file. By using, distributing, or contributing to this project, you agree to the terms and conditions of this license.

Informally, MPL 2.0 is not a "viral" license. It may be combined with other work, including commercial software. However, you must disclose your modifications to pikepdf in source code form. In other words, fork this repository on GitHub or elsewhere and commit your contributions there, and you've satisfied your obligations. MPL 2.0 is compatible with the GPL and LGPL - see the guidelines for notes on use in GPL.

The debian/copyright file describes licensing terms for the test suite and the provenance of test resources.

¹ pikepdf 3.x and older support Python 3.6.

Comments

pikepdf tests will fail with qpdf 10.6, but can be fixed without breaking compatibility

When running pikepdf's tests against qpdf 10.6, the following failures occur:

b = b'\x7f'

    @given(binary())
    def test_codec_involution(b):
        # For all binary strings, there is a pdfdoc decoding. The encoding of that
        # decoding recovers the initial string. (However, not all str have a pdfdoc
        # encoding.)
>       assert b.decode('pdfdoc').encode('pdfdoc') == b
E       AssertionError: assert b'\x9f' == b'\x7f'
E         At index 0 diff: b'\x9f' != b'\x7f'
E         Use -v to get the full diff

and

s = '\x1f'

    @given(text())
    def test_break_encode(s):
        try:
            encoded_bytes = s.encode('pdfdoc')
        except ValueError as e:
            allowed_errors = [
                "'pdfdoc' codec can't encode character",
                "'pdfdoc' codec can't process Unicode surrogates",
                "'pdfdoc' codec can't encode some characters",
            ]
            if any((allowed in str(e)) for allowed in allowed_errors):
                return
            raise
        else:
>           assert encoded_bytes.decode('pdfdoc') == s
E           AssertionError: assert '˜' == '\x1f'
E             Strings contain only whitespace, escaping them using repr()
E             - '\x1f'
E             + '˜'

tests/test_codec.py:52: AssertionError

This is most likely because of qpdf/qpdf#606, which added previously omitted Unicode conversions for PDF Doc Encoding code points 0x18 through 0x1f and 0x7f. If you want to test mapping to an invalid code point, you can pick something lower than 0x18. That should map to the invalid character. Anyway, I'm not sure what the correct fix is for your test.
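The second failure can be reproduced in miniature. The sketch below is not pikepdf's or qpdf's codec: `toy_decode` and `toy_encode` are made-up names, and every byte outside 0x18-0x1f simply passes through, which is a simplification. The decoder applies the accent characters that the PDF specification's PDFDocEncoding table assigns to 0x18-0x1f, while the encoder still treats a control character such as '\x1f' as its own byte, so the round trip no longer returns the input.

```python
# Accent characters that PDFDocEncoding assigns to bytes 0x18-0x1f.
ACCENTS = {
    0x18: "\u02d8",  # breve
    0x19: "\u02c7",  # caron
    0x1A: "\u02c6",  # modifier letter circumflex
    0x1B: "\u02d9",  # dot above
    0x1C: "\u02dd",  # double acute (hungarumlaut)
    0x1D: "\u02db",  # ogonek
    0x1E: "\u02da",  # ring above
    0x1F: "\u02dc",  # small tilde
}
REVERSE = {char: byte for byte, char in ACCENTS.items()}


def toy_decode(data: bytes) -> str:
    """Decode with the 0x18-0x1f conversions applied (toy: other bytes pass through)."""
    return "".join(ACCENTS.get(byte, chr(byte)) for byte in data)


def toy_encode(text: str) -> bytes:
    """Hypothetical encoder that still maps code points below 0x80 to themselves."""
    out = bytearray()
    for char in text:
        if char in REVERSE:
            out.append(REVERSE[char])
        elif ord(char) < 0x80:
            out.append(ord(char))
        else:
            raise ValueError(f"toy codec can't encode character {char!r}")
    return bytes(out)


s = "\x1f"
round_trip = toy_decode(toy_encode(s))
print(repr(s), "->", repr(round_trip))
```

Under these assumptions the round trip turns '\x1f' into the small tilde, the same mismatch the test reports. A test that wants to keep the round-trip property would need either to exclude such inputs or to accept an encoding error for them.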

I plan to release qpdf 10.6 most likely tomorrow, February 8. I plan on preparing everything today. Other than version numbers and final release mechanics, qpdf's main is what 10.6 will look like. At this moment, I haven't yet updated configure.ac and libtool versions, but I will be doing that shortly.

opened by jberkenbilt 40

[Feature Request] - document object parsing

Hi, parsing a PDF in QDF mode is relatively easy because it's all text. For example, it's easy to identify that this is a data table cell:

112 0 obj << /K 72 0 R /P 60 0 R /S /TD /A 147 0 R>> endobj

and it's straightforward to change it to a /TH or to find the /A object to add a /Headers attribute. My question is: is it possible to do this using pikepdf? Could you please provide a code example in the documentation? Thank you very much!

opened by pfrederiksen40 18

pikepdf._qpdf.PdfError on a specific PDF file with pikepdf >= 4.5.0

Hi, I have a bug with a specific PDF (which I can't publish unfortunately).

The error is the following:

Traceback (most recent call last):
  File "/tests/test_pikepdf.py", line 10, in test_pikepdf
    pdf.save("/tmp/toto.pdf")
  File "/usr/local/lib/python3.8/site-packages/pikepdf/_methods.py", line 774, in save
    self._save(
pikepdf._qpdf.PdfError: operation for dictionary attempted on object of type null: returning null for attempted key retrieval

It happens with this python code, on the save statement:

import pikepdf
from pathlib import Path
fpath = Path("tests/fixtures/scanned_rotation.pdf")
pdf = pikepdf.Pdf.open(fpath)
pdf.save("/tmp/toto.pdf")

This is my Dockerfile to reproduce the problem:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8
RUN apt-get update -q -y && apt-get upgrade -q -y
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir pikepdf
COPY app /app
COPY tests /tests
WORKDIR /app

It worked until pikepdf version 4.4.1 but from 4.5.0 to current version (5.0.1) it raises this pikepdf._qpdf.PdfError exception.

Thank you for your help and support

opened by lsamper 17

compatibly support build with qpdf current main (cmake, PointerHolder)

qpdf's build is switching from autoconf and libdir to cmake. These changes switch the QPDF_SOURCE_TREE bit of pikepdf so that you can test against a qpdf source tree after the cmake transition is complete. I have tested these changes with my cmake branch.

The changes to setup.py are made such that they are backward compatible with a pre-cmake build. The documentation changes are not. (By the way, there was a typo in the old docs -- they mentioned .build/libs in one place instead of build/.libs.)

~~Please see #316 for a separate PR that includes only the setup.py change in case you wanted to merge that earlier. I haven't looked in pikepdf for other references to any qpdf build, but the change to setup.py is sufficient to allow me to continue to test pikepdf against qpdf as I move forward post-cmake.~~

The setup.py change could be accepted at any time. The documentation change may be premature to accept since it won't be accurate until a version of qpdf that uses cmake is released. I am still deciding whether to release qpdf 10.6.x with cmake (the cmake build is fully binary compatible with the autoconf build) or whether to wait.

I will be making some kind of announcement to qpdf-announce about testing the cmake build soon.

opened by jberkenbilt 16

Need an example of applying a watermark

Hi,

I need an example of applying a watermark.

I tested with the following code. It doesn't work as I expected: the "watermark" was not in the right location.

from reportlab.pdfgen import canvas
import pikepdf  # needed for the pikepdf.open() and pikepdf.new() calls below
from pikepdf import Array, Dictionary, Name, Pdf, PdfMatrix, Stream


INPUT_PDF="./test3.pdf"
WATERMARK_PDF='./test4.pdf'
OUTPUT_PDF='./test3_test4.pdf'

def generate_watermark(msg,fileName,x=55,y=220):
    c = canvas.Canvas(fileName, bottomup=0)
    c.setFontSize(32)
    c.setFillColorCMYK(0, 0, 0, 0, alpha=0.7)
    c.rect(204, 199, 157, 15, stroke=0, fill=1)
    c.setFillColorCMYK(0, 0, 0, 100, alpha=0.7)
    c.drawString(x, y,msg )
    c.save()
   
# generate two pdfs
generate_watermark('file3',INPUT_PDF,100,100)
generate_watermark('file4',WATERMARK_PDF)


with pikepdf.open(INPUT_PDF) as input_pdf, \
            pikepdf.open(WATERMARK_PDF) as watermark_pdf, \
            open(OUTPUT_PDF, 'wb') as output_stream:
        
        # Create new output PDF
        output_pdf = pikepdf.new()


        for i in range(len(input_pdf.pages)):
            #load and insert watermark
            input_pdf.pages[i].page_contents_add(watermark_pdf.pages[0].Contents)
            input_pdf.pages[i].page_contents_coalesce()

        output_pdf.pages.extend(input_pdf.pages)
        output_pdf.save(output_stream)  # save to a new file

question

opened by StevenLOL 15

5.0.1 fails to build wheel
I'm attempting to build pikepdf 5.0.1 for arm64 and armv7 architectures, for use in Docker.

When I attempt to build pikepdf using the following steps, it fails with an error about the pyproject.toml file.

Build Steps

git clone --quiet --depth 1 --branch v5.0.1 https://github.com/pikepdf/pikepdf.git

cd pikepdf

mkdir wheels

python3 -m pip wheel . -w wheels

Error

configuration error: `project` must contain ['name'] properties ... ValueError: invalid pyproject.toml config: `project`

Is there a specific version of pip, setuptools or wheel required to build?
opened by stumpylog 14
Python library pikepdf not allowing duplicate pages to be inserted in a pdf

With a given single page input pdf, I tried to create a pdf with two copies of the first page making it a new two page pdf document. I received the following error: PdfError: empty PDF (page 1 (numbered from zero): object 3 0): duplicate page reference found; this would cause loss of data

Can anyone help with this issue? Thanks!

opened by a-kaly 13
macOS 10.13 High Sierra - ImportError: pikepdf's extension library failed to import

While using pikepdf in a Django project, it throws:

ImportError: pikepdf's extension library failed to import. Any ideas what I am missing? I did build the qpdf repo as well.

opened by shashitechno 12
Moving py::gil_scoped_release after make_unique InputSource calls.
Fixing GIL-not-held issues.

The Python GIL is not held when Py_INCREF is invoked (via pybind11) in these two lines:

https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/mmap_inputsource-inl.h#L45 when called from https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/qpdf.cpp#L103

https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/qpdf_inputsource-inl.h#L31 when called from https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/qpdf.cpp#L121

How was this discovered?

I'm working on a systematic cleanup of the extended Google codebase, which imports this github project.

I'm globally testing with this Python patch (Python 3.7):

+int PyGILState_Check(void); /* Include/internal/pystate.h */
+
 #define Py_INCREF(op) (                              \
+    assert(PyGILState_Check()),                      \
     _Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA             \
     ((PyObject *)(op))->ob_refcnt++)

 #define Py_DECREF(op)                                \
     do {                                             \
+        assert(PyGILState_Check());                  \
         PyObject *_py_decref_tmp = (PyObject *)(op); \
         if (_Py_DEC_REFTOTAL _Py_REF_DEBUG_COMMA     \
         --(_py_decref_tmp)->ob_refcnt != 0)          \

The pikepdf tests fail, but are fixed with the change under this PR.

General background: The GIL must be held when calling any Python C API functions. In multithreaded applications that use callbacks this requirement can easily be violated by accident. A general tool to ensure GIL health is not available, but patching Python Py_INCREF & Py_DECREF as above provides a basic health check.

More background for easy reference: https://docs.python.org/3/glossary.html#term-global-interpreter-lock

Purely FYI: this was another issue uncovered with the patch above: https://reviews.llvm.org/D114722
opened by rwgk 11
Fix jbig2dec not working on Windows due to exclusive-locked temporary files.

This patch works around a Python bug that renders jbig2dec unusable on Windows platforms.

This is due to temporary files being locked in exclusive mode. See here for more details:

https://bugs.python.org/issue14243

Example of error:

subprocess.CalledProcessError: Command '['jbig2dec', '-e', '-o', 'E:\Users\kraptor\AppData\Local\Temp\tmpcxmi9rk9', 'E:\Users\kraptor\AppData\Local\Temp\tmpf200vgs_']' returned non-zero exit status 1.
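A common way to avoid this class of problem, sketched here without claiming it is what the patch does, is to make sure no temporary file is still held open when the external program runs. The helper name `run_with_closed_temp_files` is made up, and a child Python process that copies its input stands in for jbig2dec so the sketch runs anywhere.

```python
import subprocess
import sys
from pathlib import Path
from tempfile import TemporaryDirectory


def run_with_closed_temp_files(data: bytes) -> bytes:
    """Pass data to a child process through files that are closed before it starts."""
    with TemporaryDirectory() as tmpdir:
        in_path = Path(tmpdir) / "input.bin"
        out_path = Path(tmpdir) / "output.bin"

        # write_bytes opens, writes and closes the file, so nothing in this
        # process holds a handle on it when the child starts.
        in_path.write_bytes(data)

        # Stand-in for the external decoder: copy the input file to the output path.
        child_code = "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])"
        subprocess.run(
            [sys.executable, "-c", child_code, str(in_path), str(out_path)],
            check=True,
        )
        return out_path.read_bytes()


result = run_with_closed_temp_files(b"jbig2 payload")
print(result)
```

The temporary directory and both files are removed when the `with` block exits, so cleanup does not rely on delete-on-close behavior.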

opened by kraptor 11

jbig2dec error

Hi, I installed this from anaconda (using windows 10, so don't know if it works necessarily, anyone else?)

and got this after I put an Exception handler in to tell me more :

(desurvey) C:\Users\rscott\OneDrive - OZ Minerals\Exploration\Research\Python>python pdftestpike5.py
  0%|                                                                                  | 1/739 [00:00<06:57,  1.77it/s]got jbig2dec ok
error opening C:\Users\rscott\AppData\Local\Temp\tmpjgzipp79
Command '['jbig2dec', '-e', '-o', 'C:\\Users\\rscott\\AppData\\Local\\Temp\\tmp70ma71kl', 'C:\\Users\\rscott\\AppData\\Local\\Temp\\tmpjgzipp79']' returned non-zero exit status 1.
  0%|                                                                                  | 1/739 [00:00<07:48,  1.58it/s]
Traceback (most recent call last):
  File "pdftestpike5.py", line 88, in <module>
    out = image.extract_to(fileprefix=f"{fileout}-page{i:03}-img{j:03}")
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\desurvey\lib\site-packages\pikepdf\models\image.py", line 568, in extract_to
    extension = self._extract_to_stream(stream=bio)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\desurvey\lib\site-packages\pikepdf\models\image.py", line 525, in _extract_to_stream
    raise UnsupportedImageTypeError(repr(self))
pikepdf.models.image.UnsupportedImageTypeError: <pikepdf.PdfImage image mode=1 size=3906x2530 at 0x1d4ed183d30>

hopefully will get to try it on linux later

bug

opened by RichardScottOZ 11

Cannot install pikepdf on Python 3.11.1 on Windows

@jbarlow83 I documented the error trying to install pikepdf on Windows Server 2019 using python 3.11.1 in this thread - https://github.com/ocrmypdf/OCRmyPDF/issues/460#issuecomment-1366127238 - but wanted to create an issue here so it wasn't lost. Have the wheels been released for 3.11 so pikepdf can be updated?

python -m pip install pikepdf
Collecting pikepdf
  Using cached pikepdf-6.2.6.tar.gz (2.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting Pillow>=9.0
  Using cached Pillow-9.3.0-cp311-cp311-win32.whl (2.2 MB)
Collecting deprecation
  Using cached deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: lxml>=4.8 in c:\python311-32\lib\site-packages (from pikepdf) (4.9.2)
Requirement already satisfied: packaging in c:\python311-32\lib\site-packages (from pikepdf) (22.0)
Building wheels for collected packages: pikepdf
  Building wheel for pikepdf (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for pikepdf (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [52 lines of output]
      C:\Users\MMEadmin\AppData\Local\Temp\2\pip-build-env-f01r4mhj\overlay\Lib\site-packages\setuptools_scm\git.py:295: UserWarning: git archive did not support describe output
        warnings.warn("git archive did not support describe output")
      C:\Users\MMEadmin\AppData\Local\Temp\2\pip-build-env-f01r4mhj\overlay\Lib\site-packages\setuptools_scm\git.py:312: UserWarning: unexported git archival found
        warnings.warn("unexported git archival found")
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win32-cpython-311
      creating build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\codec.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\jbig2.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\objects.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\settings.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\_augments.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\_cpphelpers.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\_exceptions.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\_methods.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\_version.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\_xml.py -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\__init__.py -> build\lib.win32-cpython-311\pikepdf
      creating build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\encryption.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\image.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\matrix.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\metadata.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\outlines.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\_content_stream.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\_transcoding.py -> build\lib.win32-cpython-311\pikepdf\models
      copying src\pikepdf\models\__init__.py -> build\lib.win32-cpython-311\pikepdf\models
      running egg_info
      writing src\pikepdf.egg-info\PKG-INFO
      writing dependency_links to src\pikepdf.egg-info\dependency_links.txt
      writing requirements to src\pikepdf.egg-info\requires.txt
      writing top-level names to src\pikepdf.egg-info\top_level.txt
      listing git files failed - pretending there aren't any
      reading manifest file 'src\pikepdf.egg-info\SOURCES.txt'
      adding license file 'LICENSE.txt'
      adding license file 'licenses-for-wheels.txt'
      writing manifest file 'src\pikepdf.egg-info\SOURCES.txt'
      copying src\pikepdf\_qpdf.pyi -> build\lib.win32-cpython-311\pikepdf
      copying src\pikepdf\py.typed -> build\lib.win32-cpython-311\pikepdf
      running build_ext
      building 'pikepdf._qpdf' extension
      creating build\temp.win32-cpython-311
      creating build\temp.win32-cpython-311\Release
      creating build\temp.win32-cpython-311\Release\src
      creating build\temp.win32-cpython-311\Release\src\qpdf
      "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\bin\HostX86\x86\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -DPOINTERHOLDER_TRANSITION=4 -IC:\Users\MMEadmin\AppData\Local\Temp\2\pip-build-env-f01r4mhj\overlay\Lib\site-packages\pybind11\include -IC:\Python311-32\include -IC:\Python311-32\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\cppwinrt" /EHsc /Tpsrc/qpdf\annotation.cpp /Fobuild\temp.win32-cpython-311\Release\src/qpdf\annotation.obj /EHsc /bigobj /std:c++17
      annotation.cpp
      src/qpdf\annotation.cpp(4): fatal error C1083: Cannot open include file: 'qpdf/Constants.h': No such file or directory
      error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.34.31933\\bin\\HostX86\\x86\\cl.exe' failed with exit code 2
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pikepdf
Failed to build pikepdf
ERROR: Could not build wheels for pikepdf, which is required to install pyproject.toml-based projects
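The log shows pip falling back to the source tarball, and the paths (`Python311-32`, `lib.win32-cpython-311`) suggest a 32-bit interpreter. A source build needs the qpdf headers, which matches the missing `qpdf/Constants.h` error. As a quick diagnostic, not a fix, the snippet below reports which kind of interpreter is running. Whether a binary wheel exists for that combination still has to be checked against the files published for the release.

```python
import struct
import sys
import sysconfig

# Size of a C pointer in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
pointer_bits = struct.calcsize("P") * 8

# Platform string pip uses when matching wheels, e.g. "win32" or "win-amd64" on Windows.
platform_tag = sysconfig.get_platform()

version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version}, {pointer_bits}-bit, platform {platform_tag}")
```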

opened by champlin2 3

Error on image extraction

I was trying to extract images from a PDF using ext = pdfimage.extract_to(stream=stream)

On all my PDF files I got an exception on line 485 of image.py: self.filter_decodeparms[0][0] was '/DCTDecode' but self.filter_decodeparms[0][1] was None.

Your original code tries to get the dictionary value and defaults to 1. I propose also defaulting to 1 when the dictionary itself is None.

After fix the images were correctly saved.

Original

        def normal_dct_rgb() -> bool:
            # Normal DCTDecode RGB images have the default value of
            # /ColorTransform 1 and are actually in YUV. Such a file can be
            # saved as a standard JPEG. RGB JPEGs without YUV conversion can't
            # be saved as JPEGs, and are probably bugs. Some software in the
            # wild actually produces RGB JPEGs in PDFs (probably a bug).
            DEFAULT_CT_RGB = 1

            ct = self.filter_decodeparms[0][1].get('/ColorTransform', DEFAULT_CT_RGB)
            # ERROR self.filter_decodeparms[0][1] is None

            return self.mode == 'RGB' and ct == DEFAULT_CT_RGB

Fix

        def normal_dct_rgb() -> bool:
            # Normal DCTDecode RGB images have the default value of
            # /ColorTransform 1 and are actually in YUV. Such a file can be
            # saved as a standard JPEG. RGB JPEGs without YUV conversion can't
            # be saved as JPEGs, and are probably bugs. Some software in the
            # wild actually produces RGB JPEGs in PDFs (probably a bug).
            DEFAULT_CT_RGB = 1
            ct = DEFAULT_CT_RGB
            if self.filter_decodeparms[0] is not None and self.filter_decodeparms[0][1] is not None:
                # Assign the looked-up value; without the assignment ct always stays at the default
                ct = self.filter_decodeparms[0][1].get('/ColorTransform', DEFAULT_CT_RGB)
            return self.mode == 'RGB' and ct == DEFAULT_CT_RGB
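The guard can be exercised on its own. In this sketch plain Python dicts stand in for pikepdf dictionaries, and `color_transform` is a hypothetical helper, not part of pikepdf. It returns the default both when the decode parameters are missing entirely and when they lack a /ColorTransform entry.

```python
DEFAULT_CT_RGB = 1


def color_transform(decode_parms, default=DEFAULT_CT_RGB):
    """Return /ColorTransform from decode parameters, tolerating None."""
    if decode_parms is None:
        # No /DecodeParms at all: fall back to the default value.
        return default
    return decode_parms.get('/ColorTransform', default)


missing = color_transform(None)
empty = color_transform({})
explicit = color_transform({'/ColorTransform': 0})
print(missing, empty, explicit)
```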

opened by neojg 5

Error: read_bytes called on unfilterable stream for a simple PDF

Running the following simple code is returning the error. Code:

from pikepdf import Pdf, PdfImage

filename = "/home/user/hobbiate/finlens/src/finlensapp/tests/data/ocr/bronze/20 image files/MyFile.pdf"
pdf = Pdf.open(filename)
for page in pdf.pages:
    orig_keys = list(page.images.keys())
    for index, key in enumerate(orig_keys):
        print(f"Processing key {key}")
        try:
            rawimage = page.images[key]
            pdfimage = PdfImage(rawimage)
            img = pdfimage.as_pil_image()
        except Exception as e:
            print(e)

Error

(object 6,0, offset 915): read_bytes called on unfilterable stream

Pip list output:

Package       Version
deprecation   2.1.0
lxml          4.9.1
packaging     21.3
pikepdf       6.2.4
Pillow        9.3.0
pip           20.0.2
pkg-resources 0.0.0
pyparsing     3.0.9
setuptools    44.0.0

Operating System:

Ubuntu

opened by yashsemwal 3

'ValueError: buffer is not large enough' on PdfImage().extract_to() on some pngs

I'm getting

    tmpFileName = pdfImage.extract_to(fileprefix = "tmp")
  File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 668, in extract_to
    extension = self._extract_to_stream(stream=bio)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 611, in _extract_to_stream
    im = self._extract_transcoded()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 581, in _extract_transcoded
    im = self._extract_transcoded_1248bits()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 528, in _extract_transcoded_1248bits
    im = _transcoding.image_from_buffer_and_palette(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\_transcoding.py", line 143, in image_from_buffer_and_palette
    im = image_from_byte_buffer(buffer, size, stride)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\_transcoding.py", line 107, in image_from_byte_buffer
    return Image.frombuffer('L', size, buffer, "raw", 'L', stride, ystep)
  File "C:\ProgramData\Anaconda3\lib\site-packages\PIL\Image.py", line 2932, in frombuffer
    im = im._new(core.map_buffer(data, size, decoder_name, 0, args))
ValueError: buffer is not large enough

when trying to extract PNGs from some PDFs. Most PNGs are extracted correctly, but some cause this exception. I tried to debug a bit, but the only thing I noticed is that a "wrong" mode is given to PIL.Image.frombuffer(). By "wrong" I mean that 'L' is always sent there, even though, at least for the problematic PNG, self.mode == 'P'. I have no idea what that implies, but it is the only thing I was able to notice.

The code I'm using:

import os
from pathlib import Path
from pikepdf import Name, Pdf, PdfImage

files = [f for f in os.listdir('.') if os.path.isfile(f) and str(f).endswith(".pdf")]
for fileName in files:
    pdfFile = Pdf.open(fileName, allow_overwriting_input = True)
    for page in pdfFile.pages:
        for j, (name, rawImage) in enumerate(page.images.items()):
            pdfImage = PdfImage(rawImage)
            tmpFileName = pdfImage.extract_to(fileprefix = "tmp")

 # some unrelated work is done here

    pdfFile.save()
    pdfFile.close()

It crashes on element

1318 0 obj
<< /BitsPerComponent 8 /ColorSpace 636 0 R /Height 302 /Subtype /Image /Width 205 /Length 58912 >>

from attached pdf. Dyko.pdf

opened by AlexMatiash 0

Outline link issue when using page_location='XYZ'

When creating an outline, I use page_location='XYZ' to get "Zoom level: Inherit Zoom".

The following code meets the zoom requirement, but it automatically increases the destination page by 1, which prevents me from jumping to the first page.

from pikepdf import Pdf, OutlineItem

path = r'~'

pdf = Pdf.open(path + "/" + "tt.pdf")

with pdf.open_outline() as outline:
    outline.root.extend([
        OutlineItem('Section One', 0, page_location="XYZ")
    ])

pdf.save(path + "/" + "tt1.pdf")
pdf.close()

The 'Section One' entry should jump to the first page, but it jumps to the second page instead. If I remove page_location="XYZ" from the code above, the destination is correct, but there is no "Inherit Zoom" property.

Note: Python:3.8.5 pikepdf: 6.2.0

opened by sldzys 0

Owner

GitHub https://pikepdf.readthedocs.io/

PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

9 Jan 30, 2022

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

5k Jan 4, 2023

Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

3 Mar 13, 2022

Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox module of the Tkinter library.

1 Feb 13, 2022

Processes a PDF to make it compatible with PDF/X and with grayscale printers.

tratapdf Processes a PDF to make it compatible with PDF/X and with grayscale printers. Dependencies: icc-profiles, ghostscript, a PDF viewer

1 Nov 30, 2021

Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

1 Nov 9, 2021

Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

5 Apr 2, 2022

pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

387 Dec 10, 2022

Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

1.9k Jan 1, 2023

Merge multiple PDF files into one.

PDF Merger Merge multiple PDF files into one. Usage % python pdf_merger.py -h usage: pdf_merger.py [-h] [-o OUTPUT] [-f [FILES ...]] optional argumen

6 Oct 3, 2022

Program that locks/unlocks pdf files🐍

PDFtools Program that locks/unlocks PDF files Requirements • How to use • Screenshots Warning Change the paths relating

1 Nov 4, 2021

Telegram bot that can do a lot of things related to PDF files.

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

130 Dec 26, 2022

Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

1 Feb 9, 2022

This book will take you on an exploratory journey through the PDF format, and the borb Python library.

281 Jan 1, 2023

A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

49 Nov 7, 2022

x-ray is a Python library for finding bad redactions in PDF documents.

A tool to detect whether a PDF has a bad redaction

73 Dec 19, 2022

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

1.8k Dec 29, 2022

Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

76 Dec 12, 2022

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

4.9k Jan 4, 2023