pikepdf is a Python library for reading and writing PDF files.

Overview

pikepdf

pikepdf is based on QPDF, a powerful PDF manipulation and repair library.

Python + QPDF = "py" + "qpdf" = "pyqpdf", which looks like a dyslexia test. Say it out loud, and it sounds like "pikepdf".

import pikepdf

# Elegant, Pythonic API
with pikepdf.open('input.pdf') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save('output.pdf')

To install:

pip install pikepdf

For users who want to build from source, see installation.

pikepdf is documented and actively maintained. Commercial support is available. We support just about everything on x86-64, including PyPy, and support Apple Silicon on a best-effort basis.

Features

This library is similar to PyPDF2 and pdfrw: it provides low-level access to PDF features and allows editing and content transformation of existing PDFs. Some knowledge of the PDF specification may be helpful. It cannot render a PDF to an image.

Feature | pikepdf | PyPDF2 | pdfrw
Editing, manipulation and transformation of existing PDFs
Based on an existing, mature PDF library | QPDF
Implementation | C++ and Python | Python | Python
PDF versions supported | 1.1 to 1.7 | 1.3? | 1.7
Python versions supported | 3.7-3.10 [1] | 2.6-3.6 | 2.6-3.6
Save and load password protected (encrypted) PDFs | (except public key) | ✘ (Only obsolete RC4) | ✘ (not at all)
Save and load PDF compressed object streams (PDF 1.5)
Creates linearized ("fast web view") PDFs
Actively maintained | pikepdf commit activity | PyPDF2 commit activity | pdfrw commit activity
Test suite coverage | codecov | very low | unknown
Creates PDFs that pass PDF validation tests ?
Modifies PDF/A without breaking PDF/A compliance ?
Automatically repairs PDFs with internal errors
PDF XMP metadata editing read-only
Documentation basic
Integrates with Jupyter and IPython notebooks for rapid development

Testimonials

I decided to try writing a quick Python program with pikepdf to automate [something] and it "just worked". –Jay Berkenbilt, creator of QPDF

"Thanks for creating a great pdf library, I tested out several and this is the one that was best able to work with whatever I threw at it." –@cfcurtis

In Production

  • OCRmyPDF uses pikepdf to graft OCR text layers onto existing PDFs, to examine the contents of input PDFs, and to optimize PDFs.

  • pdfarranger is a small Python application that provides a graphical user interface to rotate, crop and rearrange PDFs.

  • PDFStitcher is a utility for stitching PDF pages into a single document (i.e. N-up or page imposition).

License

pikepdf is provided under the Mozilla Public License 2.0 (MPL), which can be found in the LICENSE file. By using, distributing, or contributing to this project, you agree to the terms and conditions of this license.

Informally, MPL 2.0 is not a "viral" license. It may be combined with other work, including commercial software. However, you must disclose your modifications to pikepdf in source code form. In other words, fork this repository on GitHub or elsewhere and commit your contributions there, and you've satisfied your obligations. MPL 2.0 is compatible with the GPL and LGPL; see the guidelines for notes on use in GPL.

The debian/copyright file describes licensing terms for the test suite and the provenance of test resources.

Footnotes

  1. pikepdf 3.x and older support Python 3.6.

Comments
  • pikepdf will have failed test with qpdf 10.6 but can be fixed without breaking compatibility

    When running pikepdf's tests against qpdf 10.6, the following failures occur:

    b = b'\x7f'
    
        @given(binary())
        def test_codec_involution(b):
            # For all binary strings, there is a pdfdoc decoding. The encoding of that
            # decoding recovers the initial string. (However, not all str have a pdfdoc
            # encoding.)
    >       assert b.decode('pdfdoc').encode('pdfdoc') == b
    E       AssertionError: assert b'\x9f' == b'\x7f'
    E         At index 0 diff: b'\x9f' != b'\x7f'
    E         Use -v to get the full diff
    

    and

    s = '\x1f'
    
        @given(text())
        def test_break_encode(s):
            try:
                encoded_bytes = s.encode('pdfdoc')
            except ValueError as e:
                allowed_errors = [
                    "'pdfdoc' codec can't encode character",
                    "'pdfdoc' codec can't process Unicode surrogates",
                    "'pdfdoc' codec can't encode some characters",
                ]
                if any((allowed in str(e)) for allowed in allowed_errors):
                    return
                raise
            else:
    >           assert encoded_bytes.decode('pdfdoc') == s
    E           AssertionError: assert '˜' == '\x1f'
    E             Strings contain only whitespace, escaping them using repr()
    E             - '\x1f'
    E             + '˜'
    
    tests/test_codec.py:52: AssertionError
    

    This is most likely because of qpdf/qpdf#606, which added previously omitted Unicode conversions for PDF Doc Encoding code points 0x18 through 0x1f and 0x7f. If you want to test mapping to an invalid code point, you can pick something lower than 0x18. That should map to the invalid character. Anyway, I'm not sure what the correct fix is for your test.
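The mismatch described above can be modeled in a few lines. The sketch below is a toy stand-in, not qpdf's or pikepdf's actual codec. It assumes that bytes 0x18 through 0x1f decode to the spacing accents that PDFDocEncoding assigns to them, and that a lenient encoder still passes U+001F through unchanged as byte 0x1f. The names `decode`, `encode_lenient`, and `encode_strict` are hypothetical, and 0x7f is left out for brevity.

```python
# Toy model of the round-trip problem; not the real 'pdfdoc' codec.
# Assumption: bytes 0x18-0x1f decode to these spacing accents (PDFDocEncoding).
SPECIAL = {
    0x18: "\u02d8",  # breve
    0x19: "\u02c7",  # caron
    0x1A: "\u02c6",  # modifier circumflex
    0x1B: "\u02d9",  # dot above
    0x1C: "\u02dd",  # double acute
    0x1D: "\u02db",  # ogonek
    0x1E: "\u02da",  # ring above
    0x1F: "\u02dc",  # small tilde
}
REVERSE = {char: byte for byte, char in SPECIAL.items()}


def decode(data: bytes) -> str:
    """Decode 7-bit input, mapping 0x18-0x1f to their accent characters."""
    return "".join(SPECIAL.get(byte, chr(byte)) for byte in data)


def encode_lenient(text: str) -> bytes:
    """Encode, letting control characters U+0018-U+001F pass through as-is."""
    out = bytearray()
    for char in text:
        if char in REVERSE:
            out.append(REVERSE[char])
        elif ord(char) < 0x80:
            out.append(ord(char))
        else:
            raise ValueError(f"toy codec can't encode character {char!r}")
    return bytes(out)


def encode_strict(text: str) -> bytes:
    """Encode, rejecting control characters whose byte now decodes differently."""
    for char in text:
        if ord(char) in SPECIAL:
            raise ValueError(f"toy codec can't encode character {char!r}")
    return encode_lenient(text)


# The lenient encoder reproduces the shape of the reported failure:
# U+001F goes out as byte 0x1f and comes back as the small tilde.
lenient_round_trip = decode(encode_lenient("\x1f"))
print(repr(lenient_round_trip))
```

In this model the strict encoder raises ValueError for '\x1f', the kind of outcome test_break_encode tolerates, while every 7-bit byte still survives decode followed by encode. Whether that matches the fix pikepdf eventually adopted is not stated in the issue.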

    I plan to release qpdf 10.6 most likely tomorrow, February 8. I plan on preparing everything today. Other than version numbers and final release mechanics, qpdf's main is what 10.6 will look like. At this moment, I haven't yet updated configure.ac and libtool versions, but I will be doing that shortly.

    opened by jberkenbilt 40
  • [Feature Request] - document object parsing

    Hi. Parsing a PDF in QDF mode is relatively easy because it's all text. For example, it's easy to identify that this is a data table cell:

    112 0 obj << /K 72 0 R /P 60 0 R /S /TD /A 147 0 R>> endobj

    and it's straightforward to change it to a /TH or to find the /A object to add a /Headers attribute. My question is: is it possible to do this using pikepdf? Could you please provide a code example in the documentation? Thank you very much!

    opened by pfrederiksen40 18
  • pikepdf._qpdf.PdfError on a specific PDF file with pikepdf >= 4.5.0

    Hi, I have a bug with a specific PDF (which I can't publish unfortunately).

    The error is the following:

    Traceback (most recent call last):
      File "/tests/test_pikepdf.py", line 10, in test_pikepdf
        pdf.save("/tmp/toto.pdf")
      File "/usr/local/lib/python3.8/site-packages/pikepdf/_methods.py", line 774, in save
        self._save(
    pikepdf._qpdf.PdfError: operation for dictionary attempted on object of type null: returning null for attempted key retrieval
    

    It happens with this python code, on the save statement:

    import pikepdf
    from pathlib import Path
    fpath = Path("tests/fixtures/scanned_rotation.pdf")
    pdf = pikepdf.Pdf.open(fpath)
    pdf.save("/tmp/toto.pdf")
    

    This is my Dockerfile to reproduce the problem:

    FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8
    RUN apt-get update -q -y && apt-get upgrade -q -y
    RUN python -m pip install --upgrade pip
    RUN pip install --no-cache-dir install pikepdf
    COPY app /app
    COPY tests /tests
    WORKDIR /app
    

    It worked until pikepdf version 4.4.1 but from 4.5.0 to current version (5.0.1) it raises this pikepdf._qpdf.PdfError exception.

    Thank you for your help and support

    opened by lsamper 17
  • compatibly support build with qpdf current main (cmake, PointerHolder)

    qpdf's build is switching from autoconf and libdir to cmake. These changes switch the QPDF_SOURCE_TREE bit of pikepdf so that you can test against a qpdf source tree after the cmake transition is complete. I have tested these changes with my cmake branch.

    The changes to setup.py are made such that they are backward compatible with a pre-cmake build. The documentation changes are not. (By the way, there was a typo in the old docs -- they mentioned .build/libs in one place instead of build/.libs.)

    ~~Please #316 for a separate PR that includes only the setup.py change in case you wanted to merge that earlier. I haven't looked in pikepdf for other references to any qpdf build, but the change to setup.py is sufficient to allow me to continue to test pikepdf against qpdf as I move forward post-cmake.~~

    The setup.py change could be accepted at any time. The documentation change may be premature to accept since it won't be accurate until a version of qpdf that uses cmake is released. I am still deciding whether to release qpdf 10.6.x with cmake (the cmake build is fully binary compatible with the autoconf build) or whether to wait.

    I will be making some kind of announcement to qpdf-announce about testing the cmake build soon.

    opened by jberkenbilt 16
  • Need an example of apply watermark

    Hi,

    I need an example of applying a watermark.

    I tested the following code, but it doesn't work as I expected: the watermark is not in the right location.

    import pikepdf  # needed for the pikepdf.open() and pikepdf.new() calls below
    from reportlab.pdfgen import canvas
    from pikepdf import Array, Dictionary, Name, Pdf, PdfMatrix, Stream
    
    
    INPUT_PDF="./test3.pdf"
    WATERMARK_PDF='./test4.pdf'
    OUTPUT_PDF='./test3_test4.pdf'
    
    def generate_watermark(msg,fileName,x=55,y=220):
        c = canvas.Canvas(fileName, bottomup=0)
        c.setFontSize(32)
        c.setFillColorCMYK(0, 0, 0, 0, alpha=0.7)
        c.rect(204, 199, 157, 15, stroke=0, fill=1)
        c.setFillColorCMYK(0, 0, 0, 100, alpha=0.7)
        c.drawString(x, y,msg )
        c.save()
       
    # generate two pdfs
    generate_watermark('file3',INPUT_PDF,100,100)
    generate_watermark('file4',WATERMARK_PDF)
    
    
    with pikepdf.open(INPUT_PDF) as input_pdf, \
                pikepdf.open(WATERMARK_PDF) as watermark_pdf, \
                open(OUTPUT_PDF, 'wb') as output_stream:
            
            # Create new output PDF
            output_pdf = pikepdf.new()
    
    
            for i in range(len(input_pdf.pages)):
                #load and insert watermark
                input_pdf.pages[i].page_contents_add(watermark_pdf.pages[0].Contents)
                input_pdf.pages[i].page_contents_coalesce()
    
            output_pdf.pages.extend(input_pdf.pages)
            output_pdf.save(output_stream)  # save to a new file
    
    
    question 
    opened by StevenLOL 15
  • 5.0.1 fails to build wheel

    I'm attempting to build pikepdf 5.0.1 for arm64 and armv7 architectures, for use in Docker.

    When I attempt to build pikepdf, using the following steps, it will fail due to something about the pyproject.toml file.

    Build Steps

    1. git clone --quiet --depth 1 --branch v5.0.1 https://github.com/pikepdf/pikepdf.git
    2. cd pikepdf
    3. mkdir wheels
    4. python3 -m pip wheel . -w wheels

    Error

    configuration error: `project` must contain ['name'] properties
    ...
    ValueError: invalid pyproject.toml config: `project`
    

    Is a specific version of pip, setuptools or wheel required to build?

    opened by stumpylog 14
  • Python library pikepdf not allowing duplicate pages to be inserted in a pdf

    With a given single page input pdf, I tried to create a pdf with two copies of the first page making it a new two page pdf document. I received the following error: PdfError: empty PDF (page 1 (numbered from zero): object 3 0): duplicate page reference found; this would cause loss of data

    Can anyone help with this issue? Thanks!

    opened by a-kaly 13
  • macOS 10.13 High Sierra - ImportError: pikepdf's extension library failed to import

    While using pikepdf in a Django project, it throws:

    ImportError: pikepdf's extension library failed to import. Any ideas about what I am missing? I did build the qpdf repo as well.

    opened by shashitechno 12
  • Moving py::gil_scoped_release after make_unique InputSource calls.

    Fixing GIL-not-held issues.

    The Python GIL is not held when Py_INCREF is invoked (via pybind11) in these two lines:

    • https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/mmap_inputsource-inl.h#L45 when called from https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/qpdf.cpp#L103
    • https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/qpdf_inputsource-inl.h#L31 when called from https://github.com/rwgk/pikepdf/blob/63b660b68b52a0f99ced3015fe4acb66fb6ca8d5/src/qpdf/qpdf.cpp#L121

    How was this discovered?

    I'm working on a systematic cleanup of the extended Google codebase, which imports this github project.

    I'm globally testing with this Python patch (Python 3.7):

    +int PyGILState_Check(void); /* Include/internal/pystate.h */
    +
     #define Py_INCREF(op) (                         \
    +    assert(PyGILState_Check()),                 \
         _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
         ((PyObject *)(op))->ob_refcnt++)
    
     #define Py_DECREF(op)                                   \
         do {                                                \
    +        assert(PyGILState_Check());                     \
             PyObject *_py_decref_tmp = (PyObject *)(op);    \
             if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
             --(_py_decref_tmp)->ob_refcnt != 0)             \
    

    The pikepdf tests fail, but are fixed with the change under this PR.

    General background: The GIL must be held when calling any Python C API functions. In multithreaded applications that use callbacks this requirement can easily be violated by accident. A general tool to ensure GIL health is not available, but patching Python Py_INCREF & Py_DECREF as above provides a basic health check.

    More background for easy reference: https://docs.python.org/3/glossary.html#term-global-interpreter-lock

    Purely FYI: this was another issue uncovered with the patch above: https://reviews.llvm.org/D114722

    opened by rwgk 11
  • Fix jbig2dec not working on Windows due to exclusive-locked temporary files.

    This patch works around a Python bug that renders jbig2dec unusable on Windows platforms.

    This is due to temporary files being locked in exclusive mode. See here for more details:

    https://bugs.python.org/issue14243

    Example of error:

    subprocess.CalledProcessError: Command '['jbig2dec', '-e', '-o', 'E:\Users\kraptor\AppData\Local\Temp\tmpcxmi9rk9', 'E:\Users\kraptor\AppData\Local\Temp\tmpf200vgs_']' returned non-zero exit status 1.
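A common way to avoid this class of failure is sketched below. It is not necessarily what the patch itself does. On Windows, a file created with tempfile.NamedTemporaryFile cannot be opened a second time while it is still open, so the sketch writes the input into a temporary directory, closes it, and only then starts the external tool. The helper name `run_on_temp_files` is hypothetical, and a small Python copy command stands in for jbig2dec so the example runs without it installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_on_temp_files(data: bytes, command_prefix: list[str]) -> bytes:
    """Run `command_prefix <output> <input>` on closed files in a temp directory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        in_path = Path(tmpdir) / "input.bin"
        out_path = Path(tmpdir) / "output.bin"
        # write_bytes() opens, writes and closes the file, so no handle is
        # held by this process when the child process opens it.
        in_path.write_bytes(data)
        subprocess.run([*command_prefix, str(out_path), str(in_path)], check=True)
        return out_path.read_bytes()


# Stand-in for ['jbig2dec', '-e', '-o']: copies the input file to the output.
copy_command = [
    sys.executable,
    "-c",
    "import shutil, sys; shutil.copyfile(sys.argv[2], sys.argv[1])",
]
result = run_on_temp_files(b"demo payload", copy_command)
print(result)
```

The directory and both files are removed when the `with` block exits, so the cleanup that NamedTemporaryFile normally provides is kept.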

    opened by kraptor 11
  • jbig2dec error

    Hi, I installed this from anaconda (using windows 10, so don't know if it works necessarily, anyone else?)

    and got this after I put an Exception handler in to tell me more :

    (desurvey) C:\Users\rscott\OneDrive - OZ Minerals\Exploration\Research\Python>python pdftestpike5.py
      0%|                                                                                  | 1/739 [00:00<06:57,  1.77it/s]got jbig2dec ok
    error opening C:\Users\rscott\AppData\Local\Temp\tmpjgzipp79
    Command '['jbig2dec', '-e', '-o', 'C:\\Users\\rscott\\AppData\\Local\\Temp\\tmp70ma71kl', 'C:\\Users\\rscott\\AppData\\Local\\Temp\\tmpjgzipp79']' returned non-zero exit status 1.
      0%|                                                                                  | 1/739 [00:00<07:48,  1.58it/s]
    Traceback (most recent call last):
      File "pdftestpike5.py", line 88, in <module>
        out = image.extract_to(fileprefix=f"{fileout}-page{i:03}-img{j:03}")
      File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\desurvey\lib\site-packages\pikepdf\models\image.py", line 568, in extract_to
        extension = self._extract_to_stream(stream=bio)
      File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\desurvey\lib\site-packages\pikepdf\models\image.py", line 525, in _extract_to_stream
        raise UnsupportedImageTypeError(repr(self))
    pikepdf.models.image.UnsupportedImageTypeError: <pikepdf.PdfImage image mode=1 size=3906x2530 at 0x1d4ed183d30>
    

    hopefully will get to try it on linux later

    bug 
    opened by RichardScottOZ 11
  • Cannot install pikepdf on Python 3.11.1 on Windows

    @jbarlow83 I documented the error trying to install pikepdf on Windows Server 2019 using python 3.11.1 in this thread - https://github.com/ocrmypdf/OCRmyPDF/issues/460#issuecomment-1366127238 - but wanted to create an issue here so it wasn't lost. Have the wheels been released for 3.11 so pikepdf can be updated?

    python -m pip install pikepdf
    Collecting pikepdf
      Using cached pikepdf-6.2.6.tar.gz (2.9 MB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Collecting Pillow>=9.0
      Using cached Pillow-9.3.0-cp311-cp311-win32.whl (2.2 MB)
    Collecting deprecation
      Using cached deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
    Requirement already satisfied: lxml>=4.8 in c:\python311-32\lib\site-packages (from pikepdf) (4.9.2)
    Requirement already satisfied: packaging in c:\python311-32\lib\site-packages (from pikepdf) (22.0)
    Building wheels for collected packages: pikepdf
      Building wheel for pikepdf (pyproject.toml) ... error
      error: subprocess-exited-with-error
    
      × Building wheel for pikepdf (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [52 lines of output]
          C:\Users\MMEadmin\AppData\Local\Temp\2\pip-build-env-f01r4mhj\overlay\Lib\site-packages\setuptools_scm\git.py:295: UserWarning: git archive did not support describe output
            warnings.warn("git archive did not support describe output")
          C:\Users\MMEadmin\AppData\Local\Temp\2\pip-build-env-f01r4mhj\overlay\Lib\site-packages\setuptools_scm\git.py:312: UserWarning: unexported git archival found
            warnings.warn("unexported git archival found")
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build\lib.win32-cpython-311
          creating build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\codec.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\jbig2.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\objects.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\settings.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\_augments.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\_cpphelpers.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\_exceptions.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\_methods.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\_version.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\_xml.py -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\__init__.py -> build\lib.win32-cpython-311\pikepdf
          creating build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\encryption.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\image.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\matrix.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\metadata.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\outlines.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\_content_stream.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\_transcoding.py -> build\lib.win32-cpython-311\pikepdf\models
          copying src\pikepdf\models\__init__.py -> build\lib.win32-cpython-311\pikepdf\models
          running egg_info
          writing src\pikepdf.egg-info\PKG-INFO
          writing dependency_links to src\pikepdf.egg-info\dependency_links.txt
          writing requirements to src\pikepdf.egg-info\requires.txt
          writing top-level names to src\pikepdf.egg-info\top_level.txt
          listing git files failed - pretending there aren't any
          reading manifest file 'src\pikepdf.egg-info\SOURCES.txt'
          adding license file 'LICENSE.txt'
          adding license file 'licenses-for-wheels.txt'
          writing manifest file 'src\pikepdf.egg-info\SOURCES.txt'
          copying src\pikepdf\_qpdf.pyi -> build\lib.win32-cpython-311\pikepdf
          copying src\pikepdf\py.typed -> build\lib.win32-cpython-311\pikepdf
          running build_ext
          building 'pikepdf._qpdf' extension
          creating build\temp.win32-cpython-311
          creating build\temp.win32-cpython-311\Release
          creating build\temp.win32-cpython-311\Release\src
          creating build\temp.win32-cpython-311\Release\src\qpdf
          "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\bin\HostX86\x86\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -DPOINTERHOLDER_TRANSITION=4 -IC:\Users\MMEadmin\AppData\Local\Temp\2\pip-build-env-f01r4mhj\overlay\Lib\site-packages\pybind11\include -IC:\Python311-32\include -IC:\Python311-32\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22000.0\\cppwinrt" /EHsc /Tpsrc/qpdf\annotation.cpp /Fobuild\temp.win32-cpython-311\Release\src/qpdf\annotation.obj /EHsc /bigobj /std:c++17
          annotation.cpp
          src/qpdf\annotation.cpp(4): fatal error C1083: Cannot open include file: 'qpdf/Constants.h': No such file or directory
          error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.34.31933\\bin\\HostX86\\x86\\cl.exe' failed with exit code 2
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for pikepdf
    Failed to build pikepdf
    ERROR: Could not build wheels for pikepdf, which is required to install pyproject.toml-based projects
    
    opened by champlin2 3
  • Error on image extraction

    Hi

    I was trying to extract images from a PDF using ext = pdfimage.extract_to(stream=stream)

    On all my PDF files I got an exception on line 485 of image.py: self.filter_decodeparms[0][0] was '/DCTDecode' but self.filter_decodeparms[0][1] was None.

    Your original code tries to get the dictionary value and defaults to 1. I propose also defaulting to 1 when the dictionary itself is None.

    After the fix, the images were saved correctly.

    Original

            def normal_dct_rgb() -> bool:
                # Normal DCTDecode RGB images have the default value of
                # /ColorTransform 1 and are actually in YUV. Such a file can be
                # saved as a standard JPEG. RGB JPEGs without YUV conversion can't
                # be saved as JPEGs, and are probably bugs. Some software in the
                # wild actually produces RGB JPEGs in PDFs (probably a bug).
                DEFAULT_CT_RGB = 1
    
                ct = self.filter_decodeparms[0][1].get('/ColorTransform', DEFAULT_CT_RGB)
                # ERROR self.filter_decodeparms[0][1] is None
    
                return self.mode == 'RGB' and ct == DEFAULT_CT_RGB
    

    Fix

            def normal_dct_rgb() -> bool:
                # Normal DCTDecode RGB images have the default value of
                # /ColorTransform 1 and are actually in YUV. Such a file can be
                # saved as a standard JPEG. RGB JPEGs without YUV conversion can't
                # be saved as JPEGs, and are probably bugs. Some software in the
                # wild actually produces RGB JPEGs in PDFs (probably a bug).
                DEFAULT_CT_RGB = 1
                ct = DEFAULT_CT_RGB
                if self.filter_decodeparms[0] is not None and self.filter_decodeparms[0][1] is not None:
                    ct = self.filter_decodeparms[0][1].get('/ColorTransform', DEFAULT_CT_RGB)
                return self.mode == 'RGB' and ct == DEFAULT_CT_RGB
    
    opened by neojg 5
  • Error: read_bytes called on unfilterable stream for a simple PDF

    Running the following simple code is returning the error. Code:

    from pikepdf import Pdf, PdfImage

    filename = "/home/user/hobbiate/finlens/src/finlensapp/tests/data/ocr/bronze/20 image files/MyFile.pdf"
    pdf = Pdf.open(filename)
    for page in pdf.pages:
        orig_keys = list(page.images.keys())
        for index, key in enumerate(orig_keys):
            print(f"Processing key {key}")
            try:
                rawimage = page.images[key]
                pdfimage = PdfImage(rawimage)
                img = pdfimage.as_pil_image()
            except Exception as e:
                print(e)

    Error

    (object 6,0, offset 915): read_bytes called on unfilterable stream

    Pip list output:

    Package        Version
    deprecation    2.1.0
    lxml           4.9.1
    packaging      21.3
    pikepdf        6.2.4
    Pillow         9.3.0
    pip            20.0.2
    pkg-resources  0.0.0
    pyparsing      3.0.9
    setuptools     44.0.0

    Operating System:

    Ubuntu

    opened by yashsemwal 3
  • 'ValueError: buffer is not large enough' on PdfImage().extract_to() on some pngs

    I'm getting

        tmpFileName = pdfImage.extract_to(fileprefix = "tmp")
      File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 668, in extract_to
        extension = self._extract_to_stream(stream=bio)
      File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 611, in _extract_to_stream
        im = self._extract_transcoded()
      File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 581, in _extract_transcoded
        im = self._extract_transcoded_1248bits()
      File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\image.py", line 528, in _extract_transcoded_1248bits
        im = _transcoding.image_from_buffer_and_palette(
      File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\_transcoding.py", line 143, in image_from_buffer_and_palette
        im = image_from_byte_buffer(buffer, size, stride)
      File "C:\ProgramData\Anaconda3\lib\site-packages\pikepdf\models\_transcoding.py", line 107, in image_from_byte_buffer
        return Image.frombuffer('L', size, buffer, "raw", 'L', stride, ystep)
      File "C:\ProgramData\Anaconda3\lib\site-packages\PIL\Image.py", line 2932, in frombuffer
        im = im._new(core.map_buffer(data, size, decoder_name, 0, args))
    ValueError: buffer is not large enough
    

    when trying to extract PNGs from some PDFs. Most PNGs are extracted correctly, but some cause this exception. I tried to debug a bit, but the only thing I noticed is that the "wrong" mode is passed to PIL.Image.frombuffer(). By "wrong" I mean that 'L' is always passed there, even though for the problematic PNG self.mode == 'P'. I don't know whether that is the cause.

    The code I'm using:

    import os
    from pathlib import Path
    from pikepdf import Name, Pdf, PdfImage
    
    files = [f for f in os.listdir('.') if os.path.isfile(f) and str(f).endswith(".pdf")]
    for fileName in files:
        pdfFile = Pdf.open(fileName, allow_overwriting_input = True)
        for page in pdfFile.pages:
            for j, (name, rawImage) in enumerate(page.images.items()):
                pdfImage = PdfImage(rawImage)
                tmpFileName = pdfImage.extract_to(fileprefix = "tmp")
    
     # some unrelated work is done here
    
        pdfFile.save()
        pdfFile.close()
    

    It crashes on element

    1318 0 obj
    << /BitsPerComponent 8 /ColorSpace 636 0 R /Height 302 /Subtype /Image /Width 205 /Length 58912 >>
    

    from the attached PDF, Dyko.pdf.

    opened by AlexMatiash 0
  • Outline link issue when using page_location='XYZ'

    When creating an outline, I use page_location='XYZ' to get the "Zoom level: Inherit Zoom" behavior.

    The following code meets the zoom requirement, but it shifts the destination page by one, so I can never jump to the first page.

    from pikepdf import Pdf, OutlineItem

    path = r'~'

    pdf = Pdf.open(path + "/" + "tt.pdf")

    with pdf.open_outline() as outline:
        outline.root.extend([
            OutlineItem('Section One', 0, page_location="XYZ")
        ])

    pdf.save(path + "/" + "tt1.pdf")
    pdf.close()

    The 'Section One' entry should jump to the first page, but it jumps to the second page instead. If I remove page_location="XYZ" from the code above, the destination is correct, but the 'Inherit Zoom' property is lost.

    Note: Python:3.8.5 pikepdf: 6.2.0

    opened by sldzys 0
PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

null 9 Jan 30, 2022
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

Matthew Stamy 5k Jan 4, 2023
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox module of the Tkinter library.

RISHABH MISHRA 1 Feb 13, 2022
Processes a PDF to make it compatible with PDF/X and with grayscale printers.

tratapdf Processes a PDF to make it compatible with PDF/X and with grayscale printers. Dependencies: icc-profiles, ghostscript, PDF viewer

null 1 Nov 30, 2021
Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

Hemachandran P 1 Nov 9, 2021
Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 2, 2022
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

Nemo 387 Dec 10, 2022
Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

Jonas Lejon 1.9k Jan 1, 2023
Merge multiple PDF files into one.

PDF Merger Merge multiple PDF files into one. Usage % python pdf_merger.py -h usage: pdf_merger.py [-h] [-o OUTPUT] [-f [FILES ...]] optional argumen

Duo Apps 6 Oct 3, 2022
Program that locks/unlocks pdf files🐍

PDFtools Program that locks/unlocks PDF files. Requirements • How to use • Screenshots. Warning: Change the paths relating

João Victor Vilela dos Santos 1 Nov 4, 2021
Telegram bot that can do a lot of things related to PDF files.

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

null 130 Dec 26, 2022
Convert MD files to PDF automatically (with CSS) 📄🚀

MD2PDF Action Convert MD files to PDF automatically (with CSS)! Converts a pattern described set of markdown files and converts them to pdf whilst app

Will Fantom 1 Feb 9, 2022
This book will take you on an exploratory journey through the PDF format, and the borb Python library.

Joris Schellekens 281 Jan 1, 2023
A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 7, 2022
x-ray is a Python library for finding bad redactions in PDF documents.

A tool to detect whether a PDF has a bad redaction

Free Law Project 73 Dec 19, 2022
Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

null 1.8k Dec 29, 2022
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

Yusuke Shinyama 4.9k Jan 4, 2023