Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Overview

PDFMiner

PDFMiner is a text extraction tool for PDF documents.

Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.

Features:

  • Pure Python (3.6 or above).
  • Supports PDF-1.7. (well, almost)
  • Obtains the exact location of text as well as other layout information (fonts, etc.).
  • Performs automatic layout analysis.
  • Can convert PDF into other formats (HTML/XML).
  • Can extract an outline (TOC).
  • Can extract tagged contents.
  • Supports basic encryption (RC4 and AES).
  • Supports various font types (Type1, TrueType, Type3, and CID).
  • Supports CJK languages and vertical writing scripts.
  • Has an extensible PDF parser that can be used for other purposes.

How to Use:

  1. > pip install pdfminer
  2. > pdf2txt.py samples/simple1.pdf

Command Line Syntax:

pdf2txt.py

pdf2txt.py extracts all text that is rendered programmatically, along with the corresponding location, font name, font size, and writing direction (horizontal or vertical) of each text segment. It does not recognize text in images. Restricted PDF documents require a password.

> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
             [-O output_dir] [-c encoding] [-s scale] [-R rotation]
             [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
             [-S] [-C] [-n] [-A] [-V]
             [-M char_margin] [-L line_margin] [-W word_margin]
             [-F boxes_flow] [-d]
             input.pdf ...
  • -P password : PDF password.
  • -o output : Output file name.
  • -t text|html|xml|tag : Output type. (default: automatically inferred from the output file name.)
  • -O output_dir : Output directory for extracted images.
  • -c encoding : Output encoding. (default: utf-8)
  • -s scale : Output scale.
  • -R rotation : Rotates the page by the given angle in degrees.
  • -Y normal|loose|exact : Specifies the layout mode. (only for HTML output.)
  • -p pagenos : Processes certain pages only.
  • -m maxpages : Limits the maximum number of pages to process.
  • -S : Strips control characters.
  • -C : Disables resource caching.
  • -n : Disables layout analysis.
  • -A : Applies layout analysis for all texts including figures.
  • -V : Automatically detects vertical writing.
  • -M char_margin : Specifies the char margin.
  • -W word_margin : Specifies the word margin.
  • -L line_margin : Specifies the line margin.
  • -F boxes_flow : Specifies the box flow ratio.
  • -d : Turns on Debug output.
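
The options above can also be assembled from Python, for example when converting many files in a batch. The following is a minimal sketch and not part of PDFMiner: build_pdf2txt_command and run_pdf2txt are hypothetical helpers, they use only flags documented above (-P, -o, -t, -m, -n), and they assume that pdf2txt.py is on the PATH.

```python
import subprocess


def build_pdf2txt_command(input_pdf, output=None, output_type=None,
                          password=None, maxpages=None, layout=True):
    """Build an argument list for pdf2txt.py (hypothetical helper)."""
    cmd = ["pdf2txt.py"]
    if password is not None:
        cmd += ["-P", password]
    if output is not None:
        cmd += ["-o", output]
    if output_type is not None:
        # Only the output types listed in the syntax summary are accepted.
        if output_type not in ("text", "html", "xml", "tag"):
            raise ValueError(f"unsupported output type: {output_type}")
        cmd += ["-t", output_type]
    if maxpages is not None:
        cmd += ["-m", str(maxpages)]
    if not layout:
        cmd.append("-n")  # -n disables layout analysis
    cmd.append(input_pdf)
    return cmd


def run_pdf2txt(input_pdf, **options):
    """Run pdf2txt.py and return whatever it wrote to standard output."""
    cmd = build_pdf2txt_command(input_pdf, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For instance, run_pdf2txt("samples/simple1.pdf", output_type="text") would run the same conversion as step 2 of "How to Use", with the output type stated explicitly.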

dumppdf.py

dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.

> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
             [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
             input.pdf ...
  • -P password : PDF password.
  • -a : Extracts all objects.
  • -p pageid : Extracts a Page object.
  • -i objid : Extracts a certain object.
  • -o output : Output file name.
  • -r : Raw mode. Dumps the raw compressed/encoded streams.
  • -b : Binary mode. Dumps the uncompressed/decoded streams.
  • -t : Text mode. Dumps the streams in text format.
  • -T : Tagged mode. Dumps the tagged contents.
  • -O output_dir : Output directory for extracted streams.

TODO

  • Replace STRICT variable with something better.
  • Improve the debugging functions.
  • Use logging module instead of sys.stderr.
  • Proper test cases.
  • PEP-8 and PEP-257 conformance.
  • Better documentation.
  • Crypto stream filter support.

Related Projects

Comments
  • AttributeError: 'FileUnicodeMap' object has no attribute 'add_code2cid'

    I am getting an exception when I try to process_page on the following PDF: https://www.docketalarm.com/cases/PTAB/IPR2014-00396/Inter_Partes_Review_of_U.S._Pat._7310111/docs/02-20-2014-POR-1773/Power_of_Attorney-5-Power_of_Attorney.pdf

    The PDF is digitally signed and I bet that has something to do with it. I don't understand digital signatures or this code well enough to debug it. If you can spot the issue quickly, that would be great.

    Stack trace:

     File "..\libs\pdfminer\pdfinterp.py", line 757, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "..\libs\pdfminer\pdfinterp.py", line 770, in render_contents
        self.execute(list_value(streams))
      File "..\libs\pdfminer\pdfinterp.py", line 795, in execute
        func(*args)
      File "..\libs\pdfminer\pdfinterp.py", line 733, in do_Do
        interpreter.render_contents(resources, [xobj], ctm=mult_matrix(matrix, self.ctm))
      File "..\libs\pdfminer\pdfinterp.py", line 768, in render_contents
        self.init_resources(resources)
      File "..\libs\pdfminer\pdfinterp.py", line 339, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "..\libs\pdfminer\pdfinterp.py", line 193, in get_font
        font = self.get_font(None, subspec)
      File "..\libs\pdfminer\pdfinterp.py", line 184, in get_font
        font = PDFCIDFont(self, spec)
      File "..\libs\pdfminer\pdffont.py", line 637, in __init__
        CMapParser(self.unicode_map, StringIO(strm.get_data())).run()
      File "..\libs\pdfminer\cmapdb.py", line 292, in run
        self.nextobject()
      File "..\libs\pdfminer\psparser.py", line 584, in nextobject
        self.do_keyword(pos, token)
      File "..\libs\pdfminer\cmapdb.py", line 354, in do_keyword
        self.cmap.add_code2cid(x, cid+i)
    AttributeError: 'FileUnicodeMap' object has no attribute 'add_code2cid'
    
    opened by speedplane 6
  • License file?

    It would be great if this included a license file. According to the MIT license terms, one is supposed to include the license file with the software. However, that is a bit challenging to do without actually having the license file somewhere. Would it be possible to add it to this repo?

    cc @pmlandwehr

    opened by jakirkham 5
  • pdfminer failed to get all text of the pdf file

    Dear sir: I use pdf2txt to extract the text of a PDF file, but it only extracts part of the text. I don't know how to solve this problem. Could you please give me some advice? Thanks a lot.

    opened by BigPandaCPU 4
  • Question: Can pdfminer retrieve text & bboxes without layout?

    Is it possible to just retrieve all the text on the page with each fragment returned with its bounding box, i.e., (x1, y1, x2, y2, text) -- with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis.

    opened by mark-summerfield 4
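
One possible approach to the question above, offered as a sketch rather than a confirmed answer: build the aggregator with laparams=None, which is assumed here to skip layout analysis so that each page holds raw character objects, and then walk the page tree. iter_fragments and extract_fragments are hypothetical helper names. iter_fragments relies only on duck typing (iteration, bbox, get_text), and the fragments it yields are individual characters, not words or lines.

```python
def iter_fragments(item):
    """Yield (x0, y0, x1, y1, text) for every text-bearing leaf under item."""
    if hasattr(item, "__iter__"):
        # Containers (pages, figures): descend into their children.
        for child in item:
            yield from iter_fragments(child)
    elif hasattr(item, "get_text") and hasattr(item, "bbox"):
        # Leaves that carry both text and a bounding box (e.g. characters).
        x0, y0, x1, y1 = item.bbox
        yield (x0, y0, x1, y1, item.get_text())


def extract_fragments(pdf_path):
    """Yield one list of fragments per page, with layout analysis disabled."""
    # Imported lazily so iter_fragments can be used without PDFMiner installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # Assumption: laparams=None leaves the page un-analysed (raw characters).
    device = PDFPageAggregator(rsrcmgr, laparams=None)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fh:
        for page in PDFPage.get_pages(fh):
            interpreter.process_page(page)
            yield list(iter_fragments(device.get_result()))
```

Grouping the characters into words, lines, or boxes is then left entirely to the caller's own layout analysis.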
  • Wrong Conversion pdf2text for PDF generated by Google Docs

    Please check the following files to see the bug. PDF: https://www.dropbox.com/s/arpkkzvi9e7evfc/Untitleddocument2.pdf Text: https://www.dropbox.com/s/g4jq9t7taahdgce/googledocs2.txt I used the command pdf2txt.py Untitleddocument2.pdf > googledocs2.txt to convert a PDF document generated by Google Docs, and the resulting text file shows bad content.

    opened by hugo53 4
  • Latest build from PyPI doesn't have process_pdf

    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    Traceback (most recent call last):
      File "", line 1, in
    ImportError: cannot import name process_pdf

    This worked in the version prior to the one uploaded on 2013-11-13

    opened by cglewis 4
  • Transfer ownership of project

    This repo has been idle for more than a year, despite many community members' interest.

    If you're not interested in maintaining the project, transfer it to someone else who is interested.

    opened by brechin 4
  • Removed 341 unnecessary empty 'return' statements

    Python doesn't require return statements at the end of functions and methods, and I noticed pdfminer had many such unnecessary returns. I went through and removed 341 unnecessary statements. Specifically:

    • Removed all return statements that were the last statement in a function or method.
    • For any return statements that were the only statement in a function, converted them to a pass.

    Overall this makes it a simpler, more readable codebase, and it's much more Pythonic.

    opened by adrianholovaty 3
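
To illustrate the two kinds of change described above, here is a small sketch with hypothetical function names (none of them come from the pdfminer codebase). A bare return at the end of a function is redundant because falling off the end already returns None, and a body that consists only of a bare return behaves the same as one that consists only of pass.

```python
def with_trailing_return(log):
    log.append("done")
    return  # redundant: falling off the end returns None anyway


def without_trailing_return(log):
    log.append("done")


def only_return():
    return  # a body consisting only of a bare return...


def only_pass():
    pass  # ...behaves the same as a body consisting only of pass


# Both variants of each pair produce the same result and side effects.
first_log, second_log = [], []
results = (
    with_trailing_return(first_log),
    without_trailing_return(second_log),
    only_return(),
    only_pass(),
)
```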
  • pdfminer vs PyPDF2 parsing speed

    So I used the pdfminer lib and it's functional, but sadly there is one big problem that makes this lib completely irrelevant for me: it is too slow. Here is an example from http://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/ using this free PDF: https://web.stanford.edu/~jurafsky/slp3/edbook_oct162019.pdf

    import io
     
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfpage import PDFPage
     
    def extract_text_by_page(pdf_path):
        with open(pdf_path, 'rb') as fh:
            for page in PDFPage.get_pages(fh, 
                                          caching=True,
                                          check_extractable=True):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(resource_manager, fake_file_handle)
                page_interpreter = PDFPageInterpreter(resource_manager, converter)
                page_interpreter.process_page(page)
     
                text = fake_file_handle.getvalue()
                yield text
     
                # close open handles
                converter.close()
                fake_file_handle.close()
     
    def extract_text(pdf_path):
        for page in extract_text_by_page(pdf_path):
            print(page)
            print()
     
    if __name__ == '__main__':
        extract_text('edbook_oct162019.pdf')
    

    This script takes about 54.8 s to parse one document, while the same implementation with PyPDF2 takes just 11.3 s.

    I am planning to parse 1,000 to 10,000 PDFs, and PyPDF2 seems to be about 5 times faster, so it's the obvious choice here.

    Can you elaborate on this?

    opened by TobiasJu 2
  • ERROR: Could not find a version that satisfies the requirement pycryptdome

    I am getting this error,

    ERROR: Could not find a version that satisfies the requirement pycryptdome (from PDFMiner->-r requirements.txt (line 38)) (from versions: none)
    ERROR: No matching distribution found for pycryptdome (from PDFMiner->-r requirements.txt (line 38))

    opened by ishtiyaq 2
  • Is it possible to extract the hyperlinks?

    The PyPDF2 package can read hyperlinks from PDF files.

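    from functools import reduce  # reduce is not a builtin in Python 3

    from PyPDF2 import PdfFileReader

    # `file` is the path to the PDF being inspected
    doc = PdfFileReader(open(file, "rb"))
    # Collect the annotation array of every page, then flatten them into one list
    annots = [page.get('/Annots', []) for page in doc.pages]
    annots = reduce(lambda x, y: x + y, annots, [])
    links = [note.get('/A', {}).get('/URI') for note in annots]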
    

    However, PyPDF2 does not do a good job of extracting text. Ideally, I want to extract the hyperlinks together with their corresponding text.

    opened by badbye 2
  • update fmttype as 6

    Without handling fmttype 6, pdfminer was unable to read the file. It threw an exception at assert False, str(('Unhandled', 6)).

    Please accept this change to avoid issues with formats that use such values.

    opened by GoelPri 1
  • Unable to decode PDFobjRef in metadata of the PDF file

    Hi, I am facing this error, and unfortunately I can't modify the PDF file, so I need to handle it programmatically. Could you guide me if you have resolved it? My metadata has this as a field value: {'q': PDFObjRef:65, 'Q': PDFObjRef:64}. After I resolve it, it converts to {'q': <PDFStream(65): raw=3, {'Length': 3}>, 'Q': <PDFStream(64): raw=3, {'Length': 3}>}. I am not sure how to proceed from here.

    opened by reema-dass26 0
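
A possible next step for the situation above, sketched under assumptions and not taken from the thread: once a reference resolves to a stream, the stream's decoded bytes can be read out. The hypothetical helper to_plain below is duck-typed. It assumes that references expose resolve() and that streams expose get_data(), as pdfminer's PDFObjRef and PDFStream do, and it recurses through dictionaries and lists.

```python
def to_plain(value):
    """Recursively turn references and streams into plain Python values."""
    # Follow indirect references (objects exposing resolve(), like PDFObjRef).
    while hasattr(value, "resolve"):
        value = value.resolve()
    # Streams (objects exposing get_data(), like PDFStream) become bytes.
    if hasattr(value, "get_data"):
        return value.get_data()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value
```

Applied to a metadata dictionary like the one quoted above, this would replace each PDFStream with its 3 bytes of content, which can then be decoded or compared as needed.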
  • Error thrown when using the -O output parameter

    Python 3.7, installed with pip install pdfminer. Then:

    python tools/pdf2txt -O output input.pdf

    File "/usr/local/anaconda3/envs/py37/lib/python3.7/site-packages/pdfminer/image.py", line 74, in export_image
        if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
    TypeError: object of type 'zip' has no len()
    
    opened by yangboz 0
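
The traceback above suggests that filters is a zip object when it reaches export_image. In Python 3, zip returns a lazy iterator that supports neither len() nor indexing, and materialising it with list() restores both. The sketch below reproduces the error in isolation. The variable names and sample values are illustrative assumptions and are not taken from pdfminer's source.

```python
# Hypothetical stand-ins for a stream's filter names and their parameters.
filter_names = ["DCTDecode"]
filter_params = [{}]

filters = zip(filter_names, filter_params)  # lazy iterator in Python 3
try:
    len(filters)
    error = None
except TypeError as exc:
    error = str(exc)  # the same message as in the traceback above

# Materialising the pairs into a list makes len() and indexing work again.
filters = list(zip(filter_names, filter_params))
single_dct = len(filters) == 1 and filters[0][0] == "DCTDecode"
```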
  • Unable to differentiate between newline and wrapped text for a table in pdf

    There are two different hashes in the attached PDF file, but when parsing, PDFMiner separates both real line breaks and wrapped hash text with '\n', which makes it difficult to extract the hashes from the file.

    opened by haritas-crest 1
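
One workaround for the problem above is to post-process the extracted text, provided the hashes have a known fixed length. The sketch below is a heuristic under that assumption and not a PDFMiner feature: rejoin_wrapped_hashes is a hypothetical helper, and the default digest_length=64 assumes hexadecimal SHA-256 digests, which the report does not state.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def rejoin_wrapped_hashes(text, digest_length=64):
    """Merge hex fragments split across lines into digests of a known length."""
    hashes, buffer = [], ""
    for line in text.split("\n"):
        fragment = line.strip()
        if not fragment or not set(fragment) <= HEX_DIGITS:
            buffer = ""  # not part of a hash: discard any partial digest
            continue
        buffer += fragment
        if len(buffer) == digest_length:
            hashes.append(buffer)  # a complete digest has been reassembled
            buffer = ""
        elif len(buffer) > digest_length:
            buffer = ""  # fragments overshoot the expected length: give up
    return hashes
```

Because a wrapped fragment and a line break look identical in the output, the digest length is the only signal used here. Text whose hashes vary in length would need a different rule.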
  • Can't extract text objects

    Hi,

    When using pdfminer.six to extract text elements from a pdf file, I found that it doesn't work in some cases.

    Pdf files: 2022 Mar quarterly report_ Ali.pdf SIA_AR_2021.pdf

    Description:

    • File 1: can't extract text; however, text can be extracted after converting the original PDF file to a printed PDF.
    • File 2: can extract only part of the text.

    Code which is used:

    
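      # Imports needed by the functions below (pdfminer.six)
      from pdfminer.converter import PDFPageAggregator
      from pdfminer.layout import (
          LAParams, LTChar, LTImage, LTTextLineHorizontal, LTTextLineVertical,
      )
      from pdfminer.pdfdocument import PDFDocument
      from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
      from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
      from pdfminer.pdfparser import PDFParser


      def get_page_layout(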
          filename,
          line_overlap=0.5,
          char_margin=1.0,
          line_margin=0.5,
          word_margin=0.1,
          boxes_flow=0.5,
          detect_vertical=True,
          all_texts=True,
      ):
          """Returns a PDFMiner LTPage object and page dimension of a single
          page pdf. To get the definitions of kwargs, see
          https://pdfminersix.rtfd.io/en/latest/reference/composable.html.
          Parameters
          ----------
          filename : string
              Path to pdf file.
          line_overlap : float
          char_margin : float
          line_margin : float
          word_margin : float
          boxes_flow : float
          detect_vertical : bool
          all_texts : bool
          Returns
          -------
          layout : object
              PDFMiner LTPage object.
          dim : tuple
              Dimension of pdf page in the form (width, height).
          """
          with open(filename, "rb") as f:
              parser = PDFParser(f)
              document = PDFDocument(parser)
              if not document.is_extractable:
                  raise PDFTextExtractionNotAllowed(
                      f"Text extraction is not allowed: {filename}"
                  )
              laparams = LAParams(
                  line_overlap=line_overlap,
                  char_margin=char_margin,
                  line_margin=line_margin,
                  word_margin=word_margin,
                  boxes_flow=boxes_flow,
                  detect_vertical=detect_vertical,
                  all_texts=all_texts,
              )
              rsrcmgr = PDFResourceManager()
              device = PDFPageAggregator(rsrcmgr, laparams=laparams)
              interpreter = PDFPageInterpreter(rsrcmgr, device)
              for page_num, page in enumerate(PDFPage.create_pages(document)):
                  interpreter.process_page(page)
                  layout = device.get_result()
                  width = layout.bbox[2]
                  height = layout.bbox[3]
                  dim = (width, height)
              return layout, dim
      
      
      def get_text_objects(layout, ltype="char", t=None):
          """Recursively parses pdf layout to get a list of
          PDFMiner text objects.
          Parameters
          ----------
          layout : object
              PDFMiner LTPage object.
          ltype : string
              Specify 'char', 'image', 'horizontal_text', or 'vertical_text'
              to get LTChar, LTImage, LTTextLineHorizontal, or
              LTTextLineVertical objects respectively.
          t : list
          Returns
          -------
          t : list
              List of PDFMiner text objects.
          """
          if ltype == "char":
              LTObject = LTChar
          elif ltype == "image":
              LTObject = LTImage
          elif ltype == "horizontal_text":
              LTObject = LTTextLineHorizontal
          elif ltype == "vertical_text":
              LTObject = LTTextLineVertical
          if t is None:
              t = []
          try:
              for obj in layout._objs:
                  if isinstance(obj, LTObject):
                      t.append(obj)
                  else:
                      t += get_text_objects(obj, ltype=ltype)
          except AttributeError:
              pass
          return t
    
    opened by tuyenta 0
Processes a PDF to make it compatible with PDF/X and with grayscale printers.

tratapdf Processes a PDF to make it compatible with PDF/X and with grayscale printers. Dependencies: icc-profiles, ghostscript, a PDF viewer

null 1 Nov 30, 2021
PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

null 9 Jan 30, 2022
Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

Karthikeyan JC 3 Mar 13, 2022
Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox module of the Tkinter library.

RISHABH MISHRA 1 Feb 13, 2022
DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

Frédéric BISSON 6 Jul 27, 2022
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 7, 2022
A Python tool to generate a static HTML file that represents the internal structure of a PDF file

PDFSyntax A Python tool to generate a static HTML file that represents the internal structure of a PDF file At some point the low-level functions deve

Martin D. 394 Dec 30, 2022
Performing the following operations using python on PDF.

Python PDF Handling Tutorial Python is a highly versatile language with a huge set of libraries. It is a high level language with simple syntax. Pytho

Prajwol Lamichhane 131 Dec 16, 2022
Python script that splits PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 2, 2022
borb is a library for reading, creating and manipulating PDF files in python.

borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 1, 2023
Python lib for Simple PDF text extraction

Python lib for Simple PDF text extraction

Jason Alan Palmer 651 Jan 1, 2023
x-ray is a Python library for finding bad redactions in PDF documents.

A tool to detect whether a PDF has a bad redaction

Free Law Project 73 Dec 19, 2022
This book will take you on an exploratory journey through the PDF format, and the borb Python library.

This book will take you on an exploratory journey through the PDF format, and the borb Python library.

Joris Schellekens 281 Jan 1, 2023
Simple python tool created for downloading PDF.

PDFdownloader Usage Open PDF in full-screen mode Run scan.exe Enter how many pages you want to scan Focus PDF After scanning is done, run merge.exe En

null 5 Oct 27, 2021
A simple pdf size compressing telegram robot written in python.

Pdf Compressor Telegram Bot About: A simple pdf size compressing telegram robot written in python. Mostly useful for digital documentation. Deploy t

Renjith Mangal 22 Oct 28, 2022
Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

Hemachandran P 1 Nov 9, 2021
Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

null 1.8k Dec 29, 2022
pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

null 1.6k Jan 3, 2023