Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Yusuke Shinyama

Last update: Jan 4, 2023


Overview

PDFMiner

PDFMiner is a text extraction tool for PDF documents.

Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.

Features:

Pure Python (3.6 or above).
Supports PDF-1.7. (well, almost)
Obtains the exact location of text as well as other layout information (fonts, etc.).
Performs automatic layout analysis.
Can convert PDF into other formats (HTML/XML).
Can extract an outline (TOC).
Can extract tagged contents.
Supports basic encryption (RC4 and AES).
Supports various font types (Type1, TrueType, Type3, and CID).
Supports CJK languages and vertical writing scripts.
Has an extensible PDF parser that can be used for other purposes.

How to Use:

> pip install pdfminer
> pdf2txt.py samples/simple1.pdf

Command Line Syntax:

pdf2txt.py

pdf2txt.py extracts all text that is rendered programmatically, along with the location, font name, font size, and writing direction (horizontal or vertical) of each text segment. It does not recognize text in images. Restricted PDF documents require a password.

> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
             [-O output_dir] [-c encoding] [-s scale] [-R rotation]
             [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
             [-S] [-C] [-n] [-A] [-V]
             [-M char_margin] [-L line_margin] [-W word_margin]
             [-F boxes_flow] [-d]
             input.pdf ...

-P password : PDF password.
-o output : Output file name.
-t text|html|xml|tag : Output type. (default: automatically inferred from the output file name.)
-O output_dir : Output directory for extracted images.
-c encoding : Output encoding. (default: utf-8)
-s scale : Output scale.
-R rotation : Rotates the page by the given number of degrees.
-Y normal|loose|exact : Specifies the layout mode. (only for HTML output.)
-p pagenos : Processes certain pages only.
-m maxpages : Limits the maximum number of pages to process.
-S : Strips control characters.
-C : Disables resource caching.
-n : Disables layout analysis.
-A : Applies layout analysis for all texts including figures.
-V : Automatically detects vertical writing.
-M char_margin : Specifies the char margin.
-W word_margin : Specifies the word margin.
-L line_margin : Specifies the line margin.
-F boxes_flow : Specifies the box flow ratio.
-d : Turns on Debug output.
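
The options above can also be assembled from Python, for example when converting many files in a batch. The following is a minimal sketch, assuming pdf2txt.py is on the PATH. The helper name build_pdf2txt_command and its parameters are hypothetical; only flags listed above are used, and the pagenos string is passed through unchanged in whatever format pdf2txt.py expects.

```python
from typing import List, Optional

# Output types listed for the -t option above.
OUTPUT_TYPES = {"text", "html", "xml", "tag"}


def build_pdf2txt_command(
    input_pdf: str,
    output: Optional[str] = None,
    output_type: Optional[str] = None,
    pagenos: Optional[str] = None,
    password: Optional[str] = None,
    disable_layout: bool = False,
) -> List[str]:
    """Build an argument list for pdf2txt.py (hypothetical helper)."""
    if output_type is not None and output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type}")
    cmd = ["pdf2txt.py"]
    if password is not None:
        cmd += ["-P", password]
    if output is not None:
        cmd += ["-o", output]
    if output_type is not None:
        cmd += ["-t", output_type]
    if pagenos is not None:
        cmd += ["-p", pagenos]
    if disable_layout:
        cmd.append("-n")  # -n disables layout analysis
    cmd.append(input_pdf)
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True). Passing a list rather than a single shell string avoids quoting problems with file names that contain spaces.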

dumppdf.py

dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.

> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
             [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
             input.pdf ...

-P password : PDF password.
-a : Extracts all objects.
-p pageid : Extracts a Page object.
-i objid : Extracts a certain object.
-o output : Output file name.
-r : Raw mode. Dumps the raw compressed/encoded streams.
-b : Binary mode. Dumps the uncompressed/decoded streams.
-t : Text mode. Dumps the streams in text format.
-T : Tagged mode. Dumps the tagged contents.
-O output_dir : Output directory for extracted streams.

TODO

Replace STRICT variable with something better.
Improve the debugging functions.
Use logging module instead of sys.stderr.
Proper test cases.
PEP-8 and PEP-257 conformance.
Better documentation.
Crypto stream filter support.

Comments

AttributeError: 'FileUnicodeMap' object has no attribute 'add_code2cid'

I am getting an exception when I try to process_page on the following PDF: https://www.docketalarm.com/cases/PTAB/IPR2014-00396/Inter_Partes_Review_of_U.S._Pat._7310111/docs/02-20-2014-POR-1773/Power_of_Attorney-5-Power_of_Attorney.pdf

The PDF is digitally signed and I bet that has something to do with it. I don't understand digital signatures or this code well enough to debug it. If you can spot the issue quickly, that would be great.

Stack trace:

 File "..\libs\pdfminer\pdfinterp.py", line 757, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "..\libs\pdfminer\pdfinterp.py", line 770, in render_contents
    self.execute(list_value(streams))
  File "..\libs\pdfminer\pdfinterp.py", line 795, in execute
    func(*args)
  File "..\libs\pdfminer\pdfinterp.py", line 733, in do_Do
    interpreter.render_contents(resources, [xobj], ctm=mult_matrix(matrix, self.ctm))
  File "..\libs\pdfminer\pdfinterp.py", line 768, in render_contents
    self.init_resources(resources)
  File "..\libs\pdfminer\pdfinterp.py", line 339, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "..\libs\pdfminer\pdfinterp.py", line 193, in get_font
    font = self.get_font(None, subspec)
  File "..\libs\pdfminer\pdfinterp.py", line 184, in get_font
    font = PDFCIDFont(self, spec)
  File "..\libs\pdfminer\pdffont.py", line 637, in __init__
    CMapParser(self.unicode_map, StringIO(strm.get_data())).run()
  File "..\libs\pdfminer\cmapdb.py", line 292, in run
    self.nextobject()
  File "..\libs\pdfminer\psparser.py", line 584, in nextobject
    self.do_keyword(pos, token)
  File "..\libs\pdfminer\cmapdb.py", line 354, in do_keyword
    self.cmap.add_code2cid(x, cid+i)
AttributeError: 'FileUnicodeMap' object has no attribute 'add_code2cid'

opened by speedplane 6

License file?

It would be great if this included a license file. According to the MIT license terms, one is supposed to include the license file with the software. However, that is a bit challenging without actually having the license file somewhere. Could we add it to this repo?

cc @pmlandwehr

opened by jakirkham 5

pdfminer failed to get all text of the PDF file

Dear sir: I used pdf2txt to extract the text of a PDF file, but it returned only part of the text. I don't know how to solve this problem. Could you please give me some advice? Thanks a lot.

opened by BigPandaCPU 4

Question: Can pdfminer retrieve text & bboxes without layout?

Is it possible to just retrieve all the text on the page with each fragment returned with its bounding box, i.e., (x1, y1, x2, y2, text) -- with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis.

opened by mark-summerfield 4
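
One possible approach to the question above, offered as a sketch rather than a confirmed answer from the maintainers: PDFPageAggregator (used elsewhere on this page) can be constructed with laparams=None, in which case no layout analysis is run and the page result contains the individual LTChar objects. The helper names to_fragment and iter_fragments are hypothetical, and the imports assume the module layout used by the other snippets on this page. Each fragment here is a single character, and characters nested inside figures are not visited.

```python
from typing import Iterator, Tuple

Fragment = Tuple[float, float, float, float, str]


def to_fragment(obj) -> Fragment:
    """Convert an object with .bbox and .get_text() into (x0, y0, x1, y1, text)."""
    x0, y0, x1, y1 = obj.bbox
    return (x0, y0, x1, y1, obj.get_text())


def iter_fragments(pdf_path: str) -> Iterator[Fragment]:
    """Yield one fragment per character, without layout analysis."""
    # Imported lazily so to_fragment stays usable without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LTChar
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=None)  # None: skip layout analysis
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fh:
        for page in PDFPage.get_pages(fh):
            interpreter.process_page(page)
            for obj in device.get_result():  # top-level objects of the page
                if isinstance(obj, LTChar):
                    yield to_fragment(obj)
```

Grouping the characters into words or lines is then left to the caller's own layout analysis.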

Wrong conversion by pdf2txt for PDF generated by Google Docs

Please check the following files to see the bug.
PDF: https://www.dropbox.com/s/arpkkzvi9e7evfc/Untitleddocument2.pdf
Text: https://www.dropbox.com/s/g4jq9t7taahdgce/googledocs2.txt

I used the following command to convert a PDF document (generated by the Google Docs service), and the resulting text file shows bad content:

pdf2txt.py Untitleddocument2.pdf > googledocs2.txt

opened by hugo53 4

Latest build from PyPI doesn't have process_pdf

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
Traceback (most recent call last):
  File "", line 1, in
ImportError: cannot import name process_pdf

This worked in the version prior to the one uploaded on 2013-11-13

opened by cglewis 4

Transfer ownership of project

This repo has been idle for more than a year, despite many community members' interest.

If you're not interested in maintaining the project, transfer it to someone else who is interested.

opened by brechin 4

Removed 341 unnecessary empty 'return' statements

Python doesn't require return statements at the end of functions and methods, and I noticed pdfminer had many such unnecessary returns. I went through and removed 341 unnecessary statements. Specifically:

Removed all return statements that were the last statement in a function or method.

For any return statements that were the only statement in a function, converted them to a pass.

Overall this makes it a simpler, more readable codebase, and it's much more Pythonic.

opened by adrianholovaty 3

pdfminer vs PyPDF2 parsing speed

I used the pdfminer library and it is functional, but sadly there is one big problem that makes it completely irrelevant for me: it is too slow. Here is an example from http://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/ using this free PDF: https://web.stanford.edu/~jurafsky/slp3/edbook_oct162019.pdf

import io
 
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
 
def extract_text_by_page(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
 
            text = fake_file_handle.getvalue()
            yield text
 
            # close open handles
            converter.close()
            fake_file_handle.close()
 
def extract_text(pdf_path):
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()
 
if __name__ == '__main__':
    extract_text('edbook_oct162019.pdf')

This script takes about 54.8 s to parse one document, while the same implementation with PyPDF2 takes just 11.3 s.

I am planning to parse 1,000 to 10,000 PDFs, and PyPDF2 seems to be about 5 times faster, so it's the obvious choice here.

Can you elaborate on this?

opened by TobiasJu 2
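
The script above creates a new PDFResourceManager, converter, and interpreter for every page, so resources cached by the resource manager (such as fonts) cannot be shared between pages. The variant below is a sketch that creates them once per document, together with a small timing helper for comparing runs. Whether this closes the gap with PyPDF2 has not been measured here; the helper name timed is hypothetical, and the pdfminer imports mirror the ones in the script above.

```python
import io
import time
from typing import Callable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed(func: Callable[..., T], *args, **kwargs) -> Tuple[T, float]:
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def extract_text_by_page(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page, reusing one resource manager."""
    # Imported lazily so that timed() can be used without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()  # created once per document
    buffer = io.StringIO()
    converter = TextConverter(resource_manager, buffer)
    interpreter = PDFPageInterpreter(resource_manager, converter)
    try:
        with open(pdf_path, "rb") as fh:
            for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
                interpreter.process_page(page)
                yield buffer.getvalue()
                # Reset the buffer so the next page starts empty.
                buffer.seek(0)
                buffer.truncate(0)
    finally:
        converter.close()
```

Because extract_text_by_page is a generator, it has to be consumed inside the timed call, for example: pages, seconds = timed(lambda: list(extract_text_by_page('edbook_oct162019.pdf'))).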

ERROR: Could not find a version that satisfies the requirement pycryptdome

I am getting this error,

ERROR: Could not find a version that satisfies the requirement pycryptdome (from PDFMiner->-r requirements.txt (line 38)) (from versions: none)
ERROR: No matching distribution found for pycryptdome (from PDFMiner->-r requirements.txt (line 38))

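opened by ishtiyaq 2

Is it possible to extract the hyperlinks?

The PyPDF2 package can read hyperlinks from PDF files.

# reduce is not a builtin in Python 3, so it has to be imported
from functools import reduce

from PyPDF2 import PdfFileReader

doc = PdfFileReader(open(file, "rb"))
# Collect the annotation arrays of all pages, then flatten them into one list
annots = [page.get('/Annots', []) for page in doc.pages]
annots = reduce(lambda x, y: x + y, annots)
# Pull the URI out of each annotation's action dictionary, if present
links = [note.get('/A', {}).get('/URI') for note in annots]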

However, PyPDF2 does not do a good job of extracting text. Ideally, I want to extract the hyperlinks and their corresponding text.

opened by badbye 2
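
A possible way to do this with pdfminer itself, sketched under the assumption that the installed version exposes page.annots on PDFPage and resolve1 in pdfminer.pdftypes. The helper names link_from_annotation and iter_links are hypothetical. The sketch returns each link's URI together with its Rect; matching that rectangle against text bounding boxes to recover the link text is left out.

```python
from typing import Any, Callable, Iterator, Optional, Tuple

Link = Tuple[str, Optional[tuple]]


def link_from_annotation(
    annot: Any, resolve: Callable[[Any], Any] = lambda value: value
) -> Optional[Link]:
    """Return (uri, rect) for a link annotation, or None if it has no URI."""
    annot = resolve(annot)
    if not isinstance(annot, dict):
        return None
    action = resolve(annot.get("A"))
    if not isinstance(action, dict):
        return None
    uri = resolve(action.get("URI"))
    if uri is None:
        return None
    if isinstance(uri, bytes):  # PDF strings may arrive as bytes
        uri = uri.decode("utf-8", errors="replace")
    rect = resolve(annot.get("Rect"))
    return str(uri), (tuple(rect) if rect is not None else None)


def iter_links(pdf_path: str) -> Iterator[Link]:
    """Yield (uri, rect) for every URI annotation in the document."""
    # Imported lazily so link_from_annotation works without pdfminer installed.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1

    with open(pdf_path, "rb") as fh:
        for page in PDFPage.get_pages(fh):
            for annot in resolve1(page.annots) or []:
                link = link_from_annotation(annot, resolve1)
                if link is not None:
                    yield link
```

The resolve parameter exists so that indirect object references can be followed at each level; the default identity function is only meant for plain dictionaries.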

update fmttype as 6

Without handling fmttype 6, pdfminer was unable to read the file. It failed with the exception: assert False, str(('Unhandled', 6))

Please accept this change to avoid issues with such format values.

opened by GoelPri 1

Unable to decode PDFObjRef in metadata of the PDF file

Hi, I am facing this error, but unfortunately I can't modify the PDF file, so I need to handle this programmatically. Could you guide me if you have resolved it? My metadata has this field value: {'q': PDFObjRef:65, 'Q': PDFObjRef:64}, and after I resolve it, it becomes {'q': <PDFStream(65): raw=3, {'Length': 3}>, 'Q': <PDFStream(64): raw=3, {'Length': 3}>}. I am not sure how to proceed from here.

opened by reema-dass26 0
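
A possible next step, sketched under the assumption that the resolved objects are PDFStream instances, whose get_data() method returns the decoded bytes. The helper name to_plain_value is hypothetical; it walks a value, follows references with the supplied resolve function, and replaces stream-like objects with their data. It does not guard against circular references.

```python
from typing import Any, Callable


def to_plain_value(value: Any, resolve: Callable[[Any], Any] = lambda v: v) -> Any:
    """Recursively resolve references and replace streams with their bytes."""
    value = resolve(value)
    if hasattr(value, "get_data"):  # PDFStream-like object
        return value.get_data()
    if isinstance(value, dict):
        return {key: to_plain_value(item, resolve) for key, item in value.items()}
    if isinstance(value, list):
        return [to_plain_value(item, resolve) for item in value]
    return value
```

With pdfminer installed, resolve1 from pdfminer.pdftypes can be passed as the resolve argument, for example to_plain_value(metadata_value, resolve1).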

With the -O output parameter, an error is thrown

Python 3.7, pip install pdfminer

then

python tools/pdf2txt -O output input.pdf

File "/usr/local/anaconda3/envs/py37/lib/python3.7/site-packages/pdfminer/image.py", line 74, in export_image
    if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
TypeError: object of type 'zip' has no len()

opened by yangboz 0
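
The traceback is consistent with a general Python 3 behavior: zip returns a lazy iterator, which has no length and cannot be indexed, whereas in Python 2 it returned a list. The report does not show where pdfminer builds the filters value, so the sketch below only illustrates the general fix of materializing the pairs with list(). The function name pair_filters and the sample values are hypothetical.

```python
def pair_filters(filters, params):
    """Pair each filter with its parameters and return a real list."""
    return list(zip(filters, params))


# A bare zip object supports neither len() nor indexing in Python 3.
lazy_pairs = zip(["DCTDecode"], [{}])
try:
    len(lazy_pairs)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# After wrapping in list(), the length check and indexing both work.
pairs = pair_filters(["DCTDecode"], [{}])
has_single_filter = len(pairs) == 1 and pairs[0][0] == "DCTDecode"
```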

Unable to differentiate between newline and wrapped text for a table in pdf

The attached PDF file contains two different hashes, but when parsing, PDFMiner separates both a real new line and wrapped hash text with '\n', which makes it difficult to extract the hashes from the file.

opened by haritas-crest 1

Can't extract text objects

Hi,

When using pdfminer.six to extract text elements from a pdf file, I found that it doesn't work in some cases.

Pdf files: 2022 Mar quarterly report_ Ali.pdf SIA_AR_2021.pdf

Description:

File 1: no text can be extracted; however, text can be extracted once the original PDF file is converted to a printed PDF.
File 2: only part of the text can be extracted.

Code which is used:


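  from pdfminer.converter import PDFPageAggregator
  from pdfminer.layout import (
      LAParams,
      LTChar,
      LTImage,
      LTTextLineHorizontal,
      LTTextLineVertical,
  )
  from pdfminer.pdfdocument import PDFDocument
  from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
  from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
  from pdfminer.pdfparser import PDFParser


  def get_page_layout(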
      filename,
      line_overlap=0.5,
      char_margin=1.0,
      line_margin=0.5,
      word_margin=0.1,
      boxes_flow=0.5,
      detect_vertical=True,
      all_texts=True,
  ):
      """Returns a PDFMiner LTPage object and page dimension of a single
      page pdf. To get the definitions of kwargs, see
      https://pdfminersix.rtfd.io/en/latest/reference/composable.html.
      Parameters
      ----------
      filename : string
          Path to pdf file.
      line_overlap : float
      char_margin : float
      line_margin : float
      word_margin : float
      boxes_flow : float
      detect_vertical : bool
      all_texts : bool
      Returns
      -------
      layout : object
          PDFMiner LTPage object.
      dim : tuple
          Dimension of pdf page in the form (width, height).
      """
      with open(filename, "rb") as f:
          parser = PDFParser(f)
          document = PDFDocument(parser)
          if not document.is_extractable:
              raise PDFTextExtractionNotAllowed(
                  f"Text extraction is not allowed: {filename}"
              )
          laparams = LAParams(
              line_overlap=line_overlap,
              char_margin=char_margin,
              line_margin=line_margin,
              word_margin=word_margin,
              boxes_flow=boxes_flow,
              detect_vertical=detect_vertical,
              all_texts=all_texts,
          )
          rsrcmgr = PDFResourceManager()
          device = PDFPageAggregator(rsrcmgr, laparams=laparams)
          interpreter = PDFPageInterpreter(rsrcmgr, device)
          for page_num, page in enumerate(PDFPage.create_pages(document)):
              interpreter.process_page(page)
              layout = device.get_result()
              width = layout.bbox[2]
              height = layout.bbox[3]
              dim = (width, height)
          return layout, dim
  
  
  def get_text_objects(layout, ltype="char", t=None):
      """Recursively parses pdf layout to get a list of
      PDFMiner text objects.
      Parameters
      ----------
      layout : object
          PDFMiner LTPage object.
      ltype : string
          Specify 'char', 'image', 'horizontal_text', or 'vertical_text' to get
          LTChar, LTImage, LTTextLineHorizontal, or LTTextLineVertical objects
          respectively.
      t : list
      Returns
      -------
      t : list
          List of PDFMiner text objects.
      """
      if ltype == "char":
          LTObject = LTChar
      elif ltype == "image":
          LTObject = LTImage
      elif ltype == "horizontal_text":
          LTObject = LTTextLineHorizontal
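      elif ltype == "vertical_text":
          LTObject = LTTextLineVertical
      else:
          # Fail early instead of hitting an unbound LTObject below.
          raise ValueError(f"Unknown ltype: {ltype!r}")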
      if t is None:
          t = []
      try:
          for obj in layout._objs:
              if isinstance(obj, LTObject):
                  t.append(obj)
              else:
                  t += get_text_objects(obj, ltype=ltype)
      except AttributeError:
          pass
      return t

opened by tuyenta 0

Owner

Yusuke Shinyama

m33p.

GitHub https://github.com/pdfminer/pdfminer.six

Processes PDFs to make them compatible with PDF/X and with grayscale printers.

tratapdf Processes PDFs to make them compatible with PDF/X and with grayscale printers. Dependencies: icc-profiles, ghostscript, a PDF viewer.

1 Nov 30, 2021

PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

9 Jan 30, 2022

Compare-pdf - A Flask driven restful API for comparing two PDF files

COMPARE-PDF A Flask driven restful API for comparing two PDF files. Description

3 Mar 13, 2022

Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox module of the Tkinter library.

1 Feb 13, 2022

DietPDF aims at reducing PDF file size while not degrading quality nor losing metadata

6 Jul 27, 2022

Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.