Camelot is a Python library that can help you extract tables from PDFs!

Overview

Camelot: PDF Table Extraction for Humans

Note: You can also check out Excalibur, the web interface to Camelot!


Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables

>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]

>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

Camelot also comes packaged with a command-line interface!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

You can check out some frequently asked questions here.

Why Camelot?

  • Configurability: Camelot gives you control over the table extraction process with tweakable settings.
  • Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
  • Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
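
The "Metrics" point can be sketched as a small filter over each table's parsing_report. This is only an illustration: filter_tables and its default thresholds are hypothetical and not part of Camelot; the only thing assumed from the library is the parsing_report dictionary shown in the example above.

```python
def filter_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep tables whose parsing_report passes both thresholds.

    The threshold defaults are illustrative assumptions, not Camelot defaults.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Intended use with Camelot (not executed here):
#   import camelot
#   tables = camelot.read_pdf("foo.pdf")
#   good_tables = filter_tables(tables)
```

Tuning the two thresholds on a handful of known-good PDFs is probably the simplest way to choose values for a given document family.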

See comparison with similar libraries and tools.

Support the development

If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.

Installation

Using conda

The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:

$ pip install "camelot-py[base]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[base]"

Documentation

The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

Contributing

The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.

Comments
  • AttributeError from PDFMiner

    @igormp

    Although I can't upload the bad PDF due to NDA reasons, this issue is well documented here, along with some solutions to it, and there's even a PR in place to fix that, but there seems to be no maintainer available to merge it.

    I'm not sure how this should be handled, since it's a PDFMiner problem, which seems to be unmaintained, and that reflects directly on camelot.

    bug 
    opened by vinayak-mehta 15
  • Doc enhancement: Note dependency on libgs.so (libgs.dylib on Mac) for ghostscript

    The Camelot documentation highlights a dependency on Ghostscript and adds a check that confirms that the Ghostscript binary is installed. The key dependency for Camelot to run successfully is on a working copy of the libgs library (libgs.dylib for MacOS).

    Specific ask: Please add a note to the documentation stating that, in addition to running the gs binary for version info, you need a full Ghostscript distribution that includes the libraries and fonts.

    Details:

    I performed the following steps:

    • Installed the Camelot python package with no issues
    • Installed the conda-forge Ghostscript package with no issues
    • Ran the conda-forge gs per the Camelot docs with no issues

    Using the conda Ghostscript was a mistake, since the Camelot documentation suggests using the Homebrew toolchain. I thought the conda Ghostscript would be good enough, but when I ran a test script I got the following error:

    >>> import camelot
    >>> tables = camelot.read_pdf('foo.pdf')
    Traceback (most recent call last):
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 260, in <module>
        libgs = cdll.LoadLibrary("libgs.so")
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
        return self._dlltype(name)
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/ctypes/__init__.py", line 364, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: dlopen(libgs.so, 6): image not found
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
        **kwargs
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
        p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/parsers/lattice.py", line 402, in extract_tables
        self._generate_image()
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/parsers/lattice.py", line 211, in _generate_image
        from ..ext.ghostscript import Ghostscript
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
        from . import _gsprint as gs
      File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 267, in <module>
        raise RuntimeError("Please make sure that Ghostscript is installed")
    RuntimeError: Please make sure that Ghostscript is installed
    

    Looking at the conda package for Ghostscript, I determined that it only delivers the userland binaries and not the fonts and libraries. I have opened an issue with the conda packaging team and asked that the libraries and fonts be delivered as well.

    Workaround: Install Ghostscript using the Homebrew tool chain

    Owners of Camelot may argue (rightfully) that this is a case of "pilot error / not following the docs". I would just suggest that adding a note might prevent what looks like a common pilot-error situation (see here and here). The Homebrew dependency is also a pretty heavy lift (you need Xcode for Homebrew to work, so there is a lot to download and configure to get going with Camelot).

    Thanks for a great tool!

    Environment

    • macOS Catalina 10.15.6
    • Python version: Python 3.7.9 (default, Aug 31 2020, 07:22:35)
    • Ghostscript version: 9.22
    • Camelot version: 0.8.2
    bug 
    opened by jimhall 12
  • ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfpage'.

    import camelot
    Traceback (most recent call last):
      File "", line 1, in <module>
      File "/usr/local/lib/python3.7/site-packages/camelot/__init__.py", line 6, in <module>
        from .io import read_pdf
      File "/usr/local/lib/python3.7/site-packages/camelot/io.py", line 5, in <module>
        from .handlers import PDFHandler
      File "/usr/local/lib/python3.7/site-packages/camelot/handlers.py", line 9, in <module>
        from .parsers import Stream, Lattice
      File "/usr/local/lib/python3.7/site-packages/camelot/parsers/__init__.py", line 3, in <module>
        from .stream import Stream
      File "/usr/local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 10, in <module>
        from .base import BaseParser
      File "/usr/local/lib/python3.7/site-packages/camelot/parsers/base.py", line 5, in <module>
        from ..utils import get_page_layout, get_text_objects
      File "/usr/local/lib/python3.7/site-packages/camelot/utils.py", line 17, in <module>
        from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfpage' (/usr/local/lib/python3.7/site-packages/pdfminer/pdfpage.py)

    I see this error after building the project. I also see that the package pdfminer.six-20200720 was released today.

    bug 
    opened by Oleh-Hrebchuk 10
  • Report a Bug in 'table_regions'

    Describe the bug

    The 'table_regions' parameter does not work when reading a PDF with Camelot.

    When I use 'table_regions', I only get an all-or-nothing ('QUANTAL') result: (a) Camelot reads the whole table and the whole PDF, or (b) it reports 'ZeroDivisionError: float division by zero'. (I'm sure I am using the correct bbox rules.)

    Steps to reproduce the bug

    a. Read a PDF with complex tables using Camelot.
    b. Set flavor = 'stream' and pass parameters for 'table_regions'.
    c. Print the table(s) (type: DataFrame) while changing the parameters for 'table_regions'.
    d. Result: Camelot reads the whole PDF (even though the parameters cover only part of it) or reads nothing.

    Expected behavior

    Different parts of the PDF are read when I change the parameters for 'table_regions'.

    Code

    import camelot

    # 'PATH' and 'x1,y1,x2,y2' are placeholders, as in the original report
    table = camelot.read_pdf('PATH', flavor='stream', table_regions=['x1,y1,x2,y2'])
    print(table[0].df)
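
A degenerate region is one possible (unconfirmed) source of a division by zero, so it may help to validate the region string before passing it on. The sketch below assumes, as far as I understand the Camelot documentation, that each table_regions entry is a string "x1,y1,x2,y2" with (x1, y1) as the top-left and (x2, y2) as the bottom-right corner in PDF coordinate space, whose origin is at the bottom-left. parse_region is a hypothetical helper, not a Camelot function.

```python
def parse_region(region):
    """Parse 'x1,y1,x2,y2' and check that it describes a non-empty box.

    Assumes (x1, y1) is top-left and (x2, y2) is bottom-right, with the
    y axis growing upwards, so a valid box needs x1 < x2 and y1 > y2.
    """
    parts = [float(value) for value in region.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}: {region!r}")
    x1, y1, x2, y2 = parts
    if not (x1 < x2 and y1 > y2):
        raise ValueError(f"region has zero or negative size: {region!r}")
    return x1, y1, x2, y2
```

For example, parse_region("50,700,500,100") passes, while swapping the two y values raises a ValueError before any PDF is read.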

    PDF

    I could not attach the PDF since it's confidential.

    Environment

    • OS: MacOS Big Sur 11.4
    • Python version: 3.9
    • Numpy version: 1.20.3
    • OpenCV version: 4.5.2.52
    • Ghostscript version: 0.7
    • Camelot version: 0.8.2

    bug 
    opened by Yichen0975 6
  • UnicodeEncodeError when using Stream flavor

    Python 3.7 on Windows

    Using this pdf: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf

    I am running it through Camelot to convert to html using Stream flavor and I get the following error at execution of the export line, once it reaches page 4 of 8:

    "UnicodeEncodeError -'charmap' codec can't encode character '\u2010' in position y: character maps to undefined."

    Pages 1 through 3 get converted nicely - it crashes somewhere between page 4 and 5. In debug with the breakpoint after the tables.export line, it also brings me to line 19 of cp1252.py, if that's helpful.

    I am on Windows, and this seems not to be an issue on Mac. But Windows is our environment so I have to figure this out. I have done a ton of research on this error and everything for this in Python world points to either adding encoding="utf-8" or errors="ignore", but those both relate to the file.read method and can't be used in Camelot's export method.

    Any thoughts on what I could add to the script to get around this error? We can't avoid using Windows, and this seems to be the final blocker for us to really make great use of this tool for our PDFs.
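
One possible workaround, sketched under the assumption that the failure comes from the platform default codec (cp1252) being used when the file is written: skip the export call and write each table's HTML yourself with an explicit UTF-8 encoding. write_text_utf8 and can_encode are hypothetical helper names; the pandas call in the comment (DataFrame.to_html returning a string) is standard pandas, not something confirmed by this thread.

```python
def can_encode(text, codec):
    """Return True if `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def write_text_utf8(text, path):
    """Write text with an explicit UTF-8 encoding instead of the OS default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# U+2010 (HYPHEN) is the character named in the error message above.
HYPHEN = "\u2010"
hyphen_fits_cp1252 = can_encode(HYPHEN, "cp1252")  # False: cp1252 has no U+2010
hyphen_fits_utf8 = can_encode(HYPHEN, "utf-8")     # True

# Intended use with Camelot (not executed here):
#   for index, table in enumerate(tables):
#       write_text_utf8(table.df.to_html(), f"table-{index}.html")
```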

    opened by stpete111 6
  • ValueError: max() arg is an empty sequence

    When running on this document (https://www.qao.qld.gov.au/sites/qao/files/annual-reports/annual_report_2016-17.pdf), when it reaches page 4, it throws the following ValueError:

    import camelot
    camelot.read_pdf(path, pages='3', flavor='stream')

    Traceback (most recent call last):
      File "", line 2, in <module>
      File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\io.py", line 117, in read_pdf
        **kwargs
      File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\handlers.py", line 172, in parse
        p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
      File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 458, in extract_tables
        cols, rows = self._generate_columns_and_rows(table_idx, tk)
      File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 349, in _generate_columns_and_rows
        ncols = max(set(elements), key=elements.count)
    ValueError: max() arg is an empty sequence

    Easy enough to capture with a try/except, but I thought I would raise it here to let you know. Thanks for writing this package, excellent work!
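
The try/except mentioned above could look roughly like this. It is a sketch, not part of Camelot: read_pages_safely is a hypothetical helper that takes the reader function as an argument (in practice camelot.read_pdf) and processes one page at a time, so a single failing page does not abort the whole run.

```python
def read_pages_safely(read_pdf, path, pages, **kwargs):
    """Call read_pdf once per page and collect ValueErrors instead of raising.

    Returns (results, failures): results maps page -> reader output,
    failures maps page -> error message.
    """
    results, failures = {}, {}
    for page in pages:
        try:
            results[page] = read_pdf(path, pages=str(page), **kwargs)
        except ValueError as exc:
            failures[page] = str(exc)
    return results, failures


# Intended use with Camelot (not executed here):
#   import camelot
#   results, failures = read_pages_safely(
#       camelot.read_pdf, path, range(1, 9), flavor="stream")
```

Reading page by page is slower than one call over a page range, but it makes it obvious which pages trigger the error.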

    opened by SebastianDeLaile 6
  • Add pdftopng and use ghostscript as fallback

    Raised this PR just to run tests. This change would have to be handled in a backwards compatible way by adding support for multiple image conversion backends which the user could specify (ghostscript, mupdf, etc.), and setting pdftopng as the default one.

    opened by vinayak-mehta 5
  • [MRG] Use correct re.sub signature

    text_strip currently passes the regex flags in the position of the count parameter. The flag is hardcoded to re.UNICODE (value 32), so only the first 32 matches are replaced.

    See https://docs.python.org/3/library/re.html#re.sub for the signature.
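
The effect is easy to reproduce in isolation. The two functions below are illustrative stand-ins, not Camelot's actual text_strip; the buggy variant spells the mistake out with count= so it behaves the same as passing re.UNICODE as the fourth positional argument.

```python
import re


def strip_buggy(text, chars):
    """Reproduces the mistake: the flag value ends up in `count`."""
    pattern = f"[{re.escape(chars)}]"
    # Same effect as re.sub(pattern, "", text, re.UNICODE), because the
    # fourth positional parameter of re.sub is `count`, not `flags`.
    return re.sub(pattern, "", text, count=int(re.UNICODE))


def strip_fixed(text, chars):
    """Passes the flag by keyword, so every match is replaced."""
    pattern = f"[{re.escape(chars)}]"
    return re.sub(pattern, "", text, flags=re.UNICODE)


sample = "-" * 40
leftover_buggy = strip_buggy(sample, "-")  # 32 removed, 8 hyphens remain
leftover_fixed = strip_fixed(sample, "-")  # all 40 removed
```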

    opened by pevisscher 5
  • Fixed library discovery on Windows

    When gsdll.dll is not found in the usual place in the Windows registry, ctypes.util.find_library() is used as a fallback to search %PATH%, before raising RuntimeError.
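
The fallback order described here can be sketched as follows. This is not the code from the PR: locate_ghostscript, the primary_lookup hook (standing in for the registry lookup), and the candidate library names are all assumptions for illustration. Only ctypes.util.find_library is a real standard-library call.

```python
import ctypes.util


def locate_ghostscript(primary_lookup=None, names=("gs", "gsdll64", "gsdll32")):
    """Return a path or name for the Ghostscript library, or raise RuntimeError.

    primary_lookup: optional callable returning a path or None; a stand-in
    for the platform-specific lookup (for example the Windows registry).
    names: candidate library names, assumed here for illustration.
    """
    if primary_lookup is not None:
        found = primary_lookup()
        if found:
            return found
    # Fallback: let ctypes search the usual locations (PATH on Windows).
    for name in names:
        found = ctypes.util.find_library(name)
        if found:
            return found
    raise RuntimeError("Please make sure that Ghostscript is installed")
```

Keeping the primary lookup injectable makes the fallback path easy to exercise on any platform.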

    opened by KOLANICH 5
  • parsing PDF with rows in different sizes

    Hello, I've been trying to parse this PDF for a while. I have tried almost every config and even read the code to find a solution. pdf

    As you can see, the date in the first column is not at the top of the cell, so the table gets parsed like this (the ... are cells that I had to censor):

    |    | 0          | 1   | 2         | 3          | 4          |
    |----|------------|-----|-----------|------------|------------|
    | 0  | 30/07/2021 | ... | 052309    | 180,00     | 258.929,59 |
    | 1  |            | ... |           |            |            |
    | 2  |            | ... |           |            |            |
    | 3  | 30/07/2021 | ... | 041165    | 455,89     | 259.385,48 |
    | 4  |            | ... |           |            |            |
    | 5  |            | ... |           |            |            |
    | 6  | 30/07/2021 | ... | 052440    | 180,00     | 259.565,48 |
    | 7  |            | ... |           |            |            |
    | 8  |            | ... |           |            |            |
    | 9  | 30/07/2021 | ... | 052234    | 180,00     | 259.745,48 |
    | 10 |            | ... |           |            |            |
    | 11 |            | ... |           |            |            |
    | 12 | 30/07/2021 | ... | 863314    | 202,17     | 259.947,65 |
    | 13 |            | ... |           |            |            |
    | 14 |            | ... |           |            |            |
    | 15 | 30/07/2021 | ... | 875321    | 15,00      | 259.962,65 |
    | 16 |            | ... |           |            |            |
    | 17 |            | ... |           |            |            |
    | 18 | 30/07/2021 |     | 224723423 | 576,25     | 260.538,90 |
    | 19 |            | ... |           |            |            |
    | 20 |            | ... |           |            |            |
    | 21 | 30/07/2021 | ... | 873665    | 30,00      | 260.568,90 |
    | 22 |            | ... |           |            |            |
    | 23 |            | ... |           |            |            |
    | 24 | 30/07/2021 | ... | SI01053   | -15.000,00 | 245.568,90 |
    | 25 |            | ... |           |            |            |
    | 26 | 30/07/2021 | ... |           | -309,79    | 245.259,11 |
    | 27 | 30/07/2021 | ... |           | -3.120,10  | 242.139,01 |
    | 28 | 30/07/2021 | ... |           | -141,48    | 241.997,53 |
    | 29 | 30/07/2021 | ... |           | -3.089,90  | 238.907,63 |
    | 30 | 30/07/2021 | ... |           | -1.150,99  | 237.756,64 |
    | 31 | 30/07/2021 | ... |           | -383,08    | 237.373,56 |
    | 32 | 30/07/2021 | ... |           | -9.456,24  | 227.917,32 |
    | 33 | 30/07/2021 | ... |           | -570,00    | 227.347,32 |
    | 34 | 30/07/2021 | ... |           | -820,00    | 226.527,32 |
    | 35 | 30/07/2021 | ... |           | -1.487,99  | 225.039,33 |
    | 36 | 30/07/2021 | ... |           | -1.021,67  | 224.017,66 |
    | 37 | 30/07/2021 | ... |           | -965,00    | 223.052,66 |
    | 38 | 30/07/2021 | ... |           | -871,12    | 222.181,54 |
    | 39 | 30/07/2021 | ... |           | -2.441,50  | 219.740,04 |
    | 40 |            | ... |           |            |            |
    | 41 | 30/07/2021 |     | SEFAZMT-C | -933,10    | 218.806,94 |
    | 42 |            | ... |           |            |            |
    | 43 |            | ... |           |            |            |
    | 44 | 30/07/2021 |     | DARF81COO | -15.037,94 | 203.769,00 |
    | 45 |            | ... |           |            |            |
    | 46 |            | ... |           |            |            |
    | 47 |            | ... |           |            |            |
    | 48 | 30/07/2021 |     | SI01268   | -3.946,03  | 199.822,97 |
    | 49 |            | ... |           |            |            |
    | 50 |            | ... |           |            |            |
    | 51 |            | ... |           |            |            |
    | 52 |            | ... |           |            |            |
    | 53 | 30/07/2021 |     | I01299    | -1.877,45  | 197.945,52 |
    | 54 |            | ... |           |            |            |
    | 55 |            | ... |           |            |            |
    | 56 |            | ... |           |            |            |
    | 57 | 30/07/2021 | ... | I01307    | -2.636,85  | 195.308,67 |
    | 58 |            | ... |           |            |            |
    | 59 |            | ... |           |            |            |
    | 60 | 30/07/2021 |     | DARF81COO | -8.485,45  | 186.823,22 |
    | 61 |            | ... |           |            |            |
    | 62 |            | ... |           |            |            |
    | 63 |            | ... |           |            |            |
    | 64 | 30/07/2021 |     | CX198172  | 50,00      | 186.873,22 |
    | 65 |            | ... |           |            |            |
    | 66 |            | ... |           |            |            |
    | 67 |            | ... |           |            |            |

    Sorry for the long table; that was the clearest way I found to show it.

    So, is there a way to parse the second column together in one row with the others? row_tol doesn't work because the rows have different sizes across the pages.

    It would be great if there were a way to join them into rows by their colors, as the rows are striped.
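
Short of a built-in option, one post-processing idea is to fold the continuation rows into the row that carries the date. The sketch below is a heuristic, not a Camelot feature: merge_continuation_rows is a hypothetical helper, and it assumes that every row with an empty first cell belongs to the nearest preceding row with a date, which may not hold for every page of this PDF.

```python
def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold rows whose key column is empty into the previous keyed row.

    rows: list of lists of strings (for example table.df.values.tolist()).
    A row with a non-empty cell in key_col starts a new record; any other
    row has its non-empty cells appended to the current record.
    """
    merged = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if cells[key_col] or not merged:
            merged.append(cells)
            continue
        current = merged[-1]
        for index, cell in enumerate(cells):
            if cell:
                current[index] = (current[index] + sep + cell).strip()
    return merged
```

If the description lines can also appear above the dated row, the same loop could be run over the rows in reverse order, or the rule replaced with one based on vertical distance.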

    Thanks in advance!

    opened by alissonsv 4
  • How gsdll on Windows?

    Hi team,

    Really need help here. https://stackoverflow.com/questions/69064465/how-to-feed-ghostscript-dll-library-to-python-in-windows I have installed the Ghostscript app for Windows, but Python still does not "see" it. Details are in the link.

    Thanks in advance

    opened by andkirby 4
  • We need more maintainers

    It seems like camelot is dead:

    • Last commit: 2021-07-11 - @dimitern is the only other project owner besides @vinayak-mehta
    • Last PyPI release: 2021-07-11 - @vinayak-mehta is the only owner
    • Several PRs which look ready to be merged, but are still open

    Besides the owner there are only 35 other contributors.

    https://opencollective.com/camelot might be another way to check if it's dead.

    Does anybody know more? Should we try to transfer the project to https://github.com/jazzband ?

    project-governance 
    opened by MartinThoma 2
  • While uploading PDF Camelot is unable to read its content

    When I try to read a PDF with Camelot, it fails to read the table: I get only the data for column 0 and nothing else.

    Environment

    • OS: Windows
    • Python version: 3.9.12
    • Numpy version: 1.22.3
    • OpenCV version: 4.5.5.64
    • Ghostscript version: 0.7
    • Camelot version: 0.10.1

    Link for PDF

    https://www.irf.com/product-info/datasheets/data/irhm9150.pdf

    bug 
    opened by saidakyuz 0
  • No module named ghostscript

    After following Camelot instructions (and a few other dead ends), python is unable to find ghostscript module.

    Suggest:

    1. Adding a check during install if the ghostscript python api is installed.
    2. Updating the instructions and/or the install process, if appropriate

    Steps to reproduce the bug

    Installation:

    1. brew update; brew upgrade; # Upgrade and update homebrew
    2. brew install ghostscript
    3. conda -v -n my_env -c conda-forge camelot-py
      • Note: this also appears to install Anaconda's ghostscript!

    Steps to be used to reproduce behavior:

    1. python3 -c "from ctypes.util import find_library; print(find_library(\"gs\"))" # Outside of conda env
    /usr/local/lib/libgs.dylib@ -> ../Cellar/ghostscript/10.0.0/lib/libgs.dylib
    
    1. python3 -c "import ghostscript"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'ghostscript'
    
    1. conda activate my_env
    2. python -c "from ctypes.util import find_library; print(find_library(\"gs\"))" # Per Camelot Docs
    /Users/<username>/opt/anaconda3/envs/finance/bin/../lib/libgs.dylib
    
    1. python -c "import ghostscript"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'ghostscript'
    

    Expected behavior

    1. ghostscript imports successfully
    2. OR ... some sort of error is thrown during install to notify the user of missing deps
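
A check along the lines of the second expectation can be sketched with the standard library alone. missing_modules is a hypothetical helper, not something Camelot ships; it only reports whether a top-level module can be found, without importing it.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Intended use (not executed here): fail early with a clear message.
#   missing = missing_modules(["ghostscript"])
#   if missing:
#       raise RuntimeError("Missing Python modules: " + ", ".join(missing))
```

Note that this only covers the Python module; whether the libgs shared library itself is discoverable is a separate question, as the find_library output above shows.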

    Code

    import camelot
    tables = camelot.read_pdf("./example.pdf")
    

    Environment

    • OS: macOS 12.4
      • uname -a
      • Darwin mylappy.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64
    • Python version (Conda complete dev env): 3.9.15
    • Python version (Conda standalone env): 3.11.0
    • Python version (System): 3.10.9
    • Numpy version (Conda complete dev env): 1.24.0
    • Numpy version (Conda standalone env): 1.24.1
    • OpenCV version (Conda, both envs): 4.6.0
    • Ghostscript version (Conda, both envs): 9.54
    • Ghostscript version (System): 10.0
    • Camelot version (Conda, both envs): 0.10.1

    Additional context

    I have a development environment in conda with more deps, and I also replicated the problem in a fresh env; hopefully the delineation is clear in the environment specs.

    bug 
    opened by HepaxCodex 0
  • PdfFileReader is deprecated and was removed in PyPDF2 3.0.0

    Describe the bug

    Version 3.0.0 of PyPDF2 was just released today (23 Dec 2022), which includes a breaking change for removing PdfFileReader (see changelog). As a result, all new installs and usage of camelot-py will raise the following exception:

    Traceback (most recent call last):
      File "test.py", line 9, in <module>
        camelot.read_pdf(PDF_FILE_PATH)
      File ".venv/py37/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
        **kwargs
      File ".venv/py37/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
        self._save_page(self.filepath, p, tempdir)
      File ".venv/py37/lib/python3.7/site-packages/camelot/handlers.py", line 111, in _save_page
        infile = PdfFileReader(fileobj, strict=False)
      File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1974, in __init__
        deprecation_with_replacement("PdfFileReader", "PdfReader", "3.0.0")
      File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_utils.py", line 369, in deprecation_with_replacement
        deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))
      File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_utils.py", line 351, in deprecation
        raise DeprecationError(msg)
    PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
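
To tell in advance whether an environment is affected, one could compare the installed PyPDF2 version against 3.0.0, the release named in the error message. This is a sketch with hypothetical helper names; it uses importlib.metadata, which needs Python 3.8 or later (newer than the 3.7 environment in this report).

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '3.0.0'."""
    return int(version_string.split(".")[0])


def pdffilereader_removed(version_string):
    """True for PyPDF2 3.0.0 and later, where PdfFileReader raises an error."""
    return major_version(version_string) >= 3


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if not installed."""
    try:
        return metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return None
```

If the check comes back true, pinning PyPDF2 below 3.0.0 in the environment is one plausible stopgap until the calling code is updated.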
    

    Steps to reproduce the bug

    1. Create a new virtualenv
    2. Install camelot-py:
      pip install camelot-py[base]
      
    3. Run the following code:
      	import camelot
      
      	# replace with a valid path on your local filesystem
      	PDF_FILE_PATH = "/path/to/file.pdf"
      
      	# raises an exception from PyPDF2
      	camelot.read_pdf(PDF_FILE_PATH)
      

    Expected behavior

    The code above should execute without any exceptions.

    Environment

    • OS: macOS 12.3.1
    • Python version: 3.7
    • Numpy version: 1.24.0
    • OpenCV version: 4.6.0.66
    • Ghostscript version: 0.7
    • Camelot version: 0.10.1
    bug 
    opened by szeswee 20
  • Camelot returns tables that contain no text (Where text should be detectable)

    I'm trying to extract data from some ~900 certificates. These certificates have an identical visual structure, but are published by different parties. For the majority of files the extraction works. However, for several dozen files, the table-structure returned by Camelot contains only empty strings.

    Plotting grid and text shows content is detected (e.g. Table 7 in DS_3663.pdf): DS_3663_table_7_grid

    DS_3663_table_7_text

    I'm using this command to read the pdf and create the tables:

    >>> tables=camelot.read_pdf('pdfs/DS_3663.pdf', pages='1-end', line_scale=110, shift_text=[''])

    e.g. Table 7 contains this data: >>> tables[7].data [['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', '']]

    Here are a few more example pdfs where the extraction fails in an identical manner: DS_885.pdf DS_2481.pdf DS_2083.pdf

    Parsing all of these files with pdf2txt.py successfully extracts text, so I assume it should be possible to get a result with Camelot as well.

    Environment

    • OS: Ubuntu 22.04.1 LTS
    • Python version: 3.10.6
    • Numpy version: 1.23.4
    • OpenCV version: 4.6.0.66
    • Ghostscript version: 9.55.0
    • Camelot version: 0.9.0

    I've tried debugging this, but had difficulty understanding the intricate code in the bbox sections. From what I've figured out, it appears that Camelot is unable to marry horizontal_text (which contains the relevant text) with the line-grid.

    bug 
    opened by peletiah 0