Camelot is a Python library that can help you extract tables from PDFs!

Last update: Jan 3, 2023

Related tags

PDF Files Processing camelot

Overview

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that can help you extract tables from PDFs!

Note: You can also check out Excalibur, the web interface to Camelot!

Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
 
 
  
   
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings |  |  |  |
|---|---|---|---|---|---|---|
|  |  |  | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
   
  
 
Camelot also comes packaged with a command-line interface! 
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".) 
You can check out some frequently asked questions here. 

  Why Camelot? 
 
 Configurability: Camelot gives you control over the table extraction process with tweakable settings. 
 Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table. 
 Output: Each table is extracted into a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite. 
 
See comparison with similar libraries and tools. 
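
The metrics point can be made concrete with a small filter. The sketch below is an illustration only, not part of Camelot: `keep_good_tables` and its thresholds are hypothetical, and it works on plain dictionaries shaped like the `parsing_report` shown earlier, so it runs without a PDF. The second report is made-up sample data.

```python
def keep_good_tables(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the reports that pass both (hypothetical) thresholds."""
    return [
        r for r in reports
        if r["accuracy"] >= min_accuracy and r["whitespace"] <= max_whitespace
    ]


reports = [
    # The report shown in the README example above.
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},
    # Made-up report standing in for a poorly detected table.
    {"accuracy": 41.5, "whitespace": 80.0, "order": 1, "page": 2},
]

good = keep_good_tables(reports)
print([r["page"] for r in good])
```

With real output, the input would presumably be built with something like `[t.parsing_report for t in tables]`; suitable thresholds depend on the documents being processed.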

  Support the development 
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective. 

  Installation 

  Using conda 
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution. 
$ conda install -c conda-forge camelot-py
 

  Using pip 
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot: 
$ pip install "camelot-py[base]"
 

  From the source code 
After installing the dependencies, clone the repo using: 
$ git clone https://www.github.com/camelot-dev/camelot
 
and install Camelot using pip: 
$ cd camelot
$ pip install ".[base]"
 

  Documentation 
The documentation is available at http://camelot-py.readthedocs.io/. 

  Wrappers 
 
 camelot-php provides a PHP wrapper on Camelot. 
 

  Contributing 
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests. 

  Versioning 
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md. 

  License 
This project is licensed under the MIT License, see the LICENSE file for details.


Comments

AttributeError from PDFMiner

@igormp

Although I can't upload the bad PDF due to NDA reasons, this issue is well documented here, along with some solutions to it, and there's even a PR in place to fix that, but there seems to be no maintainer available to merge it.

I'm not sure how this should be handled, since it's a PDFMiner problem, which seems to be unmaintained, and that reflects directly on camelot.

bug

opened by vinayak-mehta 15

Doc enhancement: Note dependency on libgs.so (libgs.dylib on Mac) for ghostscript

The Camelot documentation highlights a dependency on Ghostscript and adds a check that confirms that the Ghostscript binary is installed. The key dependency for Camelot to run successfully is on a working copy of the libgs library (libgs.dylib for MacOS).

Specific ask: Please add a note to the documentation stating that, in addition to running the gs binary to check its version, you need a full Ghostscript distribution that includes the libraries and fonts.

Details:

I performed the following steps:

Installed the Camelot python package with no issues
Installed the conda-forge Ghostscript package with no issues
Ran the conda-forge gs per the Camelot docs with no issues

Using the conda Ghostscript was a mistake, and the Camelot documentation does suggest using the Homebrew toolchain. I thought the conda Ghostscript would be good enough, but when I ran a test script I got the following error:

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
Traceback (most recent call last):
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 260, in <module>
    libgs = cdll.LoadLibrary("libgs.so")
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
    return self._dlltype(name)
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(libgs.so, 6): image not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
    **kwargs
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/parsers/lattice.py", line 402, in extract_tables
    self._generate_image()
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/parsers/lattice.py", line 211, in _generate_image
    from ..ext.ghostscript import Ghostscript
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
    from . import _gsprint as gs
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 267, in <module>
    raise RuntimeError("Please make sure that Ghostscript is installed")
RuntimeError: Please make sure that Ghostscript is installed

Looking at the conda package for Ghostscript I determined it only delivered the userland binaries and not the fonts and libraries. I have opened an issue with conda packaging team and asked that the binaries and fonts be delivered.

Workaround: Install Ghostscript using the Homebrew tool chain
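
The distinction drawn above, between the gs executable and the libgs shared library, can be checked from Python. This is a minimal sketch using only the standard library; `ghostscript_status` is a hypothetical helper name, and the `find_library("gs")` call is the same one used elsewhere on this page.

```python
import ctypes.util
import shutil


def ghostscript_status():
    """Report where the gs executable and the libgs shared library are found."""
    return {
        # The command-line program, looked up on PATH.
        "binary": shutil.which("gs"),
        # The shared library (libgs.so / libgs.dylib) that ctypes would load.
        "library": ctypes.util.find_library("gs"),
    }


status = ghostscript_status()
for part, location in status.items():
    print(f"{part}: {location or 'not found'}")
```

A result where the binary is found but the library is not would match the situation described in this report.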

Owners of Camelot may argue (rightfully) that this is a case of "pilot error / not following the docs". I would just suggest that adding a note might prevent what looks like a common pilot-error situation (see here and here). The Homebrew dependency is also a fairly heavy lift: you need Xcode for Homebrew to work, so there is a lot to download and configure before you can get going with Camelot.

Thanks for a great tool!

Environment

macOS Catalina 10.15.6
Python version: Python 3.7.9 (default, Aug 31 2020, 07:22:35)
Ghostscript version: 9.22
Camelot version: 0.8.2

bug

opened by jimhall 12

ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfpage'.

import camelot
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/camelot/__init__.py", line 6, in <module>
    from .io import read_pdf
  File "/usr/local/lib/python3.7/site-packages/camelot/io.py", line 5, in <module>
    from .handlers import PDFHandler
  File "/usr/local/lib/python3.7/site-packages/camelot/handlers.py", line 9, in <module>
    from .parsers import Stream, Lattice
  File "/usr/local/lib/python3.7/site-packages/camelot/parsers/__init__.py", line 3, in <module>
    from .stream import Stream
  File "/usr/local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 10, in <module>
    from .base import BaseParser
  File "/usr/local/lib/python3.7/site-packages/camelot/parsers/base.py", line 5, in <module>
    from ..utils import get_page_layout, get_text_objects
  File "/usr/local/lib/python3.7/site-packages/camelot/utils.py", line 17, in <module>
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfpage' (/usr/local/lib/python3.7/site-packages/pdfminer/pdfpage.py)

I see this error after building the project. I also see that the package pdfminer.six-20200720 was released today.
bug

opened by Oleh-Hrebchuk 10
Report a Bug in 'table_regions'
Describe the bug

The 'table_regions' parameter does not work when reading a PDF with Camelot.

With 'table_regions' set, I get one of only two results: (a) Camelot reads the whole table and the whole PDF, or (b) it reports 'ZeroDivisionError: float division by zero'. (I'm sure I am using the correct bbox rules.)

Steps to reproduce the bug

a. Read a PDF with complex tables using Camelot.
b. Set flavor='stream' and pass 'table_regions'.
c. Print the resulting tables (as DataFrames) while changing the 'table_regions' values.
d. Result: Camelot reads the whole PDF (even though the regions cover only part of it) or reads nothing.

Expected behavior

Different parts of the PDF are read when I change 'table_regions'.

Code

	import camelot

	table = camelot.read_pdf('PATH', flavor='stream', table_regions=['x1,y1,x2,y2'])
	print(table[0].df)

PDF

I could not attach the PDF since it's confidential.
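
Region strings can be sanity-checked before they are handed to Camelot. The sketch below assumes the convention from the Camelot documentation, where a region is 'x1,y1,x2,y2' with (x1, y1) the top-left and (x2, y2) the bottom-right corner in PDF coordinates. `parse_region` is a hypothetical helper and the sample numbers are made up. It does not diagnose the bug reported here; it only rules out malformed or zero-area input.

```python
def parse_region(region):
    """Parse 'x1,y1,x2,y2' and check that it describes a non-empty box.

    Assumes (x1, y1) is the top-left and (x2, y2) the bottom-right corner
    in PDF coordinates, where y grows upward from the bottom of the page.
    """
    parts = [float(p) for p in region.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 numbers, got {len(parts)}: {region!r}")
    x1, y1, x2, y2 = parts
    if x1 >= x2 or y1 <= y2:
        raise ValueError(f"region has no area or corners are swapped: {region!r}")
    return x1, y1, x2, y2


print(parse_region("50,700,550,400"))
```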

Environment

OS: MacOS Big Sur 11.4

Python version: 3.9

Numpy version: 1.20.3

OpenCV version: 4.5.2.52

Ghostscript version: 0.7

Camelot version: 0.8.2

bug
opened by Yichen0975 6
UnicodeEncodeError when using Stream flavor

Python 3.7 on Windows

Using this pdf: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf

I am running it through Camelot to convert to html using Stream flavor and I get the following error at execution of the export line, once it reaches page 4 of 8:

"UnicodeEncodeError -'charmap' codec can't encode character '\u2010' in position y: character maps to undefined."

Pages 1 through 3 get converted nicely - it crashes somewhere between page 4 and 5. In debug with the breakpoint after the tables.export line, it also brings me to line 19 of cp1252.py, if that's helpful.

I am on Windows, and this seems not to be an issue on Mac. But Windows is our environment so I have to figure this out. I have done a ton of research on this error and everything for this in Python world points to either adding encoding="utf-8" or errors="ignore", but those both relate to the file.read method and can't be used in Camelot's export method.

Any thoughts on what I could add to the script to get around this error? We can't avoid using Windows, and this seems to be the final blocker for us for being able to really make great use of this tool for our PDF's.
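
The failure itself can be reproduced without Camelot, which may help narrow it down. The sketch below assumes the export writes the file with the platform default encoding (cp1252 here, as the stop in cp1252.py suggests); that is an assumption about Camelot's behavior, not something confirmed in this thread. The sample string is made up.

```python
import os
import tempfile

# Made-up cell text containing U+2010 HYPHEN, the character from the error.
html = "<td>2014\u2010001</td>"

# cp1252 has no mapping for U+2010, which reproduces the reported failure.
try:
    html.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# Writing the same text with an explicit UTF-8 encoding succeeds.
path = os.path.join(tempfile.mkdtemp(), "table.html")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(html)

with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()
print(round_trip == html)
```

If that assumption holds, one possible workaround is to skip the export call, take the HTML string from `tables[0].df.to_html()` (a pandas method), and write it to disk yourself with `encoding="utf-8"` as above.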

opened by stpete111 6
ValueError: max() arg is an empty sequence

When running on this document (https://www.qao.qld.gov.au/sites/qao/files/annual-reports/annual_report_2016-17.pdf), when it reaches page 4, it throws the following ValueError:

import camelot
camelot.read_pdf(path, pages='3', flavor='stream')

Traceback (most recent call last):
  File "", line 2, in <module>
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\io.py", line 117, in read_pdf
    **kwargs
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\handlers.py", line 172, in parse
    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 458, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 349, in _generate_columns_and_rows
    ncols = max(set(elements), key=elements.count)
ValueError: max() arg is an empty sequence

Easy enough to capture with a try/except but thought I would pop it up here to let you know Thanks for writing this package, excellent work!
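
The failing expression in the traceback, `max(set(elements), key=elements.count)`, picks the most frequent value and raises when `elements` is empty. Besides a try/except, the `default` argument of `max()` avoids the exception. This is a standalone sketch; `most_common` is a hypothetical helper, and whether 0 is a sensible fallback inside Camelot is an open question.

```python
def most_common(elements, default=0):
    """Return the most frequent value in `elements`, or `default` if empty."""
    # max() raises ValueError on an empty iterable unless `default` is given.
    return max(set(elements), key=elements.count, default=default)


print(most_common([3, 3, 2]))  # 3
print(most_common([]))         # 0
```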

opened by SebastianDeLaile 6
Add pdftopng and use ghostscript as fallback

Raised this PR just to run tests. This change would have to be handled in a backwards compatible way by adding support for multiple image conversion backends which the user could specify (ghostscript, mupdf, etc.), and setting pdftopng as the default one.

opened by vinayak-mehta 5
[MRG] Use correct re.sub signature

text_strip currently passes the regex flags in the position of re.sub's count parameter. The flag is hardcoded to re.UNICODE (value 32), so only the first 32 matches are replaced.

see https://docs.python.org/3/library/re.html#re.sub for the signature
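
The effect is easy to see in isolation. In the sketch below the sample string is made up, and the mistaken call is written with an explicit `count=` keyword so the misplaced value is visible; passing the flag as the fourth positional argument has the same effect.

```python
import re

text = "x" * 40

# re.sub(pattern, repl, string, count=0, flags=0): the fourth positional
# slot is `count`, so a flag placed there is read as a replacement limit.
buggy = re.sub("x", "", text, count=int(re.UNICODE))  # count == 32

# Passing the flag by keyword leaves count at 0, meaning "replace all".
fixed = re.sub("x", "", text, flags=re.UNICODE)

print(len(buggy))  # 8 characters are left
print(len(fixed))  # 0
```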

opened by pevisscher 5
Fixed library discovery on Windows

When gsdll.dll is not found in the usual place in the Windows registry, ctypes.util.find_library() is used as a fallback to search %PATH%, before raising RuntimeError.
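
A fallback of this kind might look roughly like the sketch below. It is not the code from this PR: `find_first_library` is a hypothetical helper, the registry lookup is omitted, and the candidate library names are assumptions for illustration.

```python
import ctypes.util


def find_first_library(names):
    """Return the first library that ctypes can locate, else None."""
    for name in names:
        located = ctypes.util.find_library(name)
        if located:
            return located
    return None


# Candidate names are assumptions; adjust them to the installed Ghostscript.
located = find_first_library(["gsdll64", "gsdll32", "gs"])
if located is None:
    print("Ghostscript library not found on the search path")
else:
    print("Found:", located)
```

Returning None instead of raising leaves it to the caller to decide when to raise RuntimeError.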

opened by KOLANICH 5
parsing PDF with rows in different sizes

Hello, I've been trying to parse this PDF for a while. I have tried almost every configuration and have even read the code looking for a solution.

As you can see, the date in the first column is not at the top of the cell, so the table gets parsed like this (the ... are cells that I had to censor):

| | 0 |---- |----------------- | 0 | 30/07/2021| | 1 | | 2 | | 3 | 30/07/2021 | | 4 | | 5 | | 6 | 30/07/2021 | | 7 | | 8 | | 9 | 30/07/2021 | | 10 | | 11 | | 12 | 30/07/2021 | | 13 | | 14 | | 15 | 30/07/2021 | | 16 | | 17 | | 18 | 30/07/2021 | | 19 | | 20 | | 21 | 30/07/2021 | | 22 | | 23 | | 24 | 30/07/2021 | ... | | 25 | | 26 | 30/07/2021 | ... | | 27 | 30/07/2021 | ... | | 28 | 30/07/2021 | ... | | 29 | 30/07/2021 | ... | | 30 | 30/07/2021 | ... | | 31 | 30/07/2021 | ... | | 32 | 30/07/2021 | ... | | 33 | 30/07/2021 | ... | | 34 | 30/07/2021 | ... | | 35 | 30/07/2021 | ... | | 36 | 30/07/2021 | ... | | 37 | 30/07/2021 | ... | | 38 | 30/07/2021 | ... | | 39 | 30/07/2021 | ... | | 40 | | 41 | 30/07/2021 | | 42 | | 43 | | 44 | 30/07/2021 | | 45 | | 46 | | 47 | | 48 | 30/07/2021 | | 49 | | 50 | | 51 | | 52 | | 53 | 30/07/2021 | | 54 | | 55 | | 56 | | 57 | 30/07/2021 | | 58 | | 59 | | 60 | 30/07/2021 | | 61 | | 62 | | 63 | | 64 | 30/07/2021 | | 65 | | 66 | | 67 | | 1 | 2 | 3 | 4 | | ------ | ----------------- | --------------- | --------------- | ... | 052309 | 180,00 | 258.929,59 | | ... | | | | | ... | | | | ... | 041165 | 455,89 | 259.385,48 | | ... | | | | | ... | | | | ... | 052440 | 180,00 | 259.565,48 | | ... | | | | | ... | | | | ... | 052234 | 180,00 | 259.745,48 | | ... | | | | | ... | | | | ... | 863314 | 202,17 | 259.947,65 | | ... | | | | | ... | | | | ... | 875321 | 15,00 | 259.962,65 | | ... | | | | | ... | | | | | 224723423 | 576,25 | 260.538,90 | | ... | | | | | ... | | | | ... | 873665 | 30,00 | 260.568,90 | | ... | | | | | ... | | | | SI01053 | -15.000,00 | 245.568,90 | | ... | | | | | -309,79 | 245.259,11 | | -3.120,10 | 242.139,01 | | -141,48 | 241.997,53 | | -3.089,90 | 238.907,63 | | -1.150,99 | 237.756,64 | | -383,08 | 237.373,56 | | -9.456,24 | 227.917,32 | | -570,00 | 227.347,32 | | -820,00 | 226.527,32 | | -1.487,99 | 225.039,33 | | -1.021,67 | 224.017,66 | | -965,00 | 223.052,66 | | -871,12 | 222.181,54 | | -2.441,50 | 219.740,04 | | ... 
| | | | | SEFAZMT-C | -933,10 | 218.806,94 | | ... | | | | | ... | | | | | DARF81COO | -15.037,94 | 203.769,00 | | ... | | | | | ... | | | | | ... | | | | | SI01268 | -3.946,03 | 199.822,97 | | ... | | | | | ... | | | | | ... | | | | | ... | | | | | I01299 | -1.877,45 | 197.945,52 | | ... | | | | | ... | | | | | ... | | | | ... | I01307 | -2.636,85 | 195.308,67 | | ... | | | | | ... | | | | |DARF81COO | -8.485,45 | 186.823,22 | | ... | | | | | ... | | | | | ... | | | | | CX198172 | 50,00 |186.873,22 | | ... | | | | | ... | | | | | ... | | | |

Sorry for the long table; that was the clearest way I found to show it.

So, is there a way to parse the second column together in one row with the others? row_tol doesn't work, as the rows have different sizes across the pages.

It would be great if there were a way to join them into rows by their colors, as the rows are striped.

Thanks in advance!

opened by alissonsv 4
How gsdll on Windows?

Hi team,

Really need help here. https://stackoverflow.com/questions/69064465/how-to-feed-ghostscript-dll-library-to-python-in-windows I have installed the ghostscript app for windows, but Python still does not "see" it. Details by the link.

Thanks in advance

opened by andkirby 4
We need more maintainers
It seems like camelot is dead:

Last commit: 2021-07-11 - @dimitern is the only other project owner besides @vinayak-mehta

Last PyPI release: 2021-07-11 - @vinayak-mehta is the only owner

Several PRs which look ready to be merged, but are still open

Besides the owner there are only 35 other contributors.

https://opencollective.com/camelot might be another way to check if it's dead.

Does anybody know more? Should we try to transfer the project to https://github.com/jazzband ?
project-governance
opened by MartinThoma 2
While uploading PDF Camelot is unable to read its content
While trying to read a PDF, Camelot is unable to read its table: I am getting only the data from column 0 and nothing else.

Environment

OS: Windows

Python version: 3.9.12

Numpy version: 1.22.3

OpenCV version: 4.5.5.64

Ghostscript version: 0.7

Camelot version: 0.10.1

Link for PDF

https://www.irf.com/product-info/datasheets/data/irhm9150.pdf
bug
opened by saidakyuz 0
No module named ghostscript
After following the Camelot instructions (and a few other dead ends), Python is unable to find the ghostscript module.

Suggest:

Adding a check during install for whether the ghostscript Python API is installed.

Updating the instructions and/or install process if appropriate

Steps to reproduce the bug

Installation:

brew update; brew upgrade; # Upgrade and update homebrew

brew install ghostscript

conda -v -n my_env -c conda-forge camelot-py

Note: this also appears to install Anaconda's Ghostscript!

Steps to be used to reproduce behavior:

python3 -c "from ctypes.util import find_library; print(find_library(\"gs\"))" # Outside of conda env

/usr/local/lib/libgs.dylib@ -> ../Cellar/ghostscript/10.0.0/lib/libgs.dylib

python3 -c "import ghostscript"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'ghostscript'

conda activate my_env

python -c "from ctypes.util import find_library; print(find_library(\"gs\"))" # Per Camelot Docs

/Users/<username>/opt/anaconda3/envs/finance/bin/../lib/libgs.dylib

python -c "import ghostscript"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'ghostscript'

Expected behavior

ghostscript imports successfully

OR ... some sort of error is thrown during install to notify the user of missing deps

Code

import camelot
tables = camelot.read_pdf("./example.pdf")

Environment

OS: macOS 12.4

uname -a

Darwin mylappy.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64

Python version (Conda complete dev env): 3.9.15

Python version (Conda standalone env): 3.11.0

Python version (System): 3.10.9

Numpy version (Conda complete dev env): 1.24.0

Numpy version (Conda standalone env): 1.24.1

OpenCV version (Conda, both envs): 4.6.0

Ghostscript version (Conda, both envs): 9.54

Ghostscript version (System): 10.0

Camelot version (Conda, both envs): 0.10.1

Additional context

I have a development environment in conda with more deps, and I also replicated the problem in a fresh env; hopefully the delineation is clear in the environment specs.
bug
opened by HepaxCodex 0

PdfFileReader is deprecated and was removed in PyPDF2 3.0.0

Describe the bug

Version 3.0.0 of PyPDF2 was just released today (23 Dec 2022), which includes a breaking change for removing PdfFileReader (see changelog). As a result, all new installs and usage of camelot-py will raise the following exception:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    camelot.read_pdf(PDF_FILE_PATH)
  File ".venv/py37/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
    **kwargs
  File ".venv/py37/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
    self._save_page(self.filepath, p, tempdir)
  File ".venv/py37/lib/python3.7/site-packages/camelot/handlers.py", line 111, in _save_page
    infile = PdfFileReader(fileobj, strict=False)
  File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1974, in __init__
    deprecation_with_replacement("PdfFileReader", "PdfReader", "3.0.0")
  File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_utils.py", line 369, in deprecation_with_replacement
    deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))
  File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_utils.py", line 351, in deprecation
    raise DeprecationError(msg)
PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
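
Whether an environment is affected can be checked before calling Camelot. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The helper names are hypothetical, and the only rule encoded is the one stated in this report: PdfFileReader was removed in PyPDF2 3.0.0.

```python
from importlib import metadata


def removes_pdf_file_reader(version):
    """True if this PyPDF2 version string is 3.0.0 or later."""
    major = int(version.split(".")[0])
    return major >= 3


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if absent."""
    try:
        return metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return None


version = installed_pypdf2_version()
if version is None:
    print("PyPDF2 is not installed")
elif removes_pdf_file_reader(version):
    print(f"PyPDF2 {version}: PdfFileReader has been removed")
else:
    print(f"PyPDF2 {version}: PdfFileReader is still available")
```

One possible stopgap, until Camelot itself switches to PdfReader, is to pin PyPDF2 below 3.0 in the environment.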

Steps to reproduce the bug

Create a new virtualenv
Install camelot-py:
```
pip install camelot-py[base]
```

Run the following code:

	import camelot

	# replace with a valid path on your local filesystem
	PDF_FILE_PATH = "/path/to/file.pdf"

	# raises an exception from PyPDF2
	camelot.read_pdf(PDF_FILE_PATH)

Expected behavior

The code above should execute without any exceptions.

Environment

OS: macOS 12.3.1
Python version: 3.7
Numpy version: 1.24.0
OpenCV version: 4.6.0.66
Ghostscript version: 0.7
Camelot version: 0.10.1

bug

opened by szeswee 20

Camelot returns tables that contain no text (Where text should be detectable)
I'm trying to extract data from some ~900 certificates. These certificates have an identical visual structure, but are published by different parties. For the majority of files the extraction works. However, for several dozen files, the table-structure returned by Camelot contains only empty strings.

Plotting grid and text shows content is detected (e.g. Table 7 in DS_3663.pdf):

I'm using this command to read the pdf and create the tables:

>>> tables=camelot.read_pdf('pdfs/DS_3663.pdf', pages='1-end', line_scale=110, shift_text=[''])

e.g. Table 7 contains this data: >>> tables[7].data [['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', '']]

Here are a few more example pdfs where the extraction fails in an identical manner: DS_885.pdf DS_2481.pdf DS_2083.pdf

Parsing all of these files with pdf2txt.py successfully extracts text, so I assume it should be possible to get a result with Camelot as well.

Environment

OS: Ubuntu 22.04.1 LTS

Python version: 3.10.6

Numpy version: 1.23.4

OpenCV version: 4.6.0.66

Ghostscript version: 9.55.0

Camelot version: 0.9.0

I've tried debugging this, but had difficulties understanding the intricate code in the bbox-sections. From what I've figured out, it appears to me that Camelot is unable to marry horizontal_text (Which contain the relevant text) with the line-grid.
bug
opened by peletiah 0

Owner

Camelot and Excalibur: PDF Table Extraction for Humans

GitHub https://camelot-py.readthedocs.io

