Camelot: PDF Table Extraction for Humans
Camelot is a Python library that can help you extract tables from PDFs!
Note: You can also check out Excalibur, the web interface to Camelot!
Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
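| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings: Improved Speed | Percent Fuel Savings: Decreased Accel | Percent Fuel Savings: Eliminate Stops | Percent Fuel Savings: Decreased Idle |
| --- | --- | --- | --- | --- | --- | --- |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
Camelot also comes packaged with a command-line interface!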
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions here.
Why Camelot?
- Configurability: Camelot gives you control over the table extraction process with tweakable settings.
- Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- Output: Each table is extracted into a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
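The metrics point can be illustrated with a small filter over parsing reports. This is a minimal sketch, assuming each report is a dict with the `accuracy` and `whitespace` keys shown in the `parsing_report` example above; the `keep_good_reports` helper and its threshold values are illustrative assumptions, not part of Camelot.

```python
def keep_good_reports(reports, min_accuracy=90.0, max_whitespace=30.0):
    """Return only the reports that pass both (hypothetical) thresholds."""
    return [
        r for r in reports
        if r["accuracy"] >= min_accuracy and r["whitespace"] <= max_whitespace
    ]

# Stand-in data shaped like tables[i].parsing_report
reports = [
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},
    {"accuracy": 41.50, "whitespace": 63.00, "order": 2, "page": 1},
]
good = keep_good_reports(reports)
print([r["order"] for r in good])
```

With real output, the same test could be applied to the `parsing_report` of each table in `tables` before exporting.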
See comparison with similar libraries and tools.
Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.
Installation
Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
Documentation
The documentation is available at http://camelot-py.readthedocs.io/.
Wrappers
- camelot-php provides a PHP wrapper on Camelot.
Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License, see the LICENSE file for details.
Comments
AttributeError from PDFMiner
opened by vinayak-mehta 15
@igormp
Although I can't upload the bad PDF for NDA reasons, this issue is well documented here, along with some solutions to it. There's even a PR in place to fix it, but there seems to be no maintainer available to merge it.
I'm not sure how this should be handled, since it's a PDFMiner problem, which seems to be unmaintained, and that reflects directly on camelot.
Doc enhancement: Note dependency on libgs.so (libgs.dylib on Mac) for ghostscript
opened by jimhall 12
The Camelot documentation highlights a dependency on Ghostscript and adds a check that confirms that the Ghostscript binary is installed. The key dependency for Camelot to run successfully, however, is a working copy of the libgs library (libgs.dylib on macOS).
Specific ask: In addition to the instruction to run the gs binary for version info, please add a note to the documentation stating that you require a full Ghostscript distribution that includes the libraries and fonts.
Details:
I performed the following steps:
- Installed the Camelot python package with no issues
- Installed the conda-forge Ghostscript package with no issues
- Ran the conda-forge gs per the Camelot docs with no issues
Using the conda Ghostscript was a mistake, since the Camelot documentation suggests using the Homebrew toolchain. I thought the conda Ghostscript would be good enough, but when I ran a test script I got the following error:
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
Traceback (most recent call last):
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 260, in <module>
    libgs = cdll.LoadLibrary("libgs.so")
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/ctypes/__init__.py", line 442, in LoadLibrary
    return self._dlltype(name)
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(libgs.so, 6): image not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
    **kwargs
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/parsers/lattice.py", line 402, in extract_tables
    self._generate_image()
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/parsers/lattice.py", line 211, in _generate_image
    from ..ext.ghostscript import Ghostscript
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
    from . import _gsprint as gs
  File "/Users/jameshall/opt/anaconda3/envs/camelot/lib/python3.7/site-packages/camelot/ext/ghostscript/_gsprint.py", line 267, in <module>
    raise RuntimeError("Please make sure that Ghostscript is installed")
RuntimeError: Please make sure that Ghostscript is installed
Looking at the conda package for Ghostscript I determined it only delivered the userland binaries and not the fonts and libraries. I have opened an issue with conda packaging team and asked that the binaries and fonts be delivered.
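A quick way to check for the shared library, as opposed to the gs binary, is `ctypes.util.find_library`. The sketch below is only a diagnostic under that assumption; the `find_ghostscript_library` helper name is hypothetical and is not part of Camelot.

```python
from ctypes.util import find_library

def find_ghostscript_library():
    """Return the name or path of the Ghostscript shared library, or None."""
    # "gs" resolves to libgs.so on Linux and libgs.dylib on macOS when present.
    return find_library("gs")

lib = find_ghostscript_library()
if lib is None:
    print("Ghostscript library not found; the gs binary alone is not enough")
else:
    print("Ghostscript library found:", lib)
```

If this prints that the library was not found while `gs` still runs from the shell, the install most likely ships only the userland binaries, which matches the situation described above.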
Workaround: Install Ghostscript using the Homebrew tool chain
Owners of Camelot may argue (rightfully) that this is a case of "pilot error / not following the docs". I would just suggest that adding a note might prevent what looks like a common pilot-error situation (see here and here). The Homebrew dependency is also a pretty heavy lift (you need Xcode for Homebrew to work, so there is a lot to download and configure to get going with Camelot).
Thanks for a great tool!
Environment
- macOS Catalina 10.15.6
- Python version: Python 3.7.9 (default, Aug 31 2020, 07:22:35)
- Ghostscript version: 9.22
- Camelot version: 0.8.2
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfpage'.
opened by Oleh-Hrebchuk 10
import camelot
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/camelot/__init__.py", line 6, in <module>
    from .io import read_pdf
  File "/usr/local/lib/python3.7/site-packages/camelot/io.py", line 5, in <module>
    from .handlers import PDFHandler
  File "/usr/local/lib/python3.7/site-packages/camelot/handlers.py", line 9, in <module>
    from .parsers import Stream, Lattice
  File "/usr/local/lib/python3.7/site-packages/camelot/parsers/__init__.py", line 3, in <module>
    from .stream import Stream
  File "/usr/local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 10, in <module>
    from .base import BaseParser
  File "/usr/local/lib/python3.7/site-packages/camelot/parsers/base.py", line 5, in <module>
    from ..utils import get_page_layout, get_text_objects
  File "/usr/local/lib/python3.7/site-packages/camelot/utils.py", line 17, in <module>
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfpage' (/usr/local/lib/python3.7/site-packages/pdfminer/pdfpage.py)
I see this error after building the project. I also see that the package pdfminer.six-20200720 was released today.
Report a Bug in 'table_regions'
opened by Yichen0975 6
Describe the bug
The 'table_regions' parameter does not work when reading a PDF with Camelot.
When I use 'table_regions', I only ever get an all-or-nothing ('QUANTAL') result: a. Camelot reads the whole table and the whole PDF; or b. it reports 'ZeroDivisionError: float division by zero'. (I'm sure I use the correct bbox rules.)
Steps to reproduce the bug
a. Read a PDF with complex tables using Camelot;
b. Set flavor = 'stream' and pass parameters for 'table_regions';
c. Print the table/tables (type: DataFrame) while changing the parameters for 'table_regions';
d. Result: Camelot reads the whole PDF (even though the parameters cover only parts of it) or reads nothing.
Expected behavior
Different parts of the PDF are read when I change the parameters for 'table_regions'.
Code
import camelot
table = camelot.read_pdf('PATH', flavor='stream', table_regions=['x1,y1,x2,y2'])
print(table[0].df)
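Camelot's documentation describes each `table_regions` entry as a string of the form x1,y1,x2,y2, where (x1, y1) is the left-top corner and (x2, y2) the right-bottom corner in PDF coordinate space. A minimal sketch of building and sanity-checking such a string follows; the `format_region` helper and the sample numbers are illustrative assumptions, and whether a malformed region explains the behavior reported above is not established.

```python
def format_region(x1, y1, x2, y2):
    """Build an 'x1,y1,x2,y2' string; (x1, y1) is left-top, (x2, y2) right-bottom."""
    if not x1 < x2:
        raise ValueError("x1 (left) must be smaller than x2 (right)")
    # PDF coordinates grow upward, so the top edge has the larger y value.
    if not y1 > y2:
        raise ValueError("y1 (top) must be larger than y2 (bottom) in PDF space")
    return f"{x1},{y1},{x2},{y2}"

region = format_region(170, 370, 560, 270)
print(region)
```

The resulting string can then be passed as one element of the `table_regions` list.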
PDF
I could not attach the PDF since it's confidential.
Environment
- OS: MacOS Big Sur 11.4
- Python version: 3.9
- Numpy version: 1.20.3
- OpenCV version: 4.5.2.52
- Ghostscript version: 0.7
- Camelot version: 0.8.2
UnicodeEncodeError when using Stream flavor
opened by stpete111 6
Python 3.7 on Windows
Using this pdf: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf
I am running it through Camelot to convert to HTML using the Stream flavor, and I get the following error at execution of the export line, once it reaches page 4 of 8:
"UnicodeEncodeError - 'charmap' codec can't encode character '\u2010' in position y: character maps to undefined."
Pages 1 through 3 get converted nicely - it crashes somewhere between page 4 and 5. In debug with the breakpoint after the tables.export line, it also brings me to line 19 of cp1252.py, if that's helpful.
I am on Windows, and this seems not to be an issue on Mac. But Windows is our environment so I have to figure this out. I have done a ton of research on this error and everything for this in the Python world points to adding either encoding="utf-8" or errors="ignore", but those both relate to the file.read method and can't be used in Camelot's export method.
Any thoughts on what I could add to the script to get around this error? We can't avoid using Windows, and this seems to be the final blocker for us for being able to really make great use of this tool for our PDFs.
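One possible way around the default cp1252 codec is to write the output yourself with an explicit encoding. The sketch below is not Camelot's export path: the `html` string stands in for something like `tables[0].df.to_html()` (pandas returns a string when no buffer is given), and the `write_text_utf8` helper is a hypothetical name.

```python
import os
import tempfile

def write_text_utf8(path, text):
    """Write text with an explicit UTF-8 encoding, independent of the OS default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

html = "<td>2013\u20102014</td>"  # contains U+2010, the character from the error

# Reproduce the failure mode: cp1252 has no mapping for U+2010.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

path = os.path.join(tempfile.mkdtemp(), "table.html")
write_text_utf8(path, html)
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()
print(cp1252_ok, round_trip == html)
```

The same idea applies per table: convert each table's DataFrame to a string and write it with `encoding="utf-8"` instead of relying on the platform default.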
ValueError: max() arg is an empty sequence
opened by SebastianDeLaile 6
When running on this document (https://www.qao.qld.gov.au/sites/qao/files/annual-reports/annual_report_2016-17.pdf), when it reaches page 4, it throws the following ValueError:
import camelot
camelot.read_pdf(path, pages='3', flavor='stream')
Traceback (most recent call last):
  File "", line 2, in <module>
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\io.py", line 117, in read_pdf
    **kwargs
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\handlers.py", line 172, in parse
    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 458, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "C:\Users\sdelail\AppData\Local\Continuum\anaconda3\envs\Financial_Extraction\lib\site-packages\camelot\parsers\stream.py", line 349, in _generate_columns_and_rows
    ncols = max(set(elements), key=elements.count)
ValueError: max() arg is an empty sequence
Easy enough to capture with a try/except, but I thought I would pop it up here to let you know.
Thanks for writing this package, excellent work!
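The failing expression in the traceback, `max(set(elements), key=elements.count)`, raises whenever `elements` is empty. A minimal sketch of a guarded version is shown below; the `most_common_count` helper and its `default` value are assumptions for illustration, not the fix adopted by Camelot.

```python
def most_common_count(elements, default=0):
    """Return the most frequent value in `elements`, or `default` if it is empty."""
    if not elements:
        return default
    return max(set(elements), key=elements.count)

print(most_common_count([3, 4, 4, 5]))  # -> 4
print(most_common_count([]))            # -> 0
```

Callers who cannot patch the library can get the same effect by wrapping `camelot.read_pdf` in a try/except for ValueError, as the report suggests.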
Add pdftopng and use ghostscript as fallback
opened by vinayak-mehta 5
Raised this PR just to run tests. This change would have to be handled in a backwards-compatible way by adding support for multiple image conversion backends which the user could specify (ghostscript, mupdf, etc.), and setting pdftopng as the default one.
[MRG] Use correct re.sub signature
opened by pevisscher 5
text_strip currently passes the regex flags as the count parameter. The flag is hardcoded to re.UNICODE (value 32), and thus only the first 32 matches are replaced.
See https://docs.python.org/3/library/re.html#re.sub for the signature.
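The effect of the misplaced argument can be reproduced with the standard library alone. In this sketch the buggy call is written as `count=int(re.UNICODE)`, which is what passing the flag in the fourth positional slot amounts to; the sample text is made up for illustration.

```python
import re

text = "a" * 40
pattern = "a"

# Equivalent to passing re.UNICODE as the fourth positional argument:
# the flag's integer value (32) is interpreted as `count`.
buggy = re.sub(pattern, "", text, count=int(re.UNICODE))

# Flags passed by keyword: every match is replaced.
fixed = re.sub(pattern, "", text, flags=re.UNICODE)

print(len(buggy), len(fixed))  # -> 8 0
```

Passing `flags` by keyword avoids the ambiguity entirely, since `count` and `flags` are adjacent parameters in the `re.sub` signature.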
Fixed library discovery on Windows
opened by KOLANICH 5
When gsdll.dll is not found in the usual place in the Windows registry, ctypes.util.find_library() is used as a fallback to search %PATH%, before raising RuntimeError.
parsing PDF with rows in different sizes
opened by alissonsv 4
Hello, I've been trying to parse this PDF for a while and have tried almost all configs; I have even been reading the code to find a solution.
As you can see, the date in the first column is not at the top of the cell, so the table gets parsed like this (the ... are the cells that I had to censor):
| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | 30/07/2021 | ... | 052309 | 180,00 | 258.929,59 |
| 1 | | ... | | | |
| 2 | | ... | | | |
| 3 | 30/07/2021 | ... | 041165 | 455,89 | 259.385,48 |
| 4 | | ... | | | |
| 5 | | ... | | | |
| 6 | 30/07/2021 | ... | 052440 | 180,00 | 259.565,48 |
| 7 | | ... | | | |
| 8 | | ... | | | |
| 9 | 30/07/2021 | ... | 052234 | 180,00 | 259.745,48 |
| 10 | | ... | | | |
| 11 | | ... | | | |
| 12 | 30/07/2021 | ... | 863314 | 202,17 | 259.947,65 |
| 13 | | ... | | | |
| 14 | | ... | | | |
| 15 | 30/07/2021 | ... | 875321 | 15,00 | 259.962,65 |
| 16 | | ... | | | |
| 17 | | ... | | | |
| 18 | 30/07/2021 | | 224723423 | 576,25 | 260.538,90 |
| 19 | | ... | | | |
| 20 | | ... | | | |
| 21 | 30/07/2021 | ... | 873665 | 30,00 | 260.568,90 |
| 22 | | ... | | | |
| 23 | | ... | | | |
| 24 | 30/07/2021 | ... | SI01053 | -15.000,00 | 245.568,90 |
| 25 | | ... | | | |
| 26 | 30/07/2021 | ... | | -309,79 | 245.259,11 |
| 27 | 30/07/2021 | ... | | -3.120,10 | 242.139,01 |
| 28 | 30/07/2021 | ... | | -141,48 | 241.997,53 |
| 29 | 30/07/2021 | ... | | -3.089,90 | 238.907,63 |
| 30 | 30/07/2021 | ... | | -1.150,99 | 237.756,64 |
| 31 | 30/07/2021 | ... | | -383,08 | 237.373,56 |
| 32 | 30/07/2021 | ... | | -9.456,24 | 227.917,32 |
| 33 | 30/07/2021 | ... | | -570,00 | 227.347,32 |
| 34 | 30/07/2021 | ... | | -820,00 | 226.527,32 |
| 35 | 30/07/2021 | ... | | -1.487,99 | 225.039,33 |
| 36 | 30/07/2021 | ... | | -1.021,67 | 224.017,66 |
| 37 | 30/07/2021 | ... | | -965,00 | 223.052,66 |
| 38 | 30/07/2021 | ... | | -871,12 | 222.181,54 |
| 39 | 30/07/2021 | ... | | -2.441,50 | 219.740,04 |
| 40 | | ... | | | |
| 41 | 30/07/2021 | | SEFAZMT-C | -933,10 | 218.806,94 |
| 42 | | ... | | | |
| 43 | | ... | | | |
| 44 | 30/07/2021 | | DARF81COO | -15.037,94 | 203.769,00 |
| 45 | | ... | | | |
| 46 | | ... | | | |
| 47 | | ... | | | |
| 48 | 30/07/2021 | | SI01268 | -3.946,03 | 199.822,97 |
| 49 | | ... | | | |
| 50 | | ... | | | |
| 51 | | ... | | | |
| 52 | | ... | | | |
| 53 | 30/07/2021 | | I01299 | -1.877,45 | 197.945,52 |
| 54 | | ... | | | |
| 55 | | ... | | | |
| 56 | | ... | | | |
| 57 | 30/07/2021 | ... | I01307 | -2.636,85 | 195.308,67 |
| 58 | | ... | | | |
| 59 | | ... | | | |
| 60 | 30/07/2021 | | DARF81COO | -8.485,45 | 186.823,22 |
| 61 | | ... | | | |
| 62 | | ... | | | |
| 63 | | ... | | | |
| 64 | 30/07/2021 | | CX198172 | 50,00 | 186.873,22 |
| 65 | | ... | | | |
| 66 | | ... | | | |
| 67 | | ... | | | |
Sorry for the long table; that was the clearest way I found to show it.
So, is there a way to parse the second column together in one row with the others? row_tol doesn't work because the rows have different sizes across the pages.
It would be great if there were a way to join them into rows by their colors, as the rows are in a striped style.
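One post-processing idea, independent of Camelot's own settings, is to fold rows with an empty first column into the preceding row that has a date. This is only a sketch under that assumption: it expects the keyed row to come first in each group, which may not hold on every page, and the `merge_continuation_rows` helper and the sample text cells are hypothetical.

```python
def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold rows whose key column is empty into the preceding keyed row."""
    merged = []
    for row in rows:
        if row[key_col].strip() or not merged:
            merged.append(list(row))  # start a new logical record
            continue
        last = merged[-1]
        for i, cell in enumerate(row):
            if cell.strip():
                last[i] = (last[i] + sep + cell).strip()
    return merged

# Stand-in rows shaped like a list-of-lists table; the text cells are made up.
rows = [
    ["30/07/2021", "line one", "052309", "180,00"],
    ["", "line two", "", ""],
    ["", "line three", "", ""],
    ["30/07/2021", "other", "041165", "455,89"],
]
for record in merge_continuation_rows(rows):
    print(record)
```

The same function could be run over the rows of an extracted table before building a DataFrame from the merged records.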
Thanks in advance!
How gsdll on Windows?
opened by andkirby 4
Hi team,
I really need help here: https://stackoverflow.com/questions/69064465/how-to-feed-ghostscript-dll-library-to-python-in-windows I have installed the Ghostscript app for Windows, but Python still does not "see" it. Details are at the link.
Thanks in advance
We need more maintainers
opened by MartinThoma 2
It seems like camelot is dead:
- Last commit: 2021-07-11 - @dimitern is the only other project owner besides @vinayak-mehta
- Last PyPI release: 2021-07-11 - @vinayak-mehta is the only owner
- Several PRs which look ready to be merged, but are still open
Besides the owner there are only 35 other contributors.
https://opencollective.com/camelot might be another way to check if it's dead.
Does anybody know more? Should we try to transfer the project to https://github.com/jazzband ?
While uploading PDF Camelot is unable to read its content
opened by saidakyuz 0
While trying to read a PDF using Camelot, it is unable to read the table; I am getting only the data in column 0 and nothing else.
Steps to reproduce the bug
Screenshots
Environment
- OS: Windows
- Python version: 3.9.12
- Numpy version: 1.22.3
- OpenCV version: 4.5.5.64
- Ghostscript version: 0.7
- Camelot version: 0.10.1
Link for PDF
https://www.irf.com/product-info/datasheets/data/irhm9150.pdf
No module named ghostscript
opened by HepaxCodex 0
After following the Camelot instructions (and a few other dead ends), Python is unable to find the ghostscript module.
Suggest:
- Adding a check during install if the ghostscript python api is installed.
- Updating the instructions and/or install process if appropriate
Steps to reproduce the bug
Installation:
brew update; brew upgrade; # Upgrade and update homebrew
brew install ghostscript
conda -v -n my_env -c conda-forge camelot-py
- Note, this also appears to install Anaconda's ghostscript!
Steps to be used to reproduce behavior:
python3 -c "from ctypes.util import find_library; print(find_library(\"gs\"))" # Outside of conda env
/usr/local/lib/libgs.dylib@ -> ../Cellar/ghostscript/10.0.0/lib/libgs.dylib
python3 -c "import ghostscript"
Traceback (most recent call last): File "<string>", line 1, in <module> ModuleNotFoundError: No module named 'ghostscript'
conda activate my_env
python -c "from ctypes.util import find_library; print(find_library(\"gs\"))" # Per Camelot Docs
/Users/<username>/opt/anaconda3/envs/finance/bin/../lib/libgs.dylib
python -c "import ghostscript"
Traceback (most recent call last): File "<string>", line 1, in <module> ModuleNotFoundError: No module named 'ghostscript'
Expected behavior
- ghostscript imports successfully
- OR ... some sort of error is thrown during install to notify the user of missing deps
Code
import camelot tables = camelot.read_pdf("./example.pdf")
Environment
- OS: macOS 12.4
uname -a
Darwin mylappy.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64
- Python version (Conda complete dev env): 3.9.15
- Python version (Conda standalone env): 3.11.0
- Python version (System): 3.10.9
- Numpy version (Conda complete dev env): 1.24.0
- Numpy version (Conda standalone env): 1.24.1
- OpenCV version (Conda, both envs): 4.6.0
- Ghostscript version (Conda, both envs): 9.54
- Ghostscript version (System): 10.0
- Camelot version (Conda, both envs): 0.10.1
Additional context
I have a development environment in conda with more deps, and I also replicated this with a fresh env; hopefully the delineation is clear in the environment specs.
PdfFileReader is deprecated and was removed in PyPDF2 3.0.0
opened by szeswee 20
Describe the bug
Version 3.0.0 of PyPDF2 was just released today (23 Dec 2022), which includes a breaking change that removes PdfFileReader (see changelog). As a result, all new installs and usage of camelot-py will raise the following exception:
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    camelot.read_pdf(PDF_FILE_PATH)
  File ".venv/py37/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
    **kwargs
  File ".venv/py37/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
    self._save_page(self.filepath, p, tempdir)
  File ".venv/py37/lib/python3.7/site-packages/camelot/handlers.py", line 111, in _save_page
    infile = PdfFileReader(fileobj, strict=False)
  File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1974, in __init__
    deprecation_with_replacement("PdfFileReader", "PdfReader", "3.0.0")
  File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_utils.py", line 369, in deprecation_with_replacement
    deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))
  File ".venv/py37/lib/python3.7/site-packages/PyPDF2/_utils.py", line 351, in deprecation
    raise DeprecationError(msg)
PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
Steps to reproduce the bug
- Create a new virtualenv
- Install camelot-py: pip install camelot-py[base]
- Run the following code:
import camelot
# replace with a valid path on your local filesystem
PDF_FILE_PATH = "/path/to/file.pdf"
# raises an exception from PyPDF2
camelot.read_pdf(PDF_FILE_PATH)
Expected behavior
The code above should execute without any exceptions.
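Whether an environment is affected can be checked before calling Camelot by looking at the installed PyPDF2 major version. This is a minimal diagnostic sketch using `importlib.metadata` (Python 3.8+); the helper names are hypothetical and it does not change Camelot's behavior.

```python
from importlib import metadata

def major_version(version_string):
    """Return the leading integer of a version string such as '3.0.0'."""
    return int(version_string.split(".")[0])

def installed_major(package):
    """Return the installed major version of `package`, or None if absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None

major = installed_major("PyPDF2")
if major is None:
    print("PyPDF2 is not installed")
elif major >= 3:
    print("PyPDF2 >= 3.0.0: PdfFileReader raises DeprecationError")
else:
    print("PyPDF2 < 3.0.0: PdfFileReader is still available")
```

If the check reports 3 or higher, installing a PyPDF2 release below 3.0.0 in that environment is one way to keep the affected Camelot version working.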
Environment
- OS: macOS 12.3.1
- Python version: 3.7
- Numpy version: 1.24.0
- OpenCV version: 4.6.0.66
- Ghostscript version: 0.7
- Camelot version: 0.10.1
Camelot returns tables that contain no text (Where text should be detectable)
opened by peletiah 0
I'm trying to extract data from some ~900 certificates. These certificates have an identical visual structure, but are published by different parties. For the majority of files the extraction works. However, for several dozen files, the table structure returned by Camelot contains only empty strings.
Plotting grid and text shows content is detected (e.g. Table 7 in DS_3663.pdf):
I'm using this command to read the PDF and create the tables:
>>> tables=camelot.read_pdf('pdfs/DS_3663.pdf', pages='1-end', line_scale=110, shift_text=[''])
e.g. Table 7 contains this data:
>>> tables[7].data
[['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', '', '', '', '', '']]
Here are a few more example pdfs where the extraction fails in an identical manner: DS_885.pdf DS_2481.pdf DS_2083.pdf
Parsing all of these files with pdf2txt.py successfully extracts text, so I assume it should be possible to get a result with Camelot as well.
Environment
- OS: Ubuntu 22.04.1 LTS
- Python version: 3.10.6
- Numpy version: 1.23.4
- OpenCV version: 4.6.0.66
- Ghostscript version: 9.55.0
- Camelot version: 0.9.0
I've tried debugging this, but had difficulties understanding the intricate code in the bbox sections. From what I've figured out, it appears to me that Camelot is unable to marry horizontal_text (which contains the relevant text) with the line grid.
Owner
Camelot and Excalibur: PDF Table Extraction for Humans
A python library for extracting text from PDFs without losing the formatting of the PDF content.
Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins
49 Nov 7, 2022
Auto Convert PDFs to png files in python
This python tool, which is an application of PyMuPDF module, could auto convert PDFs to png files
4 Dec 5, 2021
Pdfencrypt is a tool to encrypt/lock PDFs
Pdfencrypt Pdfencrypt is a tool to encrypt/lock PDFs Installation $ apt update $ apt upgrade $ apt install git $ apt install python $ git clone https:
5 Nov 28, 2021
pdf_sprinkles: sprinkles text in your PDFs
pdf_sprinkles: sprinkles text in your PDFs pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI, and returns the result as a PDF with searc
2 Dec 17, 2021
Scans pdfs for links written in plaintext and checks if they are active or returns an error code.
Scans pdfs for links written in plaintext and checks if they are active or returns an error code. It then generates a report of its findings. Extract references (pdf, url, doi, arxiv) and metadata from a PDF.
22 Nov 21, 2022
Extract the table in the PDF,outputs the data similar to the json format
extract the table in the PDF,outputs the data similar to the json format
3 Nov 25, 2021
This book will take you on an exploratory journey through the PDF format, and the borb Python library.
This book will take you on an exploratory journey through the PDF format, and the borb Python library.
281 Jan 1, 2023
Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator
Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene
1.9k Jan 1, 2023
Telegram bot that can do a lot of things related to PDF files.
Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif
130 Dec 26, 2022
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input
pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark
387 Dec 10, 2022
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.
5k Jan 4, 2023
borb is a library for reading, creating and manipulating PDF files in python.
borb is a library for reading, creating and manipulating PDF files in python.
2.9k Jan 1, 2023
x-ray is a Python library for finding bad redactions in PDF documents.
A tool to detect whether a PDF has a bad redaction
73 Dec 19, 2022
pikepdf is a Python library for reading and writing PDF files.
A Python library for reading and writing PDF, powered by qpdf
1.6k Jan 3, 2023
Python bindings for MuPDF's rendering library.
PyMuPDF 1.19.3 Release date: December 15, 2021 On PyPI since August 2016: Author Jorj X. McKie, based on original code by Ruikai Liu. Introduction PyM
0 Nov 3, 2022
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.
Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.
76 Dec 12, 2022
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi
4.9k Jan 4, 2023
PyMuPDF is a Python binding with support for MuPDF
PyMuPDF is a Python binding with support for MuPDF (current version 1.18.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
1.9k Jan 3, 2023
A Python tool to generate a static HTML file that represents the internal structure of a PDF file
PDFSyntax A Python tool to generate a static HTML file that represents the internal structure of a PDF file At some point the low-level functions deve
394 Dec 30, 2022