Excalibur: A web interface to extract tabular data from PDFs

Overview

Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It is powered by Camelot.

Note: Excalibur only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Using Excalibur

Note: You need to install ghostscript before moving forward.

After installing Excalibur with pip, you need to initialize the metadata database using:

$ excalibur initdb

And then start the webserver using:

$ excalibur webserver

That's it! Now you can go to http://localhost:5000 and start extracting tabular data from your PDFs.

  1. Upload a PDF and enter the page numbers you want to extract tables from.

  2. Go to each page and select the table by drawing a box around it. (You can choose to skip this step since Excalibur can automatically detect tables on its own. Click on "Autodetect tables" to see what Excalibur sees.)

  3. Choose a flavor (Lattice or Stream) from "Advanced".

    a. Lattice: For tables formed with lines.

    b. Stream: For tables formed with whitespace.

  4. Click on "View and download data" to see the extracted tables.

  5. Select your favorite format (CSV/Excel/JSON/HTML) and click on "Download"!
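
For readers who prefer scripting, the choices made in steps 1 to 3 correspond roughly to keyword arguments of Camelot's `read_pdf` function. The sketch below is an illustration only, not Excalibur's internal code: `build_read_pdf_kwargs` and `extract_to_csv` are hypothetical helper names, and it assumes a Camelot version whose `read_pdf` accepts `pages`, `flavor`, and `table_areas`.

```python
VALID_FLAVORS = ("lattice", "stream")


def build_read_pdf_kwargs(pages, flavor="lattice", table_areas=None):
    """Collect the UI choices (pages, flavor, optional boxes) as keyword arguments."""
    if flavor not in VALID_FLAVORS:
        raise ValueError(f"flavor must be one of {VALID_FLAVORS}, got {flavor!r}")
    kwargs = {"pages": pages, "flavor": flavor}
    if table_areas:
        # Assumption: each area is an "x1,y1,x2,y2" string in PDF coordinates.
        kwargs["table_areas"] = list(table_areas)
    return kwargs


def extract_to_csv(pdf_path, csv_path, **kwargs):
    """Run Camelot on pdf_path and export the detected tables as CSV."""
    import camelot  # imported lazily so the helper above works without Camelot

    tables = camelot.read_pdf(pdf_path, **kwargs)
    tables.export(csv_path, f="csv")
    return tables


if __name__ == "__main__":
    print(build_read_pdf_kwargs("1,3", flavor="stream"))
```

With Camelot installed, something like `extract_to_csv("report.pdf", "report.csv", **build_read_pdf_kwargs("1"))` would be the scripted counterpart of steps 1 to 5; the file names are placeholders.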

Note: You can also download executables for Windows and Linux from the releases page and run them directly!

Why Excalibur?

  • Extracting tables from PDFs is hard. A simple copy-and-paste from a PDF into Excel doesn't preserve table structure. Excalibur makes PDF table extraction easy by automatically detecting tables in PDFs and letting you save them as CSV and Excel files.
  • Excalibur uses Camelot under the hood, which gives you additional settings to tweak table extraction and get the best results. You can see how it performs better than other open-source tools and libraries in this comparison.
  • You can save table extraction settings (like table areas) for a PDF once, and apply them on new PDFs to extract tables with similar structures.
  • You get complete control over your data. All file storage and processing happens on your own local or remote machine.
  • Excalibur can be configured with MySQL and Celery for parallel and distributed workloads. By default, SQLite and multiprocessing are used for sequential workloads.

Installation

Using pip

After installing ghostscript, which is one of the requirements for Camelot (See install instructions), you can simply use pip to install Excalibur:

$ pip install excalibur-py

From the source code

After installing ghostscript, clone the repo using:

$ git clone https://www.github.com/camelot-dev/excalibur

and install Excalibur using pip:

$ cd excalibur
$ pip install .

Documentation

Fantastic documentation is available at http://excalibur-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/excalibur

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install excalibur-py[dev]

Testing (soon)

After installation, you can run tests using:

$ python setup.py test

Versioning

Excalibur uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.

Support the development

You can support our work on Excalibur with a one-time or monthly donation on OpenCollective. Organizations who use Excalibur can also sponsor the project for an acknowledgement on our official site and this README.

Special thanks to all the users and organizations that support Excalibur!

Comments
  • ImportError: cannot import name 'secure_filename' after `excalibur initdb`

    I can't start the database after running the command: excalibur initdb

    I get this error:

    ~$ excalibur initdb
    Creating new Excalibur configuration file in: /home/localhost/excalibur/excalibur.cfg
    Traceback (most recent call last):
      File "/home/localhost/.local/bin/excalibur", line 5, in <module>
        from excalibur.cli import cli
      File "/home/localhost/.local/lib/python3.6/site-packages/excalibur/cli.py", line 12, in <module>
        from .www.app import create_app
      File "/home/localhost/.local/lib/python3.6/site-packages/excalibur/www/app.py", line 7, in <module>
        from .views import views
      File "/home/localhost/.local/lib/python3.6/site-packages/excalibur/www/views.py", line 10, in <module>
        from werkzeug import secure_filename
    ImportError: cannot import name 'secure_filename'

    It seems that the library mentioned is already installed.

    ~$ pip3 install werkzeug
    Requirement already satisfied: werkzeug in ./.local/lib/python3.6/site-packages (1.0.0)
    WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.

    Any tip?

    opened by belisards 9
  • AttributeError Nonetype for 'job_id'

    Here's the printout of the problem. I'm getting a 500 Internal Server Error. I'm on Python 3.7. Camelot works fine for me (I can parse, read, and export with no problems). I just have a problem with running Excalibur.

     * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
    127.0.0.1 - - [06/Nov/2018 21:30:22] "GET / HTTP/1.1" 302 -
    [2018-11-06 21:30:22,663] ERROR in app: Exception on /files [GET]
    Traceback (most recent call last):
      File "d:\python3.7\lib\site-packages\flask\app.py", line 2292, in wsgi_app
        response = self.full_dispatch_request()
      File "d:\python3.7\lib\site-packages\flask\app.py", line 1815, in full_dispatch_request
        rv = self.handle_user_exception(e)
      File "d:\python3.7\lib\site-packages\flask\app.py", line 1718, in handle_user_exception
        reraise(exc_type, exc_value, tb)
      File "d:\python3.7\lib\site-packages\flask\_compat.py", line 35, in reraise
        raise value
      File "d:\python3.7\lib\site-packages\flask\app.py", line 1813, in full_dispatch_request
        rv = self.dispatch_request()
      File "d:\python3.7\lib\site-packages\flask\app.py", line 1799, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
      File "d:\python3.7\lib\site-packages\excalibur\www\views.py", line 39, in files
        'job_id': job.job_id,
    AttributeError: 'NoneType' object has no attribute 'job_id'

    Here's the code that contains the 'job_id' line 39 from the www\views.py file

    @views.route('/files', methods=['GET', 'POST'])
    def files():
        if request.method == 'GET':
            files_response = []
            session = Session()
            for file in session.query(File).order_by(File.uploaded_at.desc()).all():
                job = session.query(Job).filter(Job.file_id == file.file_id).order_by(Job.started_at.desc()).first()
                files_response.append({
                    'file_id': file.file_id,
                    'job_id': job.job_id,
                    'uploaded_at': file.uploaded_at.strftime('%Y-%m-%dT%H:%M:%S'),
                    'filename': file.filename
                })


    Any thoughts or am I making some stupid mistakes here?
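
    The traceback is consistent with the job query returning `None` for a file that has no job row yet, since a query ending in `.first()` yields `None` when nothing matches. The sketch below reproduces that situation with plain dataclasses instead of the real SQLAlchemy models and shows one possible guard. `latest_job_for` and `build_files_response` are hypothetical names, and this is not presented as the maintainers' fix.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class File:
    file_id: str
    filename: str
    uploaded_at: datetime


@dataclass
class Job:
    job_id: str
    file_id: str


def latest_job_for(file_id, jobs):
    """Stand-in for the query ending in .first(): last matching job, or None."""
    matches = [job for job in jobs if job.file_id == file_id]
    return matches[-1] if matches else None


def build_files_response(files, jobs):
    """Build the /files payload without assuming every file has a job."""
    response = []
    for file in files:
        job = latest_job_for(file.file_id, jobs)
        response.append({
            "file_id": file.file_id,
            # Guard: a file without any job yields None instead of raising.
            "job_id": job.job_id if job is not None else None,
            "uploaded_at": file.uploaded_at.strftime("%Y-%m-%dT%H:%M:%S"),
            "filename": file.filename,
        })
    return response
```

    Whether a missing job should be reported as `None` or the file skipped entirely is a design choice the thread does not settle.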

    bug 
    opened by willardgtan 9
  • Processing PDF - error message

    I am on Ubuntu 14.04 and installed excalibur-py using pip. While processing the following PDF (which is also used in camelot-py, where it works well), the system returns the following message:

    ERROR:root:'Table' object has no attribute '_bbox'
    Traceback (most recent call last):
      File "/home/sandeep/anaconda3/lib/python3.6/site-packages/excalibur/tasks.py", line 96, in split
        x1, y1, x2, y2 = tables[0]._bbox
    AttributeError: 'Table' object has no attribute '_bbox'

    Refreshing does not change anything. If I click on "Excalibur", I get this message back: "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."

    The PDF file used (background_lines.pdf) is attached.

    opened by sandeepraizada 7
  • Error on Windows: OSError: exception: access violation writing 0x0967BC48 while running python-Excalibur code

    Camelot/Excalibur throws an OSError: access violation writing 0x0967BC48. OS: Windows 10. Python version: 3.7.

    Below is the console output:

     * Debug mode: off
     * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
    127.0.0.1 - - [02/Jul/2019 19:03:54] "GET /files HTTP/1.1" 200 -
    127.0.0.1 - - [02/Jul/2019 19:04:10] "POST /files HTTP/1.1" 200 -
    127.0.0.1 - - [02/Jul/2019 19:04:10] "GET /workspaces/59f1c984-31fa-4ade-b944-770072f82827 HTTP/1.1" 200 -
    127.0.0.1 - - [02/Jul/2019 19:04:19] "GET /workspaces/59f1c984-31fa-4ade-b944-770072f82827 HTTP/1.1" 200 -
    127.0.0.1 - - [02/Jul/2019 19:04:20] "GET /static/favicon.ico HTTP/1.1" 200 -
    ERROR:root:exception: access violation writing 0x0967BC48
    Traceback (most recent call last):
      File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\excalibur\tasks.py", line 44, in split
        with Ghostscript(*gs_call, stdout=null) as gs:
      File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 93, in Ghostscript
        stderr=kwargs.get('stderr', None))
      File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\__init__.py", line 39, in __init__
        rc = gs.init_with_args(instance, args)
      File "c:\users\comp7\appdata\local\programs\python\python37-32\lib\site-packages\camelot\ext\ghostscript\_gsprint.py", line 167, in init_with_args
        rc = libgs.gsapi_init_with_args(instance, len(argv), c_argv)
    OSError: exception: access violation writing 0x0967BC48

    opened by swaraj20 6
  • GhostscriptNotFound: Please make sure that Ghostscript is installed and available on the PATH environment variable

    OS: Windows 10

    After downloading excalibur, I had to download/install ghostscript (this should be stated in instructions).

    After installing ghostscript, the PATH needs to be set. After setting PATH, the error still exists.

    PATH set: C:\Program Files\gs\gs9.26\bin\gswin64c.exe (I've restarted)
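
    One thing worth checking here: entries on PATH are directories, so an entry that ends in `gswin64c.exe` would not be searched the way the reporter intends. The sketch below derives the directory from the executable path quoted above and looks for a Ghostscript executable on the current PATH. `path_entry_for` and `find_ghostscript` are hypothetical helpers and the list of executable names is an assumption. Camelot may look for the Ghostscript library rather than the executable, so this is only a partial diagnostic, and the thread does not confirm that it resolves the error.

```python
import ntpath
import shutil


def path_entry_for(executable_path):
    """Return the directory that belongs on PATH for a Windows executable path."""
    return ntpath.dirname(executable_path)


def find_ghostscript(names=("gs", "gswin64c", "gswin32c")):
    """Return the first Ghostscript executable found on PATH, or None."""
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    return None


if __name__ == "__main__":
    print(path_entry_for(r"C:\Program Files\gs\gs9.26\bin\gswin64c.exe"))
    print(find_ghostscript())
```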

    opened by majestique 5
  • werkzeug.utils not werkzeug

    Excalibur would not start on my Fedora 29 Linux system. Here was the message:

    .local/bin/excalibur
    Traceback (most recent call last):
      File ".local/bin/excalibur", line 5, in <module>
        from excalibur.cli import cli
      File "/home/thorsten/.local/lib/python3.6/site-packages/excalibur/cli.py", line 12, in <module>
        from .www.app import create_app
      File "/home/thorsten/.local/lib/python3.6/site-packages/excalibur/www/app.py", line 7, in <module>
        from .views import views
      File "/home/thorsten/.local/lib/python3.6/site-packages/excalibur/www/views.py", line 10, in <module>
        from werkzeug import secure_filename
    ImportError: cannot import name 'secure_filename'

    After some googling I fixed it myself by editing line 10 in views.py to read

    from werkzeug.utils import secure_filename

    This seems like a very simple issue to fix.
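
    A version-tolerant variant of that one-line change could try the new location first and fall back to the old one. The sketch below shows the pattern with a small hypothetical helper, `import_attr`, and demonstrates it with the standard library so that it runs without Werkzeug installed. It is an illustration of the idea, not the change that was merged upstream.

```python
import importlib


def import_attr(name, module_paths):
    """Return attribute `name` from the first module in module_paths that provides it."""
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import name {name!r} from any of {list(module_paths)}")


if __name__ == "__main__":
    # Demonstrated with the standard library so the sketch runs anywhere.
    OrderedDict = import_attr("OrderedDict", ["no_such_module", "collections"])
    print(OrderedDict)
```

    For this issue the call would look like `secure_filename = import_attr("secure_filename", ["werkzeug.utils", "werkzeug"])`.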

    After that, the program worked brilliantly. I found it on

    https://hackernoon.com/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3cc7f22e1

    John Thorstensen

    opened by jrthorstensen 4
  • Change the import from werkzeug

    With Python 3.6 and Werkzeug 1.0 on Ubuntu 18.04 in WSL (Windows), Excalibur would not run because of this error:

    from werkzeug import secure_filename
    ImportError: cannot import name 'secure_filename'
    
    opened by sabas 3
  • Excalibur's data directory is created in HOME

    I consider it bad form for Excalibur to create a user-visible folder in the home folder (/Users/akx/excalibur on my Mac, for instance).

    It'd be better to use e.g. appdirs to figure out the "user data" directory, and create the Excalibur directory there.
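
    To make the suggestion concrete, here is a standard-library sketch of the kind of per-platform lookup that a library such as appdirs performs. It is a simplified stand-in and not appdirs' exact rules: the function name `user_data_dir` mirrors appdirs, but the signature (explicit `platform`, `home`, and `env` arguments) is an assumption chosen so that the result is easy to check.

```python
from pathlib import PurePosixPath, PureWindowsPath


def user_data_dir(appname, platform, home, env):
    """Return a per-user data directory for appname (simplified, hypothetical rules)."""
    if platform == "win32":
        base = env.get("LOCALAPPDATA") or str(PureWindowsPath(home) / "AppData" / "Local")
        return str(PureWindowsPath(base) / appname)
    if platform == "darwin":
        return str(PurePosixPath(home) / "Library" / "Application Support" / appname)
    # Other platforms: follow the XDG convention, defaulting to ~/.local/share.
    base = env.get("XDG_DATA_HOME") or str(PurePosixPath(home) / ".local" / "share")
    return str(PurePosixPath(base) / appname)


if __name__ == "__main__":
    import os
    import sys

    print(user_data_dir("excalibur", sys.platform, os.path.expanduser("~"), os.environ))
```

    On the reporter's Mac this would give `/Users/akx/Library/Application Support/excalibur` instead of `/Users/akx/excalibur`.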

    opened by akx 3
  • ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: character maps to <undefined>

    I am unable to share the pdf that is causing this issue. I would like to know what I can do to bypass this error.
    Even if it means dropping the "offending char". Getting some of the data is better than getting none of the data. I'd be ecstatic if this is a PEBKAC issue, so please don't discount that.

    Using the latest download of Excalibur and Python 3.7.3 (I think). Only using the web UI to do this. Don't think I could handle coding it without some hand-holding.

    This is happening on several pages of a very large pdf (700+ pages). But not all of them. So the file can be parsed. Just not the important portion, which is most of the file.

    I DID just realize that it is creating some of the output files (Excel, CSV, and JSON), but not HTML. Since I only really need the CSV or Excel, I might be good. Will keep pushing on the remainder of the file (it's slow to handle 100 pages at a time).

    127.0.0.1 - - [05/Aug/2019 16:55:09] "GET /jobs/fbfeb974-5f3d-4991-b26c-98356064
    0de5 HTTP/1.1" 200 -
    ERROR:root:'charmap' codec can't encode character '\ued6f' in position 350: char
    acter maps to <undefined>
    Traceback (most recent call last):
      File "excalibur\executors\sequential_executor.py", line 12, in execute_command
    
      File "subprocess.py", line 336, in check_call
      File "subprocess.py", line 317, in call
      File "subprocess.py", line 769, in __init__
      File "subprocess.py", line 1172, in _execute_child
    FileNotFoundError: [WinError 2] The system cannot find the file specified
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "excalibur\tasks.py", line 161, in extract
      File "lib\site-packages\camelot\core.py", line 479, in export
      File "lib\site-packages\camelot\core.py", line 437, in _write_file
      File "lib\site-packages\camelot\core.py", line 394, in to_html
      File "C:\Python37\lib\encodings\cp1252.py", line 19, in encode
    UnicodeEncodeError: 'charmap' codec can't encode character '\ued6f' in position
    350: character maps to <undefined>
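
    The second traceback passes through `cp1252.py` while writing the HTML output, which suggests the file was opened with the Windows default encoding. The sketch below only illustrates the encoding behaviour: the character from the error message cannot be represented in cp1252, while UTF-8 can encode it and `errors="replace"` substitutes a placeholder. `try_encode` is a hypothetical helper, and where such a change would belong in Camelot or Excalibur is not shown by the thread.

```python
PRIVATE_USE_CHAR = "\ued6f"  # the character named in the error message


def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding, errors=errors)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    print(try_encode(PRIVATE_USE_CHAR, "cp1252"))
    print(try_encode(PRIVATE_USE_CHAR, "cp1252", errors="replace"))
    print(try_encode(PRIVATE_USE_CHAR, "utf-8"))
```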
    
    opened by randomstability 3
  • Failure when "all" or 1-end is selected

    Excalibur struggles on large PDFs (20 pages or more) when I choose the "all" or "1-end" options. I get the following warnings:

    UserWarning: No tables found on page-144 [lattice.py:399]
    UserWarning: No tables found on page-144 [stream.py:447]
    UserWarning: No tables found in table area 1 [stream.py:361]
    UserWarning: No tables found in table area 1 [stream.py:361]
    UserWarning: No tables found in table area 2 [stream.py:361]

    However if I manually select the pages it works fine. Is there a way to solve this?
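
    Since explicit page selections reportedly work, one possible workaround is to generate explicit page ranges and submit them in batches. The sketch below is a hypothetical helper, `page_chunks`, that produces such range strings. It assumes the total page count is known, and the thread does not confirm that batching avoids the warnings.

```python
def page_chunks(total_pages, chunk_size):
    """Yield page-range strings such as "1-20", "21-40" covering 1..total_pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single trailing page is written as "41" rather than "41-41".
        yield str(start) if start == end else f"{start}-{end}"


if __name__ == "__main__":
    print(list(page_chunks(45, 20)))
```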

    opened by VAnthonyrajah 3
  • [Flavor] - Selecting flavor while extracting table which requires to process background.

    I have a PDF with multiple tables. Some cells have margins and a colored background, while others have no margins at all and differ only in background color. Whether I select Lattice or Stream as the flavor, the alignment of the extracted text is badly disturbed for the cells without margins. I also tried process_background = True, which did not solve the problem.

    Is there any way to resolve the issue?

    opened by Akhurana01 2
  • Fix for "No module named camelot.ext" error

    When I followed the instructions for using Excalibur, I ran into the following issue.

    Traceback (most recent call last):
      File "/Users/balakumaranpalanivel/.pyenv/versions/3.7.9/bin/excalibur", line 5, in <module>
        from excalibur.cli import cli
      File "/Users/balakumaranpalanivel/ReposPersonal/excaliburRoot/excalibur-fork/excalibur/cli.py", line 8, in <module>
        from .tasks import split, extract
      File "/Users/balakumaranpalanivel/ReposPersonal/excaliburRoot/excalibur-fork/excalibur/tasks.py", line 9, in <module>
        from camelot.ext.ghostscript import Ghostscript
    ModuleNotFoundError: No module named 'camelot.ext'
    

    There seem to be several different fixes suggested online, but the root cause seems to be this commit, where the ext folder was removed from Camelot while Excalibur continues to use it.

    This fix seems to be the most popular one based on Stack Overflow upvotes, and it makes sense to me. But please correct me if I am wrong.

    P.S. I had a look at the contributing guidelines; I hope I did not miss anything 🤞

    opened by balakumaranpalanivel 0
  • Bump decode-uri-component from 0.2.0 to 0.2.2 in /public

    Bumps decode-uri-component from 0.2.0 to 0.2.2.

    Release notes

    Sourced from decode-uri-component's releases.

    v0.2.2

    • Prevent overwriting previously decoded tokens 980e0bf

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.1...v0.2.2

    v0.2.1

    • Switch to GitHub workflows 76abc93
    • Fix issue where decode throws - fixes #6 746ca5d
    • Update license (#1) 486d7e2
    • Tidelift tasks a650457
    • Meta tweaks 66e1c28

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.1

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies javascript 
    opened by dependabot[bot] 0
  • Error during `excalibur initdb` on Windows 10

    
    C:\Users\user\Documents\MLReportParser>excalibur initdb
    Traceback (most recent call last):
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\Scripts\excalibur.exe\__main__.py", line 4, in <module>
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\excalibur\cli.py", line 7, in <module>
        from . import __version__, settings
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\excalibur\settings.py", line 6, in <module>
        from sqlalchemy import create_engine, exc
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\__init__.py", line 12, in <module>
        from sqlalchemy.sql import (
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\sql\__init__.py", line 7, in <module>
        from sqlalchemy.sql.expression import (
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\sql\expression.py", line 32, in <module>
        from sqlalchemy import util, exc
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\util\__init__.py", line 7, in <module>
        from .compat import callable, cmp, reduce, defaultdict, py25_dict, \
      File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlalchemy\util\compat.py", line 202, in <module>
        time_func = time.clock
    AttributeError: module 'time' has no attribute 'clock'
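
    The last frame shows the installed SQLAlchemy referencing `time.clock`, which was removed in Python 3.8, and the paths show Python 3.9. That points to an old SQLAlchemy release; upgrading SQLAlchemy or running on an older Python are possible ways out, though the thread does not confirm which one works. The sketch below only illustrates the underlying incompatibility and the usual replacement, `time.perf_counter`. The name `timer` is hypothetical.

```python
import sys
import time

# time.clock was removed in Python 3.8; fall back to perf_counter where it is missing.
timer = getattr(time, "clock", time.perf_counter)

if __name__ == "__main__":
    print(sys.version_info[:2], hasattr(time, "clock"), timer())
```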
    
    
    opened by js333031 0
  • Bump engine.io and browser-sync in /public

    Bumps engine.io to 6.2.1 and updates ancestor dependency browser-sync. These dependencies need to be updated together.

    Updates engine.io from 3.2.0 to 6.2.1

    Release notes

    Sourced from engine.io's releases.

    6.2.1

    :warning: This release contains an important security fix :warning:

    A malicious client could send a specially crafted HTTP request, triggering an uncaught exception and killing the Node.js process:

    Error: read ECONNRESET
        at TCP.onStreamRead (internal/stream_base_commons.js:209:20)
    Emitted 'error' event on Socket instance at:
        at emitErrorNT (internal/streams/destroy.js:106:8)
        at emitErrorCloseNT (internal/streams/destroy.js:74:3)
        at processTicksAndRejections (internal/process/task_queues.js:80:21) {
      errno: -104,
      code: 'ECONNRESET',
      syscall: 'read'
    }
    

    Please upgrade as soon as possible.

    Bug Fixes

    • catch errors when destroying invalid upgrades (#658) (425e833)

    6.2.0

    Features

    • add the "maxPayload" field in the handshake details (088dcb4)

    So that clients in HTTP long-polling can decide how many packets they have to send to stay under the maxHttpBufferSize value.

    This is a backward compatible change which should not mandate a new major revision of the protocol (we stay in v4), as we only add a field in the JSON-encoded handshake data:

    0{"sid":"lv_VI97HAXpY6yYWAAAC","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000,"maxPayload":1000000}
    

    6.1.3

    Bug Fixes

    • typings: allow CorsOptionsDelegate as cors options (#641) (a463d26)
    • uws: properly handle chunked content (#642) (3367440)

    ... (truncated)

    Changelog

    Sourced from engine.io's changelog.

    6.2.1 (2022-11-20)

    :warning: This release contains an important security fix :warning:

    A malicious client could send a specially crafted HTTP request, triggering an uncaught exception and killing the Node.js process:

    Error: read ECONNRESET
        at TCP.onStreamRead (internal/stream_base_commons.js:209:20)
    Emitted 'error' event on Socket instance at:
        at emitErrorNT (internal/streams/destroy.js:106:8)
        at emitErrorCloseNT (internal/streams/destroy.js:74:3)
        at processTicksAndRejections (internal/process/task_queues.js:80:21) {
      errno: -104,
      code: 'ECONNRESET',
      syscall: 'read'
    }
    

    Please upgrade as soon as possible.

    Bug Fixes

    • catch errors when destroying invalid upgrades (#658) (425e833)

    3.6.0 (2022-06-06)

    Bug Fixes

    Features

    • decrease the default value of maxHttpBufferSize (58e274c)

    This change reduces the default value from 100 mb to a more sane 1 mb.

    This helps protect the server against denial of service attacks by malicious clients sending huge amounts of data.

    See also: https://github.com/advisories/GHSA-j4f2-536g-r55m

    • increase the default value of pingTimeout (f55a79a)

    ... (truncated)

    Commits
    • 24b847b chore(release): 6.2.1
    • 425e833 fix: catch errors when destroying invalid upgrades (#658)
    • 99adb00 chore(deps): bump xmlhttprequest-ssl and engine.io-client in /examples/latenc...
    • d196f6a chore(deps): bump minimatch from 3.0.4 to 3.1.2 (#660)
    • 7c1270f chore(deps): bump nanoid from 3.1.25 to 3.3.1 (#659)
    • 535a01d ci: add Node.js 18 in the test matrix
    • 1b71a6f docs: remove "Vanilla JS" highlight from README (#656)
    • 917d1d2 refactor: replace deprecated String.prototype.substr() (#646)
    • 020801a chore: add changelog for version 3.6.0
    • ed1d6f9 test: make test script work on Windows (#643)
    • Additional commits viewable in compare view

    Updates browser-sync from 2.24.7 to 2.27.10

    Release notes

    Sourced from browser-sync's releases.

    2.27.9

    What's Changed

    A bug prevented the help output from displaying - it was introduced when the CLI parser yargs was updated, and is now fixed :)

    Full Changelog: https://github.com/BrowserSync/browser-sync/compare/v2.27.8...v2.27.9

    2.27.8

    This release upgrades Socket.io (client+server) to the latest versions - solving the following issues, and silencing security warning :)

    Thanks to @​lachieh for the original PR, which helped me land this fix

    added snippet: boolean option

    This release adds a feature to address BrowserSync/browser-sync#1882

    Sometimes you don't want Browsersync to auto-inject it's connection snippet into your HTML - now you can disable it globally via either a CLI param or the new snippet option :)

    browser-sync . --no-snippet
    

    or in any Browsersync configuration

    const config = {
      snippet: false,
    };
    

    the original request was related to Eleventy usage, so here's how that would look

    eleventyConfig.setBrowserSyncConfig({
      snippet: false,
    });
    

    ... (truncated)


    dependencies javascript 
    opened by dependabot[bot] 0
  • Bump minimatch and gulp in /public

    Bumps minimatch to 3.0.4 and updates ancestor dependency gulp. These dependencies need to be updated together.

    Updates minimatch from 0.2.14 to 3.0.4

    Commits
    Maintainer changes

    This version was pushed to npm by isaacs, a new releaser for minimatch since your current version.


    Updates gulp from 3.9.1 to 4.0.2

    Release notes

    Sourced from gulp's releases.

    v4.0.2

    Fix

    Docs

    • Add notes about esm support (4091bd3) - Closes #2278
    • Fix the Negative Globs section & examples (3c66d95) - Closes #2297
    • Remove next tag from recipes (1693a11) - Closes #2277
    • Add default task wrappers to Watching Files examples to make runnable (d916276) - Closes #2322
    • Fix syntax error in lastRun API docs (ea52a92) - Closes #2315
    • Fix typo in Explaining Globs (5d81f42) - Closes #2326

    Build

    • Add node 12 to Travis & Azure (b4b5a68)

    v4.0.1

    Fix

    Docs

    • Fix error in ES2015 usage example (a4e8d48) - Closes #2099 #2100
    • Add temporary notice for 4.0.0 vs 3.9.1 documentation (126423a) - Closes #2121
    • Improve recipe for empty glob array (45830cf) - Closes #2122
    • Reword standard to default (b065a13)
    • Fix recipe typo (86acdea) - Closes #2156
    • Add front-matter to each file (d693e49) - Closes #2109
    • Rename "Getting Started" to "Quick Start" & update it (6a0fa00)
    • Add "Creating Tasks" documentation (21b6962)
    • Add "JavaScript and Gulpfiles" documentation (31adf07)
    • Add "Working with Files" documentation (50fafc6)
    • Add "Async Completion" documentation (ad8b568)
    • Add "Explaining Globs" documentation (f8cafa0)
    • Add "Using Plugins" documentation (233c3f9)
    • Add "Watching Files" documentation (f3f2d9f)
    • Add Table of Contents to "Getting Started" directory (a43caf2)
    • Improve & fix parts of Getting Started (84b0234)
    • Create and link-to a "docs missing" page for LINK_NEEDED references (2bd75d0)
    • Redirect users to new Getting Started guides (53e9727)
    • Temporarily reference gulp@next in Quick Start (2cecf1e)
    • Fixed a capitalization typo in a heading (3d051d8) - Closes #2242
    • Use h2 headers within Quick Start documentation (921312c) - Closes #2241
    • Fix for nested directories references (4c2b9a7)
    • Add some more cleanup for Docusaurus (6a8fd8f)
    • Temporarily point LINK_NEEDED references to documentation-missing.md (df7cdcb)
    • API documentation improvements based on feedback (0a68710)

    ... (truncated)

    Changelog

    Sourced from gulp's changelog.

    gulp changelog

    4.0.0

    Task system changes

    • replaced 3.x task system (orchestrator) with new task system (bach)
      • removed gulp.reset
      • removed 3 argument syntax for gulp.task
      • gulp.task should only be used when you will call the task with the CLI
      • added gulp.series and gulp.parallel methods for composing tasks. Everything must use these now.
      • added single argument syntax for gulp.task which allows a named function to be used as the name of the task and task function.
      • added gulp.tree method for retrieving the task tree. Pass { deep: true } for an archy compatible node list.
      • added gulp.registry for setting custom registries.

    CLI changes

    • split CLI out into a module if you want to save bandwidth/disk space. you can install the gulp CLI using either npm install gulp -g or npm install gulp-cli -g, where gulp-cli is the smaller one (no module code included)
    • add --tasks-json flag to CLI to dump the whole tree out for other tools to consume
    • added --verify flag to check the dependencies in package.json against the plugin blacklist.

    vinyl/vinyl-fs changes

    • added gulp.symlink which functions exactly like gulp.dest, but symlinks instead.
    • added dirMode param to gulp.dest and gulp.symlink which allows better control over the mode of the destination folder that is created.
    • globs passed to gulp.src will be evaluated in order, which means this is possible: gulp.src(['*.js', '!b*.js', 'bad.js']) (exclude every JS file that starts with a b except bad.js)
    • performance for gulp.src has improved massively
      • gulp.src(['**/*', '!b.js']) will no longer eat CPU since negations happen during walking now
    • added since option to gulp.src which lets you only match files that have been modified since a certain date (for incremental builds)
    • fixed gulp.src not following symlinks
    • added overwrite option to gulp.dest which allows you to enable or disable overwriting of existing files
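
The ordered glob evaluation described above can be illustrated with a minimal sketch in plain JavaScript. It is not gulp's or vinyl-fs's implementation: `globToRegExp` and `selectFiles` are hypothetical helpers, only the `*` wildcard is supported, and the rule assumed here is that the last glob matching a file decides whether the file is kept.

```javascript
// Hypothetical helper: turn a simple glob (only `*` is supported) into an anchored RegExp.
function globToRegExp(glob) {
  // Escape regex metacharacters other than `*`.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  // `*` matches any run of characters within a single path segment.
  return new RegExp('^' + escaped.replace(/\*/g, '[^/]*') + '$');
}

// Hypothetical helper: apply globs in order; the last glob that matches a file wins.
function selectFiles(files, globs) {
  return files.filter((file) => {
    let included = false;
    for (const glob of globs) {
      const negated = glob.startsWith('!');
      const pattern = globToRegExp(negated ? glob.slice(1) : glob);
      if (pattern.test(file)) {
        included = !negated;
      }
    }
    return included;
  });
}

const files = ['a.js', 'b.js', 'bad.js', 'bar.js', 'notes.txt'];
console.log(selectFiles(files, ['*.js', '!b*.js', 'bad.js']));
// [ 'a.js', 'bad.js' ]
```

Because `bad.js` comes after `!b*.js` in the list, it is re-included even though the negation excluded it one step earlier.
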
    Commits
    • 069350a Release: 4.0.2
    • b4b5a68 Build: Add node 12 to Travis & Azure
    • 5667666 Fix: Bind src/dest/symlink to the gulp instance to support esm exports (ref s...
    • 4091bd3 Docs: Add notes about esm support (closes #2278)
    • 3c66d95 Docs: Fix the Negative Globs section & examples (closes #2297)
    • 1693a11 Docs: Remove next tag from recipes (closes #2277)
    • d916276 Docs: Add default task wrappers to Watching Files examples to make runnable (...
    • ea52a92 Docs: Fix syntax error in lastRun API docs (closes #2315)
    • 5d81f42 Docs: Fix typo in Explaining Globs (#2326)
    • ea3bba4 Release: 4.0.1
    • Additional commits viewable in compare view

    dependencies javascript 
    opened by dependabot[bot] 0
  • Bump socket.io-parser and browser-sync in /public

    Bump socket.io-parser and browser-sync in /public

    Bumps socket.io-parser to 4.2.1 and updates ancestor dependency browser-sync. These dependencies need to be updated together.

    Updates socket.io-parser from 3.1.3 to 4.2.1

    Release notes

    Sourced from socket.io-parser's releases.

    4.2.1

    Bug Fixes

    • check the format of the index of each attachment (b5d0cb7)

    4.2.0

    Features

    • allow the usage of custom replacer and reviver (#112) (b08bc1a)

    4.1.2

    Bug Fixes

    • allow objects with a null prototype in binary packets (#114) (7f6b262)

    4.1.1

    4.1.0

    Features

    • provide an ESM build with and without debug (388c616)

    4.0.5

    Bug Fixes

    • check the format of the index of each attachment (b559f05)

    ... (truncated)

    Changelog

    Sourced from socket.io-parser's changelog.

    4.2.1 (2022-06-27)

    Bug Fixes

    • check the format of the index of each attachment (b5d0cb7)

    4.2.0 (2022-04-17)

    Features

    • allow the usage of custom replacer and reviver (#112) (b08bc1a)
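
The terms "replacer" and "reviver" are the names of the optional callbacks accepted by `JSON.stringify` and `JSON.parse`. The sketch below shows only that standard JavaScript convention, on the assumption that the feature follows it; how socket.io-parser accepts these functions is not shown in the notes above. `mapReplacer`, `mapReviver`, and the `__type` tag are hypothetical names chosen for illustration.

```javascript
// Hypothetical replacer: encode Map values as a tagged plain object,
// since JSON has no native representation for Map.
function mapReplacer(key, value) {
  if (value instanceof Map) {
    return { __type: 'Map', entries: Array.from(value.entries()) };
  }
  return value;
}

// Hypothetical reviver: rebuild a Map from the tagged plain object.
function mapReviver(key, value) {
  if (value !== null && typeof value === 'object' && value.__type === 'Map') {
    return new Map(value.entries);
  }
  return value;
}

const packet = { type: 'event', data: new Map([['a', 1], ['b', 2]]) };

const encoded = JSON.stringify(packet, mapReplacer);
const decoded = JSON.parse(encoded, mapReviver);

console.log(decoded.data instanceof Map); // true
console.log(decoded.data.get('b')); // 2
```

Without the replacer, `JSON.stringify` would serialize the Map as an empty object and its entries would be lost.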

    4.1.2 (2022-02-17)

    Bug Fixes

    • allow objects with a null prototype in binary packets (#114) (7f6b262)

    4.1.1 (2021-10-14)

    4.1.0 (2021-10-11)

    Features

    • provide an ESM build with and without debug (388c616)

    4.0.4 (2021-01-15)

    Bug Fixes

    • allow integers as event names (1c220dd)

    4.0.3 (2021-01-05)

    4.0.2 (2020-11-25)

    ... (truncated)

    Commits
    • 5a2ccff chore(release): 4.2.1
    • b5d0cb7 fix: check the format of the index of each attachment
    • c7514b5 chore(release): 4.2.0
    • 931f152 chore: add Node.js 16 in the test matrix
    • 6c9cb27 chore: bump @socket.io/component-emitter to version 3.1.0
    • b08bc1a feat: allow the usage of custom replacer and reviver (#112)
    • aed252c chore(release): 4.1.2
    • 89209fa chore: bump cached-path-relative from 1.0.2 to 1.1.0 (#113)
    • 0a3b556 chore: bump path-parse from 1.0.6 to 1.0.7 (#108)
    • 7f6b262 fix: allow objects with a null prototype in binary packets (#114)
    • Additional commits viewable in compare view

    Updates browser-sync from 2.24.7 to 2.27.10

    Release notes

    Sourced from browser-sync's releases.

    2.27.9

    What's Changed

    A bug prevented the help output from displaying - it was introduced when the CLI parser yargs was updated, and is now fixed :)

    Full Changelog: https://github.com/BrowserSync/browser-sync/compare/v2.27.8...v2.27.9

    2.27.8

    This release upgrades Socket.io (client+server) to the latest versions - solving the following issues, and silencing security warning :)

    PR:

    Resolved Issues:

    Thanks to @lachieh for the original PR, which helped me land this fix

    added snippet: boolean option

    This release adds a feature to address BrowserSync/browser-sync#1882

    Sometimes you don't want Browsersync to auto-inject its connection snippet into your HTML. You can now disable it globally via either a CLI param or the new snippet option :)

    browser-sync . --no-snippet
    

    or in any Browsersync configuration

    const config = {
      snippet: false,
    };
    

    the original request was related to Eleventy usage, so here's how that would look

    eleventyConfig.setBrowserSyncConfig({
      snippet: false,
    });
    

    ... (truncated)

    dependencies javascript 
    opened by dependabot[bot] 0
Releases (v0.4.3)
Owner
Camelot and Excalibur: PDF Table Extraction for Humans
Pdfencrypt is a tool to encrypt/lock PDFs

Pdfencrypt Pdfencrypt is a tool to encrypt/lock PDFs Installation $ apt update $ apt upgrade $ apt install git $ apt install python $ git clone https:

Anontemitayo 5 Nov 28, 2021
Auto-convert PDFs to PNG files in Python

This Python tool, built on the PyMuPDF module, automatically converts PDFs to PNG files

Bo-Yu 4 Dec 5, 2021
pdf_sprinkles: sprinkles text in your PDFs

pdf_sprinkles: sprinkles text in your PDFs pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI, and returns the result as a PDF with searc

Will Angley 2 Dec 17, 2021
Scans PDFs for links written in plaintext and checks whether they are active or return an error code.

Scans PDFs for links written in plaintext and checks whether they are active or return an error code. It then generates a report of its findings. Extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Marshal Miller 22 Nov 21, 2022
A bulk PDF generator. This application can generate PDFs in bulk with just one click.

A bulk HTML PDF generator. This application can generate PDFs in bulk with just one click. Screenshots Requirements Your system must have the f

Aman Nirala 3 Apr 23, 2022
Extracts tables from a PDF and outputs the data in a JSON-like format

Extracts tables from a PDF and outputs the data in a JSON-like format

null 3 Nov 25, 2021
Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

null 1.8k Dec 29, 2022
WeasyPrint is a smart solution helping web developers to create PDF documents.

WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous statistical reports, invoices, tickets…

Kozea 5.4k Jan 8, 2023
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
This is a GUI for scraping PDFs with the help of optical character recognition, making it easier than ever to scrape PDFs.

pdf-scraper-with-ocr With this tool I am aiming to facilitate the work of those who need to scrape PDFs either by hand or using tools that doesn't imp

Jacobo José Guijarro Villalba 75 Oct 21, 2022
Extract tables from scanned image PDFs using Optical Character Recognition.

ocr-table This project aims to extract tables from scanned image PDFs using Optical Character Recognition. Install Requirements Tesseract OCR sudo apt

Abhijeet Singh 209 Dec 6, 2022
Camelot is a Python library that can help you extract tables from PDFs!

A Python library to extract tabular data from PDFs

null 1.8k Jan 3, 2023
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

ArchiveBox Open-source self-hosted web archiving. ▶️ Quickstart | Demo | Github | Documentation | Info & Motivation | Community | Roadmap "Your own pe

ArchiveBox 14.8k Jan 5, 2023
Reads Data from given Excel File and exports Single PDFs and a complete PDF grouped by Gateway

E-Shelter Excel2QR Reads Data from given Excel File and exports Single PDFs and a complete PDF grouped by Gateway Features Reads Excel 2021 Export Sin

Stefan Knaak 1 Nov 13, 2021
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

null 2k Dec 27, 2022
Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

doc2text doc2text extracts higher quality text by fixing common scan errors Developing text corpora can be a massive pain in the butt. Much of the tex

Joe Sutherland 1.3k Jan 4, 2023
A Python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 7, 2022
A toolkit to automatically crawl the paper list and download paper PDFs from the ACL Anthology.

ACL-Anthology-Crawler A toolkit to automatically crawl the paper list and download paper PDFs from the ACL Anthology

Ray GG 9 Oct 9, 2022