Tools for parsing messy tabular data.

Open Knowledge Foundation

Last update: Nov 10, 2022

Related tags

Pipelines messytables

Overview

Parsing for messy tables

A library for dealing with messy tabular data in several formats, guessing types and detecting headers.

See the documentation at: https://messytables.readthedocs.io

Find the package at: https://pypi.python.org/pypi/messytables

See CONTRIBUTING.md for how to send patches and run tests.

Contact: Open Knowledge Labs - http://okfnlabs.org/contact/. We especially recommend the forum: http://discuss.okfn.org/category/open-knowledge-labs/

Comments

HTMLTableSet
Hi, here's an HTML Table Set importer for messytables.

It's not fantastic yet, but it's a pretty good start.

Supports rowspan/colspan - currently by inserting blank cells.

Supports multiple TABLE elements - but may have unexpected behaviour where there are nested tables.

Doesn't attempt to handle tables that aren't using TABLE, TR, TD, TH.

Not enormously well tested, but seems to work on the tables I've fed it so far.

Requires lxml.
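
As a rough illustration of the "insert blank cells" approach to colspan, here is a minimal sketch that uses only the standard library's html.parser rather than lxml. It is not the code from this pull request: the class name TableRows is hypothetical, and rowspan and nested tables are deliberately left out.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table rows, padding colspan cells with blank strings.

    Hypothetical helper for illustration only; rowspan and nested
    tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the current cell
        self._colspan = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            # A missing or empty colspan attribute counts as 1.
            self._colspan = int(dict(attrs).get("colspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            # Fill the remaining spanned columns with blank cells.
            self._row.extend([""] * (self._colspan - 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(
    '<table><tr><th colspan="2">A</th><th>B</th></tr>'
    "<tr><td>1</td><td>2</td><td>3</td></tr></table>"
)
print(parser.rows)
```

Padding with blanks keeps every row the same width, which is what later header and type guessing steps would need.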

It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.
opened by scraperdragon 12

All releases BROKEN due to json-table-schema name change

json-table-schema is a broken dependency as of yesterday. This affects current and previous releases on pypi.

To fix this at this end we've changed the dep https://github.com/okfn/messytables/pull/143 and now messytables installs from source again, but it needs a release to pypi. I don't have permission for this.

(test)co@precise64:/tmp$ pip install messytables
Downloading/unpacking messytables
  Downloading messytables-0.15.0.tar.gz
  Running setup.py egg_info for package messytables

Downloading/unpacking xlrd>=0.8.0 (from messytables)
  Downloading xlrd-0.9.4.tar.gz (322Kb): 322Kb downloaded
  Running setup.py egg_info for package xlrd

Downloading/unpacking python-magic>=0.4.6 (from messytables)
  Downloading python-magic-0.4.10.tar.gz
  Running setup.py egg_info for package python-magic

    no previously-included directories found matching 'test'
Downloading/unpacking chardet>=2.3.0 (from messytables)
  Downloading chardet-2.3.0.tar.gz (164Kb): 164Kb downloaded
  Running setup.py egg_info for package chardet

    warning: no files found matching 'COPYING'
    warning: no files found matching '*.html' under directory 'docs'
    warning: no files found matching '*.css' under directory 'docs'
    warning: no files found matching '*.png' under directory 'docs'
    warning: no files found matching '*.gif' under directory 'docs'
Downloading/unpacking python-dateutil>=2.4.2 (from messytables)
  Downloading python-dateutil-2.4.2.tar.gz (209Kb): 209Kb downloaded
  Running setup.py egg_info for package python-dateutil

Downloading/unpacking lxml>=3.2 (from messytables)
  Downloading lxml-3.5.0b1.tar.gz (3.8Mb): 3.8Mb downloaded
  Running setup.py egg_info for package lxml
    Building lxml version 3.5.0b1.
    Building without Cython.
    Using build configuration of libxslt 1.1.26
    Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu

    warning: no previously-included files found matching '*.py'
Downloading/unpacking requests (from messytables)
  Downloading requests-2.8.1.tar.gz (480Kb): 480Kb downloaded
  Running setup.py egg_info for package requests

Downloading/unpacking html5lib (from messytables)
  Downloading html5lib-1.0b8.tar.gz (889Kb): 889Kb downloaded
  Running setup.py egg_info for package html5lib

Downloading/unpacking json-table-schema>=0.2 (from messytables)
  Downloading json-table-schema-0.5.0.tar.gz
  Running setup.py egg_info for package json-table-schema
    json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
        with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
    IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'
    Complete output from command python setup.py egg_info:
    json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.

Traceback (most recent call last):

  File "<string>", line 14, in <module>

  File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>

    with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:

IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'

----------------------------------------
Command python setup.py egg_info failed with error code 1 in /tmp/test/build/json-table-schema
Storing complete log in /home/co/.pip/pip.log

opened by davidread 11

Getting messytables to run on Python 3

Does anyone know, informally or otherwise, what it will take to get messytables running on Python 3?

I'm keen to use various functions and modules from messytables, but I'm trying to maintain 2.7/3.3/3.4 support in my own libraries.

opened by pwalsh 11
Application for maintainership

Hey all. This repository seems to be semi-inactive, and it is unclear to me what the path to merging a PR like #171 is (who would have to approve?). I use messytables in production code day to day, and this lack of clarity on process makes the library a liability. My understanding is that okfn's resources and interest are focussed on goodtables and the frictionlessdata toolchain.

I would therefore like to apply to become the maintainer for messytables, merge #171 & co., and generally make sure that changes in this thing are handled and bugs are actively tracked.

Thoughts, @pwalsh, @davidread, @rufuspollock? Please let me know.

opened by pudo 10
TypeError("object of type 'float' has no len()",) when calling type_guess

I could trace this back to #141 where len() is being used in the test() method of DateUtilType.

I think there should be a try/except block around that, that catches this TypeError. But I'm not too familiar with the code, so I'm basically asking if you agree, or if I'm missing something.
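
One way to read the suggestion above, as a hedged sketch: guard the length check so that non-string cells (such as floats coming from an Excel sheet) are rejected instead of raising. The helper name and the minimum-length threshold are assumptions for illustration; the actual check in DateUtilType.test() may look different.

```python
def has_usable_length(value, min_length=2):
    """Return True if value has a length of at least min_length.

    Hypothetical helper: values without a length (floats, ints, None)
    are treated as "not a date string" rather than raising TypeError.
    """
    try:
        return len(value) >= min_length
    except TypeError:
        return False


print(has_usable_length("2015-01-01"))
print(has_usable_length(3.14))
```

An explicit isinstance check against the string type would work as well; the try/except form simply mirrors the proposal in this issue.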

I'm happy to provide the PR.

BTW: I'm getting this error via datapusher on some Excel sheet that is being parsed with the default parameters. The excel sheet has indeed a lot of float values in it.

opened by metaodi 10
[discussion] messytables should *only* work with local files

Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls) or the buffer passed to magic.from_buffer (which is inadequate in some cases and from_file would be more accurate).

Downloading the content to temporary storage isn't an onerous task, and if the interface were modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).
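
A minimal sketch of the "download to temporary storage first" idea, assuming only the standard library. The function name spool_to_tempfile is hypothetical and not part of messytables; it copies any readable binary stream into a named temporary file so that later code can seek freely or hand the filename to a tool that needs a path.

```python
import io
import shutil
import tempfile


def spool_to_tempfile(fileobj, suffix=""):
    """Copy a (possibly non-seekable) binary stream to a temporary file.

    Returns an open, seekable file positioned at the start. Its .name
    attribute is a real path while the file stays open.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix)
    shutil.copyfileobj(fileobj, tmp)  # streams in chunks, not all in RAM
    tmp.flush()
    tmp.seek(0)
    return tmp


# Usage with an in-memory stream standing in for a socket.
source = io.BytesIO(b"a,b\n1,2\n")
local = spool_to_tempfile(source, suffix=".csv")
print(local.name)
local.close()  # closing the file also deletes it
```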
question

opened by rossjones 10
Support for PDF format

We've been exploring different options for parsing PDFs. Currently we're using an (alpha) in-house library called pdftables (we blogged about it here)

This pull request integrates pdftables into messytables. It is an optional requirement - if pdftables is not installed, messytables will work as usual and the PDF tests will be skipped.
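
The optional-dependency behaviour described above is commonly implemented with a guarded import. The following is a sketch of that general pattern, not the code from this pull request; the flag name PDF_SUPPORTED and the test class are assumptions for illustration.

```python
import unittest

try:
    import pdftables  # optional dependency
except ImportError:
    pdftables = None

# Hypothetical flag that other code can check before offering PDF support.
PDF_SUPPORTED = pdftables is not None


class PDFReadTest(unittest.TestCase):
    @unittest.skipUnless(PDF_SUPPORTED, "pdftables is not installed")
    def test_pdf_module_available(self):
        # Only runs when the optional dependency could be imported.
        self.assertIsNotNone(pdftables)
```

With this arrangement a missing pdftables shows up as skipped tests rather than failures.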

We're looking into other ways of extracting tables from PDFs, but either way we'll need the messytables integration.

opened by fawkesley 9
[WIP] Support for ODS files.

A reworked reader for ODS files that doesn't use any broken third-party libraries. Reads the .xml directly from the zipfile and performs much better on larger spreadsheets.
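
To illustrate what "reads the .xml directly from the zipfile" can look like, here is a minimal sketch using only the standard library. It is not the reader from this pull request: the function name read_ods_rows is hypothetical, the whole content.xml is parsed in one go, and repeated-cell attributes and typed values are ignored. The two namespace URIs are the standard OpenDocument table and text namespaces.

```python
import xml.etree.ElementTree as ET
import zipfile

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_ods_rows(fileobj):
    """Yield each spreadsheet row as a list of cell text strings.

    Simplified sketch: all sheets are concatenated, and attributes such
    as table:number-columns-repeated are not expanded.
    """
    with zipfile.ZipFile(fileobj) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    for table in root.iter("{%s}table" % TABLE_NS):
        for row in table.iter("{%s}table-row" % TABLE_NS):
            yield [
                "".join(p.text or "" for p in cell.iter("{%s}p" % TEXT_NS))
                for cell in row.iter("{%s}table-cell" % TABLE_NS)
            ]
```

A reader aimed at large spreadsheets would more likely stream the XML (for example with ET.iterparse) instead of loading the full tree.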

opened by rossjones 9

libmagic error following messytables overview

I'm working from http://messytables.readthedocs.org/en/latest/ but have also looked at the GitHub readme, etc. I couldn't find any actual install instructions anywhere, but here's what I did.

Environment: Mac OS X latest, up to date homebrew

pip install messytables
brew install libmagic

The following Python:

% python                
Python 2.7.6 (default, Nov 14 2013, 09:55:56) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import messytables
>>> messytables.any_tableset(open('README.txt', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 138, in any_tableset
    magic_mime = get_mime(fileobj)
  File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 38, in get_mime
    mimetype = magic.from_buffer(header, mime=True)
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 103, in from_buffer
    def __init__(self, ms):
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 94, in _get_magic_type
    _list = _libraries['magic'].magic_list
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 83, in _get_magic_mime
    _load.restype = c_int
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 51, in __init__
    magic_set._fields_ = []
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 138, in errorcheck
    except:
magic.MagicException: no magic files loaded

README.txt:

============
README
============

A single-line README file.

opened by dhalperi 8

Remove openpyxl, use XLSTableSet for XLSX files

Phase 1 of 2 for completely removing openpyxl and using XLSTableSet instead. (Phase 2 will actually remove the dependency and excelx.py, then you won't be able to reference XLSXTableSet)

If you always use any_tableset it'll just work correctly - you'll now get back an XLSTableSet instead of an XLSXTableSet.

I've left the latter in with a DeprecationWarning (and test) in order to remain compatible with code that explicitly references XLSXTableSet.
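
A sketch of how such a compatibility shim is typically written. The two classes below are stand-ins for illustration, not the real messytables classes, and the warning text is an assumption.

```python
import warnings


class XLSTableSet:
    """Stand-in for the real reader class."""

    def __init__(self, fileobj=None):
        self.fileobj = fileobj


class XLSXTableSet(XLSTableSet):
    """Deprecated alias kept so that existing imports continue to work."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "XLSXTableSet is deprecated; use XLSTableSet or any_tableset.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        super().__init__(*args, **kwargs)
```

Code that still instantiates XLSXTableSet keeps working and gets an XLSTableSet subclass, while the warning points at the call site that needs updating.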

I feel we should encourage people towards only using any_tableset (perhaps with an argument to override the type detection). It's quite awkward that our users are currently coupling needlessly to our class naming convention. Unless I've missed a use-case - any compelling reasons to allow that?

Not ready to merge yet I suspect. Closes #83

opened by fawkesley 8
65 rework of detection in any.py
We were having problems with any.py, so I rewrote it.

Features:

new extension detection function (you can pass a whole filename/URL)

nice lists of mimetypes/extensions parsed

special pleading for XLS/XLSX files :(

tests for autodetection

various fixes
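
For illustration, an extension detection function that accepts either a filename or a full URL might look like the sketch below. It is not the function from this pull request; the name guess_extension is hypothetical and only the standard library is used.

```python
import os
from urllib.parse import urlparse


def guess_extension(name_or_url):
    """Return the lower-cased file extension without the dot, or None.

    Query strings and fragments in URLs are ignored because only the
    path component is inspected.
    """
    path = urlparse(name_or_url).path
    extension = os.path.splitext(path)[1]
    return extension.lstrip(".").lower() or None


print(guess_extension("http://example.com/data/report.XLSX?download=1"))
print(guess_extension("table.csv"))
```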
opened by scraperdragon 8

Failure to load with Python 3.10

Attempting to use messytables with Python 3.10 results in the following error:

  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/messytables/core.py", line 2, in <module>
    from collections import Mapping
ImportError: cannot import name 'Mapping' from 'collections' (/opt/python3.10/lib/python3.10/collections/__init__.py)

This is because the long-deprecated alias for Mapping in collections was removed in Python 3.10; Mapping must now be imported from collections.abc.

core.py should be updated to take account of this.
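
A common way to make such an import work on both old and new interpreters is a guarded import. This is a sketch of the pattern, not the actual patch to core.py.

```python
try:
    # Python 3.3 and later
    from collections.abc import Mapping
except ImportError:
    # Python 2 fallback
    from collections import Mapping

print(isinstance({}, Mapping))
```

If Python 2 support is no longer required, the plain collections.abc import is sufficient on its own.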

opened by davidharcombe 0

Bump lxml from 4.3.4 to 4.9.1
Bumps lxml from 4.3.4 to 4.9.1.

Changelog

Sourced from lxml's changelog.

4.9.1 (2022-07-01)

Bugs fixed

A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

4.9.0 (2022-06-01)

Bugs fixed

GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

Other changes

Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

4.8.0 (2022-02-17)

Features added

GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

Bugs fixed

GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

... (truncated)

Commits

d01872c Prevent parse failure in new test from leaking into later test runs.

d65e632 Prepare release of lxml 4.9.1.

86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...

50c2764 Delete unused Travis CI config and reference in docs (GH-345)

8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...

b9f7074 Remove debug print from test.

b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...

897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...

853c9e9 Prepare release of 4.9.0.

d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...

Additional commits viewable in compare view


dependencies
opened by dependabot[bot] 0
messytables guesses wrong type for decimal number
Describe the bug: Messytables should guess decimals correctly, respecting the locale configuration. For example, in Germany the comma is used as the decimal separator, but a value such as 1,200 is guessed as type "text".

This issue was initially reported as ckan issue https://github.com/ckan/ckan/issues/5769, which is where I noticed it.

The type guessing seems to happen here: https://github.com/okfn/messytables/blob/51b736892a48e420ab313675f54901c77b446dec/messytables/types.py and seems to be locale-specific. (I think the magic happens in line 100: value = locale.atof(value))

Unfortunately, Python seems to recognize a dot as the decimal point even if a German locale is set, which I could reproduce in my local environment:

>>> locale.getlocale()
('de_DE', 'cp1252')
>>> locale.atof('1,200')
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    locale.atof('1,200')
  File "C:\Program Files\Python27\lib\locale.py", line 318, in atof
    return func(string)
ValueError: invalid literal for float(): 1,200
>>> locale.localeconv()
{'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
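
One possible workaround, sketched here under the assumption that the caller knows which separators the data uses, is to parse with explicit separators instead of relying on the process-wide locale. The function parse_decimal and its parameters are hypothetical and not part of messytables.

```python
from decimal import Decimal, InvalidOperation


def parse_decimal(value, decimal_point=",", thousands_sep="."):
    """Parse a number written with the given separators.

    Returns a Decimal, or None if the text is not a number. The
    defaults match German-style formatting such as "1.234,5".
    """
    cleaned = value.strip().replace(thousands_sep, "")
    cleaned = cleaned.replace(decimal_point, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_decimal("1,200"))
print(parse_decimal("1.234,5"))
```

Passing the separators explicitly avoids depending on whether locale.setlocale() has been called and on which locales are installed on the machine.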
opened by wrinklenose 1
test_attempt_read_encrypted_no_password_xls failure in Python 3.7+
This line specifies an error message. In the test, the text of the exception caused by the code under test is expected to match exactly.

errmsg = "Can't read Excel file: XLRDError('Workbook is encrypted',)"

When running tests on Python 3.7 and 3.8 this fails, because their outputs do not contain the comma (probably due to this change in Python 3.7, I'm guessing).
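
The repr() of an exception with a single argument lost its trailing comma in Python 3.7, so any message built from repr() differs between versions. A hedged sketch of one way to build a message that is stable across versions follows; XLRDError here is a stand-in class and describe is a hypothetical helper, neither taken from the messytables code.

```python
class XLRDError(Exception):
    """Stand-in for xlrd's exception class."""


def describe(error):
    # str(error) is the bare message and is the same on every version;
    # repr(error) is what changed between Python 3.6 and 3.7.
    return "Can't read Excel file: %s: %s" % (type(error).__name__, error)


print(describe(XLRDError("Workbook is encrypted")))
```

Alternatively, the test could check only that the expected fragment ("Workbook is encrypted") appears in the message instead of requiring an exact match.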
opened by StevenMaude 0
requirements-test.txt should have xlrd==1.2.0 (or >=) for Python 3.8+ tests

This version of xlrd is currently pinned for testing on Travis in requirements-test.txt.

Prior to v1.2.0, xlrd used the time.clock() function inside book.py and this was removed in Python 3.8.

opened by StevenMaude 0

Releases(0.15.1)

0.15.1(Sep 29, 2016)


Owner

Open Knowledge Foundation

Also find us at: @frictionlessdata @opentrials @openspending @openknowledge-archive

GitHub http://messytables.readthedocs.io/

functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it

188 Nov 24, 2022

Clean APIs for data cleaning. Python implementation of R package Janitor

pyjanitor pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data. Why janitor? Originally a port of

1.1k Jan 1, 2023

Build, test, deploy, iterate - Dev and prod tool for data science pipelines

Prodmodel is a build system for data science pipelines. Users, testers, contributors are welcome! Motivation · Concepts · Installation · Usage · Contr

53 Nov 29, 2022

Microsoft Azure provides a wide number of services for managing and storing data

Microsoft Azure provides a wide number of services for managing and storing data. One product is Microsoft Azure SQL. Which gives us the capability to create and manage instances of SQL Servers hosted in the cloud. This project, demonstrates how to use these services to manage data we collect from different sources.

1 Dec 12, 2021

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

Command line utilities for tabular data files This is a set of command line utilities for manipulating large tabular data files. Files of numeric and

1.4k Jan 9, 2023

Implementation of fast algorithms for Maximum Spanning Tree (MST) parsing that includes fast ArcMax+Reweighting+Tarjan algorithm for single-root dependency parsing.

Fast MST Algorithm Implementation of fast algorithms for (Maximum Spanning Tree) MST parsing that includes fast ArcMax+Reweighting+Tarjan algorithm fo

11 Oct 14, 2022

Course-parsing - Parsing Course Info for NIT Kurukshetra

Parsing Course Info for NIT Kurukshetra Overview This repository houses code for

3 Feb 3, 2022

A collection of robust and fast processing tools for parsing and analyzing web archive data.

ChatNoir Resiliparse A collection of robust and fast processing tools for parsing and analyzing web archive data. Resiliparse is part of the ChatNoir

24 Nov 29, 2022

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.

CKAN: The Open Source Data Portal Software CKAN is the world’s leading open-source data portal platform. CKAN makes it easy to publish, share and work

3.6k Dec 27, 2022

Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.

python-tabulate Pretty-print tabular data in Python, a library and a command-line utility. The main use cases of the library are: printing small table

1.5k Jan 6, 2023

Display tabular data in a visually appealing ASCII table format

PrettyTable Installation Install via pip: python -m pip install -U prettytable Install latest development version: python -m pip install -U git+https

924 Jan 5, 2023

A standard framework for modelling Deep Learning Models for tabular data

PyTorch Tabular aims to make Deep Learning with Tabular data easy and accessible to real-world cases and research alike.

801 Jan 8, 2023

Implementation of TabTransformer, attention network for tabular data, in Pytorch

Tab Transformer Implementation of Tab Transformer, attention network for tabular data, in Pytorch. This simple architecture came within a hair's bread

420 Jan 5, 2023

A Python package for manipulating 2-dimensional tabular data structures

datatable This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame

1.6k Jan 5, 2023

Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

What is Vaex? Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular data

7.7k Jan 1, 2023

A Python toolkit for processing tabular data

401 Dec 19, 2022

Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

243 Dec 30, 2022

Python library to extract tabular data from images and scanned PDFs

Overview ExtractTable - API to extract tabular data from images and scanned PDFs The motivation is to make it easy for developers to extract tabular d

165 Dec 31, 2022

Boosted neural network for tabular data

XBNet - Xtremely Boosted Network Boosted neural network for tabular data XBNet is an open source project which is built with PyTorch which tries to co

175 Jan 4, 2023

The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

284 Dec 21, 2022
