Tools for parsing messy tabular data.

Overview
Comments
  • HTMLTableSet

    Hi, here's an HTML Table Set importer for messytables.

    It's not fantastic yet, but it's a pretty good start:

    • Supports rowspan/colspan - currently by inserting blank cells.
    • Supports multiple TABLE elements - but may have unexpected behaviour where there are nested tables.
    • Doesn't attempt to handle tables that aren't using TABLE, TR, TD, TH.
    • Not enormously well tested, but seems to work on the tables I've fed it so far.
    • Requires lxml.
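
The rowspan/colspan behaviour listed above (padding spanned positions with blank cells) can be sketched roughly as follows. This is not the code from the pull request: it swaps lxml for the standard library's html.parser so the example is self-contained, and the names TableGridParser and expand_spans are made up for illustration. Like the importer described above, it makes no attempt to handle nested tables correctly.

```python
from html.parser import HTMLParser


def _span(value):
    """Read a rowspan/colspan attribute, falling back to 1 if missing or invalid."""
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return 1


def expand_spans(rows):
    """Place (text, rowspan, colspan) cells on a grid; spanned slots become ''."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:          # skip slots claimed by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = ""   # blank filler for the whole span
            grid[(r, c)] = text                    # real value in the top-left slot
            c += colspan
    if not grid:
        return []
    height = max(r for r, _ in grid) + 1
    width = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(width)] for r in range(height)]


class TableGridParser(HTMLParser):
    """Hypothetical helper: collect every TABLE as a rectangular list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []     # finished grids, one per TABLE element
        self._rows = None    # raw rows of the table currently being read
        self._cell = None    # [text parts, rowspan, colspan] of the open cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self._rows = []  # a nested TABLE would reset this (not handled)
        elif tag == "tr" and self._rows is not None:
            self._rows.append([])
        elif tag in ("td", "th") and self._rows:
            self._cell = [[], _span(attrs.get("rowspan")), _span(attrs.get("colspan"))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self._rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(expand_spans(self._rows))
            self._rows = None
```

Feeding a document to the parser with feed() leaves one grid per TABLE element in parser.tables, with every row padded to the same width.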

    It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.

    opened by scraperdragon 12
  • All releases BROKEN due to json-table-schema name change

    json-table-schema is a broken dependency as of yesterday. This affects current and previous releases on PyPI.

    To fix this at our end we've changed the dependency (https://github.com/okfn/messytables/pull/143), so messytables installs from source again, but it still needs a release to PyPI. I don't have permission to do that.

    (test)co@precise64:/tmp$ pip install messytables
    Downloading/unpacking messytables
      Downloading messytables-0.15.0.tar.gz
      Running setup.py egg_info for package messytables
    
    Downloading/unpacking xlrd>=0.8.0 (from messytables)
      Downloading xlrd-0.9.4.tar.gz (322Kb): 322Kb downloaded
      Running setup.py egg_info for package xlrd
    
    Downloading/unpacking python-magic>=0.4.6 (from messytables)
      Downloading python-magic-0.4.10.tar.gz
      Running setup.py egg_info for package python-magic
    
        no previously-included directories found matching 'test'
    Downloading/unpacking chardet>=2.3.0 (from messytables)
      Downloading chardet-2.3.0.tar.gz (164Kb): 164Kb downloaded
      Running setup.py egg_info for package chardet
    
        warning: no files found matching 'COPYING'
        warning: no files found matching '*.html' under directory 'docs'
        warning: no files found matching '*.css' under directory 'docs'
        warning: no files found matching '*.png' under directory 'docs'
        warning: no files found matching '*.gif' under directory 'docs'
    Downloading/unpacking python-dateutil>=2.4.2 (from messytables)
      Downloading python-dateutil-2.4.2.tar.gz (209Kb): 209Kb downloaded
      Running setup.py egg_info for package python-dateutil
    
    Downloading/unpacking lxml>=3.2 (from messytables)
      Downloading lxml-3.5.0b1.tar.gz (3.8Mb): 3.8Mb downloaded
      Running setup.py egg_info for package lxml
        Building lxml version 3.5.0b1.
        Building without Cython.
        Using build configuration of libxslt 1.1.26
        Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu
    
        warning: no previously-included files found matching '*.py'
    Downloading/unpacking requests (from messytables)
      Downloading requests-2.8.1.tar.gz (480Kb): 480Kb downloaded
      Running setup.py egg_info for package requests
    
    Downloading/unpacking html5lib (from messytables)
      Downloading html5lib-1.0b8.tar.gz (889Kb): 889Kb downloaded
      Running setup.py egg_info for package html5lib
    
    Downloading/unpacking json-table-schema>=0.2 (from messytables)
      Downloading json-table-schema-0.5.0.tar.gz
      Running setup.py egg_info for package json-table-schema
        json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
        Traceback (most recent call last):
          File "<string>", line 14, in <module>
          File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
            with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
        IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'
        Complete output from command python setup.py egg_info:
        json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
    
    Traceback (most recent call last):
    
      File "<string>", line 14, in <module>
    
      File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
    
        with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
    
    IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'
    
    ----------------------------------------
    Command python setup.py egg_info failed with error code 1 in /tmp/test/build/json-table-schema
    Storing complete log in /home/co/.pip/pip.log
    
    opened by davidread 11
  • Getting messytables to run on Python 3

    Does anyone know, informally or otherwise, what it will take to get messytables running on Python 3?

    I'm keen to use various functions and modules from messytables, but I'm trying to maintain 2.7/3.3/3.4 support in my own libraries.

    opened by pwalsh 11
  • Application for maintainership

    Hey all. This repository seems to be semi-inactive, and it is unclear to me what the path to merging a PR like #171 is (who would have to approve?). I use messytables in production code day to day, and this lack of clarity on process makes the library a liability. My understanding is that okfn's resources and interest are focussed on goodtables and the frictionlessdata toolchain.

    I would therefore like to apply to become the maintainer for messytables, merge #171 & co., and generally make sure that changes in this thing are handled and bugs are actively tracked.

    Thoughts, @pwalsh, @davidread, @rufuspollock? Please let me know.

    opened by pudo 10
  • TypeError("object of type 'float' has no len()",) when calling type_guess

    I could trace this back to #141 where len() is being used in the test() method of DateUtilType.

    I think there should be a try/except block around that call to catch this TypeError. I'm not too familiar with the code, though, so I'm basically asking whether you agree or whether I'm missing something.

    I'm happy to provide the PR.

    BTW: I'm getting this error via datapusher on an Excel sheet that is being parsed with the default parameters. The Excel sheet does indeed contain a lot of float values.
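
A minimal sketch of the guard proposed above, assuming the check only needs to reject values that are not sized. The function name and the max_length parameter are hypothetical; this is not the actual test() method of DateUtilType.

```python
def looks_like_date_string(value, max_length=30):
    """Hypothetical pre-check: is this value a plausible date string candidate?"""
    try:
        # len() raises TypeError for floats, ints and None.
        return 0 < len(value) <= max_length
    except TypeError:
        # Unsized values (e.g. the float cells xls readers return) are not candidates.
        return False
```

An alternative with the same effect would be an explicit isinstance check against the string type before calling len(); the try/except form simply mirrors the suggestion in the issue.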

    opened by metaodi 10
  • [discussion] messytables should *only* work with local files

    Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

    The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls) or of the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).

    Downloading the content to temporary storage isn't an onerous task, and if the interface were modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).
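
As a rough illustration of the "download to temporary storage first" idea, a non-seekable stream can be copied to a named temporary file and the path handed on. The helper name spool_to_tempfile is an assumption for this sketch; nothing like it is claimed to exist in messytables.

```python
import shutil
import tempfile


def spool_to_tempfile(fileobj, suffix=""):
    """Copy a (possibly non-seekable) binary stream to disk and return the path.

    The caller is responsible for deleting the file when finished.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        # copyfileobj reads in chunks, so the whole body is never held in RAM.
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name
```

The returned path can then be reopened in binary mode, which gives a seekable file for zip or xls readers and a real filename for file-based type detection.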

    question 
    opened by rossjones 10
  • Support for PDF format

    We've been exploring different options for parsing PDFs. Currently we're using an (alpha) in-house library called pdftables (we blogged about it here)

    This pull request integrates pdftables into messytables. It is an optional requirement - if pdftables is not installed, messytables will work as usual and the PDF tests will be skipped.

    We're looking into other ways of extracting tables from PDFs, but either way we'll need the messytables integration.

    opened by fawkesley 9
  • [WIP] Support for ODS files.

    A reworked reader for ODS files that doesn't use any broken third-party libraries. Reads the .xml directly from the zipfile and performs much better on larger spreadsheets.
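
For readers unfamiliar with the format: an ODS file is a zip archive whose cell data lives in content.xml, so a reader can be built from the standard library alone. The sketch below shows the general approach, not the code in this pull request. The function name read_ods_tables is hypothetical, the whole XML document is parsed in one go rather than streamed, and repeated-cell attributes such as number-columns-repeated are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces for table structure and text paragraphs.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_ods_tables(fileobj):
    """Return {sheet name: rows}, where each row is a list of cell strings."""
    with zipfile.ZipFile(fileobj) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = {}
    for table in root.iter("{%s}table" % TABLE_NS):
        name = table.get("{%s}name" % TABLE_NS)
        rows = []
        for row in table.iter("{%s}table-row" % TABLE_NS):
            cells = []
            for cell in row.iter("{%s}table-cell" % TABLE_NS):
                # A cell may hold several text:p paragraphs; join them with newlines.
                paragraphs = cell.iter("{%s}p" % TEXT_NS)
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables[name] = rows
    return tables
```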

    opened by rossjones 9
  • libmagic error following messytables overview

    I'm working from http://messytables.readthedocs.org/en/latest/ but have also looked at the GitHub readme, etc. I couldn't find any actual install instructions anywhere, but here's what I did.

    Environment: Mac OS X latest, up to date homebrew

    1. pip install messytables

    2. brew install libmagic

    3. The following Python:

      % python                
      Python 2.7.6 (default, Nov 14 2013, 09:55:56) 
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import messytables
      >>> messytables.any_tableset(open('README.txt', 'rb'))
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 138, in any_tableset
          magic_mime = get_mime(fileobj)
        File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 38, in get_mime
          mimetype = magic.from_buffer(header, mime=True)
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 103, in from_buffer
          def __init__(self, ms):
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 94, in _get_magic_type
          _list = _libraries['magic'].magic_list
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 83, in _get_magic_mime
          _load.restype = c_int
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 51, in __init__
          magic_set._fields_ = []
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 138, in errorcheck
          except:
      magic.MagicException: no magic files loaded
      
    4. README.txt:

      ============
      README
      ============
      
      A single-line README file.
      
    opened by dhalperi 8
  • Remove openpyxl, use XLSTableSet for XLSX files

    Phase 1 of 2 for completely removing openpyxl and using XLSTableSet instead. (Phase 2 will actually remove the dependency and excelx.py, then you won't be able to reference XLSXTableSet)

    If you always use any_tableset it'll just work correctly - you'll now get back an XLSTableSet instead of an XLSXTableSet.

    I've left the latter in with a DeprecationWarning (and test) in order to remain compatible with code that explicitly references XLSXTableSet.
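
The usual shape of such a deprecation shim is sketched below. The two classes are bare stand-ins that borrow the names from this pull request so the example runs on its own; they are not the real messytables classes, and the warning text is invented.

```python
import warnings


class XLSTableSet:
    """Stand-in for the real reader; exists only to make the sketch runnable."""

    def __init__(self, fileobj=None):
        self.fileobj = fileobj


class XLSXTableSet(XLSTableSet):
    """Deprecated name kept so that existing imports keep working."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "XLSXTableSet is deprecated; use XLSTableSet or any_tableset",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not this shim
        )
        super().__init__(*args, **kwargs)
```

Because the old name subclasses the new one, isinstance checks against XLSTableSet keep passing for code that still constructs an XLSXTableSet.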

    I feel we should encourage people towards only using any_tableset (perhaps with an argument to override the type detection). It's quite awkward that our users are currently coupled to our class naming convention for no good reason. Unless I've missed a use case, are there any compelling reasons to allow that?

    Not ready to merge yet I suspect. Closes #83

    opened by fawkesley 8
  • 65 rework of detection in any.py

    We were having problems with any.py, so I rewrote it.

    Features:

    • new extension detection function (you can pass a whole filename/URL)
    • nice lists of mimetypes/extensions parsed
    • special pleading for XLS/XLSX files :(
    • tests for autodetection
    • various fixes
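
To illustrate the first feature, extension detection that accepts a whole filename or URL could look roughly like this. The function name guess_extension is made up for the sketch and is not the function added by this pull request.

```python
import os
from urllib.parse import urlparse


def guess_extension(name_or_url):
    """Return the lower-cased extension without its dot, or None if there is none."""
    # urlparse drops any query string or fragment; a plain filename passes through.
    path = urlparse(name_or_url).path
    ext = os.path.splitext(path)[1]
    return ext[1:].lower() or None
```

The result could then be looked up in a table of known extensions, alongside the MIME type, to choose a parser.
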
    opened by scraperdragon 8
  • Failure to load with Python 3.10

    Attempting to use messytables with Python 3.10 results in the following error:

      File "/layers/google.python.pip/pip/lib/python3.10/site-packages/messytables/core.py", line 2, in <module>
        from collections import Mapping
    ImportError: cannot import name 'Mapping' from 'collections' (/opt/python3.10/lib/python3.10/collections/__init__.py)
    

    This is because the deprecated alias for Mapping in the collections package was removed in Python 3.10; the class itself has lived in collections.abc since Python 3.3.

    core.py should be updated to take account of this.
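
One common way to write the import so that it works on both old and new interpreters is shown below. Whether core.py still needs the fallback branch depends on which Python versions the project wants to support, which is an open question here.

```python
try:
    # Python 3.3 and later: the abstract base classes live here.
    from collections.abc import Mapping
except ImportError:
    # Python 2 fallback; only needed if 2.x support is kept.
    from collections import Mapping
```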

    opened by davidharcombe 0
  • Bump lxml from 4.3.4 to 4.9.1

    Bumps lxml from 4.3.4 to 4.9.1.

    Changelog

    Sourced from lxml's changelog.

    4.9.1 (2022-07-01)

    Bugs fixed

    • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

    4.9.0 (2022-06-01)

    Bugs fixed

    • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

    Other changes

    • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

    • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

    • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

    4.8.0 (2022-02-17)

    Features added

    • GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

    • The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

    Bugs fixed

    • GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

    ... (truncated)

    Commits
    • d01872c Prevent parse failure in new test from leaking into later test runs.
    • d65e632 Prepare release of lxml 4.9.1.
    • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
    • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
    • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
    • b9f7074 Remove debug print from test.
    • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
    • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
    • 853c9e9 Prepare release of 4.9.0.
    • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • messytables guesses wrong type for decimal number

    Describe the bug: Messytables should guess decimals correctly, respecting the locale configuration. For example, in Germany the comma is used as the decimal separator, but a value such as 1,200 is guessed as type "text".

    This issue was initially reported as ckan issue https://github.com/ckan/ckan/issues/5769 where I recognized it.

    The type guessing seems to happen here: https://github.com/okfn/messytables/blob/51b736892a48e420ab313675f54901c77b446dec/messytables/types.py and seems to be locale-specific. (I think the magic happens in line 100: value = locale.atof(value).)

    Unfortunately, Python seems to recognize a dot as the decimal point even when a German locale is set, which I could reproduce in my local environment:

    >>> locale.getlocale()
    ('de_DE', 'cp1252')
    >>> locale.atof('1,200')
    
    Traceback (most recent call last):
      File "<pyshell#35>", line 1, in <module>
        locale.atof('1,200')
      File "C:\Program Files\Python27\lib\locale.py", line 318, in atof
        return func(string)
    ValueError: invalid literal for float(): 1,200
    >>> locale.localeconv()
    {'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
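
One possible way around the process-wide locale is to pass the separators explicitly. The sketch below is an assumption about how that could look, not a change proposed in the issue; the name parse_decimal and its parameters are hypothetical.

```python
def parse_decimal(text, decimal_point=",", thousands_sep="."):
    """Parse a number written with explicit separators.

    Raises ValueError if the cleaned text is not a valid float literal.
    """
    # Drop grouping separators first, then normalise the decimal mark to '.'.
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_point, ".")
    return float(cleaned)
```

With the German defaults, "1,200" is read as one point two; passing decimal_point="." and thousands_sep="," reads the same text as twelve hundred.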
    
    opened by wrinklenose 1
  • test_attempt_read_encrypted_no_password_xls failure in Python 3.7+

    This line specifies an error message. In the test, the text of the exception caused by the code under test is expected to match exactly.

    errmsg = "Can't read Excel file: XLRDError('Workbook is encrypted',)"
    

    When running tests on Python 3.7 and 3.8 this fails, because their outputs do not contain the comma (probably due to this change in Python 3.7, I'm guessing).
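
Python 3.7 did change the repr of exceptions to drop the trailing comma, so messages built from repr() differ between versions. A version-independent alternative is to build the message from the exception's type name and text. The sketch below uses a stand-in exception class and a hypothetical helper; it is not the code under test.

```python
class FakeReadError(Exception):
    """Stand-in for a library error such as the one quoted above."""


def describe(exc):
    """Build a message from the exception type and text rather than repr()."""
    return "Can't read Excel file: {}: {}".format(type(exc).__name__, exc)
```

A test can then compare against a fixed string, or simply check that the expected text appears in str(exc), without depending on how a given interpreter formats repr().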

    opened by StevenMaude 0
  • requirements-test.txt should have xlrd==1.2.0 (or >=) for Python 3.8+ tests

    opened by StevenMaude 0
Releases(0.15.1)
Owner
Open Knowledge Foundation
Also find us at: @frictionlessdata @opentrials @openspending @openknowledge-archive
functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it

Coursera 188 Nov 24, 2022
Clean APIs for data cleaning. Python implementation of R package Janitor

pyjanitor pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data. Why janitor? Originally a port of

Eric Ma 1.1k Jan 1, 2023
Build, test, deploy, iterate - Dev and prod tool for data science pipelines

Prodmodel is a build system for data science pipelines. Users, testers, contributors are welcome! Motivation · Concepts · Installation · Usage · Contr

Prodmodel 53 Nov 29, 2022
Microsoft Azure provides a wide number of services for managing and storing data

Microsoft Azure provides a wide number of services for managing and storing data. One product is Microsoft Azure SQL. Which gives us the capability to create and manage instances of SQL Servers hosted in the cloud. This project, demonstrates how to use these services to manage data we collect from different sources.

Riya Vijay Vishwakarma 1 Dec 12, 2021
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

Command line utilities for tabular data files This is a set of command line utilities for manipulating large tabular data files. Files of numeric and

eBay 1.4k Jan 9, 2023
Implementation of fast algorithms for Maximum Spanning Tree (MST) parsing that includes fast ArcMax+Reweighting+Tarjan algorithm for single-root dependency parsing.

Fast MST Algorithm Implementation of fast algorithms for (Maximum Spanning Tree) MST parsing that includes fast ArcMax+Reweighting+Tarjan algorithm fo

Miloš Stanojević 11 Oct 14, 2022
Course-parsing - Parsing Course Info for NIT Kurukshetra

Parsing Course Info for NIT Kurukshetra Overview This repository houses code for

Saksham Mittal 3 Feb 3, 2022
A collection of robust and fast processing tools for parsing and analyzing web archive data.

ChatNoir Resiliparse A collection of robust and fast processing tools for parsing and analyzing web archive data. Resiliparse is part of the ChatNoir

ChatNoir 24 Nov 29, 2022
ckan 3.6k Dec 27, 2022
Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.

python-tabulate Pretty-print tabular data in Python, a library and a command-line utility. The main use cases of the library are: printing small table

Sergey Astanin 1.5k Jan 6, 2023
Display tabular data in a visually appealing ASCII table format

PrettyTable Installation Install via pip: python -m pip install -U prettytable Install latest development version: python -m pip install -U git+https

Jazzband 924 Jan 5, 2023
A standard framework for modelling Deep Learning Models for tabular data

PyTorch Tabular aims to make Deep Learning with Tabular data easy and accessible to real-world cases and research alike.

null 801 Jan 8, 2023
Implementation of TabTransformer, attention network for tabular data, in Pytorch

Tab Transformer Implementation of Tab Transformer, attention network for tabular data, in Pytorch. This simple architecture came within a hair's bread

Phil Wang 420 Jan 5, 2023
A Python package for manipulating 2-dimensional tabular data structures

datatable This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame

H2O.ai 1.6k Jan 5, 2023
Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

What is Vaex? Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular data

vaex io 7.7k Jan 1, 2023
A Python toolkit for processing tabular data

meza: A Python toolkit for processing tabular data Index Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installat

Reuben Cummings 401 Dec 19, 2022
Unofficial implementation of "TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images"

TableNet Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from

Jainam Shah 243 Dec 30, 2022
Python library to extract tabular data from images and scanned PDFs

Overview ExtractTable - API to extract tabular data from images and scanned PDFs The motivation is to make it easy for developers to extract tabular d

Org. Account 165 Dec 31, 2022
Boosted neural network for tabular data

XBNet - Xtremely Boosted Network Boosted neural network for tabular data XBNet is an open source project which is built with PyTorch which tries to co

Tushar Sarkar 175 Jan 4, 2023
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

Gowthami Somepalli 284 Dec 21, 2022