meza: A Python toolkit for processing tabular data

Index

Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installation | Project Structure | Design Principles | Scripts | Contributing | Credits | More Info | License

Introduction

meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types.

With meza, you can

  • Read csv/xls/xlsx/mdb/dbf files, and more!
  • Type cast records (date, float, text...)
  • Process Uñicôdë text
  • Lazily stream files by default
  • and much more...

Requirements

meza has been tested and is known to work on Python 3.6, 3.7, and 3.8; and PyPy 3.6.

Optional Dependencies

Function                      Dependency    Installation                   File type / extension
meza.io.read_mdb              mdbtools      sudo port install mdbtools     Microsoft Access / mdb
meza.io.read_html             lxml [1]      pip install lxml               HTML / html
meza.convert.records2array    NumPy [2]     pip install numpy              n/a
meza.convert.records2df       pandas        pip install pandas             n/a

Notes

[1] If lxml isn't present, read_html will fall back to the built-in Python HTML parser.
[2] records2array can be used without NumPy by passing native=True in the function call. This will convert the records into a list of native array.array objects.

Motivation

Why I built meza

pandas is great, but installing it isn't exactly a walk in the park, and it doesn't play nice with PyPy. I designed meza to be a lightweight, easy to install, less featureful alternative to pandas. I also optimized meza for low memory usage, PyPy compatibility, and functional programming best practices.

Why you should use meza

meza provides a number of benefits over, and differences from, similar libraries such as pandas. For a detailed comparison, please check out the FAQ.

Hello World

A simple data processing example is shown below:

First create a simple csv file (in bash)

printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv

Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.

>>> from meza import io, process as pr, convert as cv

>>> # Load the csv file
>>> records = io.read_csv('data.csv')

>>> # `records` are iterators over the rows
>>> row = next(records)
>>> row
{'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}

>>> # Let's replace the first row so as not to lose any data
>>> records = pr.prepend(records, row)

# Guess column types. Note: `detect_types` returns a new `records`
# generator since it consumes rows during type detection
>>> records, result = pr.detect_types(records)
>>> {t['id']: t['type'] for t in result['types']}
{'col1': 'text', 'col2': 'date', 'col3': 'int'}

# Now type cast the records. Note: most `meza.process` functions return
# generators, so let's wrap the result in a list to view the data
>>> casted = list(pr.type_cast(records, result['types']))
>>> casted[0]
{'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}

# Cut out the first column of data and merge the rows to get the max value
# of the remaining columns. Note: since `merge` (by definition) will always
# contain just one row, it is returned as is (not wrapped in a generator)
>>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
>>> merged = pr.merge(cut_recs, pred=bool, op=max)
>>> merged
{'col2': datetime.date(2015, 1, 1), 'col3': 3}

# Now write merged data back to a new csv file.
>>> io.write('out.csv', cv.records2csv(merged))

# View the result
>>> with open('out.csv', encoding='utf-8') as f:
...     f.read()
'col2,col3\n2015-01-01,3\n'

Usage

meza is intended to be used directly as a Python library.

Usage Index

Reading data | Processing data | Writing data | Cookbook

Reading data

meza can read both filepaths and file-like objects. Additionally, all readers return equivalent records iterators, i.e., a generator of dictionaries with keys corresponding to the column names.

>>> from io import StringIO
>>> from meza import io

"""Read a filepath"""
>>> records = io.read_json('path/to/file.json')

"""Read a file like object and de-duplicate the header"""
>>> f = StringIO('col,col\nhello,world\n')
>>> records = io.read_csv(f, dedupe=True)

"""View the first row"""
>>> next(records)
{'col': 'hello', 'col_2': 'world'}

"""Read the 1st sheet of an xls file object opened in text mode."""
# Also, sanitize the header names by converting them to lowercase and
# replacing whitespace and invalid characters with `_`.
>>> with open('path/to/file.xls', encoding='utf-8') as f:
...     for row in io.read_xls(f, sanitize=True):
...         # do something with the `row`
...         pass

"""Read the 2nd sheet of an xlsx file object opened in binary mode"""
# Note: sheets are zero indexed
>>> with open('path/to/file.xlsx', 'rb') as f:
...     records = io.read_xls(f, encoding='utf-8', sheet=1)
...     first_row = next(records)
...     # do something with the `first_row`

"""Read any recognized file"""
>>> records = io.read('path/to/file.geojson')
>>> f.seek(0)
>>> records = io.read(f, ext='csv', dedupe=True)

Please see readers for a complete list of available readers and recognized file types.
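
For example, reading a table from a Microsoft Access database looks just like reading a csv. A minimal sketch (it assumes mdbtools is installed, per the Optional Dependencies table; the path and table name below are placeholders):

>>> from meza import io

>>> # `read_mdb` takes the filepath plus an optional table name
>>> records = io.read_mdb('path/to/file.mdb', table='table_name')
>>> # like every other reader, it returns a lazy records generator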

Processing data

Numerical analysis (à la pandas) [3]

In the following example, pandas equivalent methods are preceded by -->.

>>> import itertools as it
>>> import random

>>> from io import StringIO
>>> from meza import io, process as pr, convert as cv, stats

# Create some data in the same structure as what the various `read...`
# functions output
>>> header = ['A', 'B', 'C', 'D']
>>> data = [(random.random() for _ in range(4)) for x in range(7)]
>>> df = [dict(zip(header, d)) for d in data]
>>> df[0]
{'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}

"""Sort records by the value of column `B` --> df.sort_values(by='B')"""
>>> next(pr.sort(df, 'B'))
{'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}

"""Select column `A` --> df['A']"""
>>> next(pr.cut(df, ['A']))
{'A': 0.53908170489952006}

"""Select the first three rows of data --> df[0:3]"""
>>> len(list(it.islice(df, 3)))
3

"""Select all data whose value for column `A` is less than 0.5
--> df[df.A < 0.5]
"""
>>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
{'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}

# Note: since `aggregate` and `merge` (by definition) return just one row,
# they return them as is (not wrapped in a generator).
"""Calculate the mean of column `A` across all data --> df.mean()['A']"""
>>> pr.aggregate(df, 'A', stats.mean)['A']
0.5410437473067938

"""Calculate the sum of each column across all data --> df.sum()"""
>>> pr.merge(df, pred=bool, op=sum)
{'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}

Text processing (à la csvkit) [4]

In the following example, csvkit equivalent commands are preceded by -->.

First create a few simple csv files (in bash)

printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv

Now we can read the files, manipulate the data, convert the manipulated data to json, and write the json back to a new file. Also, note that since all readers return equivalent records iterators, you can use them interchangeably (in place of read_csv) to open any supported file. E.g., read_xls, read_sqlite, etc.

>>> import itertools as it

>>> from meza import io, process as pr, convert as cv

"""Combine the files into one iterator
--> csvstack file1.csv file2.csv
"""
>>> records = io.join('file1.csv', 'file2.csv')
>>> next(records)
{'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
>>> next(it.islice(records, 4, None))
{'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}

# Now let's create a persistent records list
>>> records = list(io.read_csv('file1.csv'))

"""Sort records by the value of column `col_2`
--> csvsort -c col_2 file1.csv
"""
>>> next(pr.sort(records, 'col_2'))
{'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}

"""Select column `col_2` --> csvcut -c col_2 file1.csv"""
>>> next(pr.cut(records, ['col_2']))
{'col_2': 'dill'}

"""Select all data whose value for column `col_2` contains `jan`
--> csvgrep -c col_2 -m jan file1.csv
"""
>>> next(pr.grep(records, [{'pattern': 'jan'}], ['col_2']))
{'col_1': '3', 'col_2': 'jane', 'col_3': 'female'}

"""Convert a csv file to json --> csvjson -i 4 file1.csv"""
>>> io.write('file.json', cv.records2json(records))

# View the result
>>> with open('file.json', encoding='utf-8') as f:
...     f.read()
'[{"col_1": "1", "col_2": "dill", "col_3": "male"}, {"col_1": "2",
"col_2": "bob", "col_3": "male"}, {"col_1": "3", "col_2": "jane",
"col_3": "female"}]'

Geo processing (à la mapbox) [5]

In the following example, mapbox equivalent commands are preceded by -->.

First create a geojson file (in bash)

echo '{"type": "FeatureCollection","features": [' > file.geojson
echo '{"type": "Feature", "id": 11, "geometry": {"type": "Point", "coordinates": [10, 20]}},' >> file.geojson
echo '{"type": "Feature", "id": 12, "geometry": {"type": "Point", "coordinates": [5, 15]}}]}' >> file.geojson

Now we can open the file, split the data by id, and finally convert the split data to a new geojson file-like object.

>>> from meza import io, process as pr, convert as cv

# Load the geojson file and peek at the results
>>> records, peek = pr.peek(io.read_geojson('file.geojson'))
>>> peek[0]
{'lat': 20, 'type': 'Point', 'lon': 10, 'id': 11}

"""Split the records by feature ``id`` and select the first feature
--> geojsplit -k id file.geojson
"""
>>> splits = pr.split(records, 'id')
>>> feature_records, name = next(splits)
>>> name
11

"""Convert the feature records into a GeoJSON file-like object"""
>>> geojson = cv.records2geojson(feature_records)
>>> geojson.readline()
'{"type": "FeatureCollection", "bbox": [10, 20, 10, 20], "features": '
'[{"type": "Feature", "id": 11, "geometry": {"type": "Point", '
'"coordinates": [10, 20]}, "properties": {"id": 11}}], "crs": {"type": '
'"name", "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}}}'

# Note: you can also write back to a file as shown previously
# io.write('file.geojson', geojson)

Writing data

meza can persist records to disk via the following functions:

  • meza.convert.records2csv
  • meza.convert.records2json
  • meza.convert.records2geojson

Each function returns a file-like object that you can write to disk via meza.io.write('/path/to/file', result).

>>> from meza import io, convert as cv
>>> from io import StringIO

# First let's create a simple tsv file like object
>>> f = StringIO('col1\tcol2\nhello\tworld\n')
>>> f.seek(0)

# Next create a records list so we can reuse it
>>> records = list(io.read_tsv(f))
>>> records[0]
{'col1': 'hello', 'col2': 'world'}

# Now we're ready to write the records data to file

"""Create a csv file like object"""
>>> cv.records2csv(records).readline()
'col1,col2\n'

"""Create a json file like object"""
>>> cv.records2json(records).readline()
'[{"col1": "hello", "col2": "world"}]'

"""Write back csv to a filepath"""
>>> io.write('file.csv', cv.records2csv(records))
>>> with open('file.csv', encoding='utf-8') as f_in:
...     f_in.read()
'col1,col2\nhello,world\n'

"""Write back json to a filepath"""
>>> io.write('file.json', cv.records2json(records))
>>> with open('file.json', encoding='utf-8') as f_in:
...     f_in.readline()
'[{"col1": "hello", "col2": "world"}]'

Cookbook

Please see the cookbook or ipython notebook for more examples.

Notes

[3] http://pandas.pydata.org/pandas-docs/stable/10min.html#min
[4] https://csvkit.readthedocs.org/en/0.9.1/cli.html#processing
[5] https://github.com/mapbox?utf8=%E2%9C%93&query=geojson

Interoperability

meza plays nicely with NumPy and friends out of the box

Setup

>>> from meza import process as pr

# First create some records and types. Also, convert the records to a list
# so we can reuse them.
>>> records = [{'a': 'one', 'b': 2}, {'a': 'five', 'b': 10, 'c': 20.1}]
>>> records, result = pr.detect_types(records)
>>> records, types = list(records), result['types']
>>> types
[{'type': 'text', 'id': 'a'},
 {'type': 'int', 'id': 'b'},
 {'type': 'float', 'id': 'c'}]

from records to pandas.DataFrame to records

>>> import pandas as pd
>>> from meza import convert as cv

"""Convert the records to a DataFrame"""
>>> df = cv.records2df(records, types)
>>> df
        a   b   c
0   one   2   NaN
1  five  10  20.1
# Alternatively, you can do `pd.DataFrame(records)`

"""Convert the DataFrame back to records"""
>>> next(cv.df2records(df))
{'a': 'one', 'b': 2, 'c': nan}

from records to arrays to records

>>> import numpy as np

>>> from array import array
>>> from meza import convert as cv

"""Convert records to a structured array"""
>>> recarray = cv.records2array(records, types)
>>> recarray
rec.array([('one', 2, nan), ('five', 10, 20.100000381469727)],
          dtype=[('a', 'O'), ('b', '<i4'), ('c', '<f4')])
>>> recarray.b
array([ 2, 10], dtype=int32)

"""Convert records to a native array"""
>>> narray = cv.records2array(records, types, native=True)
>>> narray
[[array('u', 'a'), array('u', 'b'), array('u', 'c')],
[array('u', 'one'), array('u', 'five')],
array('i', [2, 10]),
array('f', [0.0, 20.100000381469727])]

"""Convert a 2-D NumPy array to a records generator"""
>>> data = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
>>> data
array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)
>>> next(cv.array2records(data))
{'column_1': 1, 'column_2': 2, 'column_3': 3}

"""Convert the structured array back to a records generator"""
>>> next(cv.array2records(recarray))
{'a': 'one', 'b': 2, 'c': nan}

"""Convert the native array back to records generator"""
>>> next(cv.array2records(narray, native=True))
{'a': 'one', 'b': 2, 'c': 0.0}

Installation

(You are using a virtualenv, right?)

At the command line, install meza using either pip (recommended)

pip install meza

or easy_install

easy_install meza

Please see the installation doc for more details.

Project Structure

├── CONTRIBUTING.rst
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.rst
├── data
│   ├── converted/*
│   └── test/*
├── dev-requirements.txt
├── docs
│   ├── AUTHORS.rst
│   ├── CHANGES.rst
│   ├── COOKBOOK.rst
│   ├── FAQ.rst
│   ├── INSTALLATION.rst
│   └── TODO.rst
├── examples
│   ├── usage.ipynb
│   └── usage.py
├── helpers/*
├── manage.py
├── meza
│   ├── __init__.py
│   ├── convert.py
│   ├── dbf.py
│   ├── fntools.py
│   ├── io.py
│   ├── process.py
│   ├── stats.py
│   ├── typetools.py
│   └── unicsv.py
├── optional-requirements.txt
├── py2-requirements.txt
├── requirements.txt
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   ├── standard.rc
│   ├── test_fntools.py
│   ├── test_io.py
│   └── test_process.py
└── tox.ini

Design Principles

  • prefer functions over objects
  • provide enough functionality out of the box to easily implement the most common data analysis use cases
  • make conversion between records, arrays, and DataFrames dead simple
  • whenever possible, lazily read objects and stream the result [6]

[6] Notable exceptions are meza.process.group, meza.process.sort, meza.io.read_dbf, meza.io.read_yaml, and meza.io.read_html. These functions read the entire contents into memory up front.
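
As a quick illustration of the last point, a reader doesn't touch the file until you actually pull rows from it (a minimal sketch):

>>> from meza import io

>>> records = io.read_csv('data.csv')  # nothing is read yet
>>> row = next(records)                # the file is only read on demand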

Scripts

meza comes with a built-in task manager, manage.py

Setup

pip install -r dev-requirements.txt

Examples

Run the Python linter and nose tests

manage lint
manage test

Contributing

Please mimic the coding style/conventions used in this repo. If you add new classes or functions, please add the appropriate doc blocks with examples. Also, make sure the python linter and nose tests pass.

Please see the contributing doc for more details.

Credits

Shoutouts to csvkit, messytables, and pandas for heavily inspiring meza.

More Info

License

meza is distributed under the MIT License.

Comments
  • At the end of the generator, catch StopIteration and return None

    When looping through the generator to get the rows/records, or when converting the records to an array/dataframe, I am getting RuntimeError: generator raised StopIteration. This seems to be a Python 3.7 issue related to PEP 479.

    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/meza/io.py", line 664, in read_mdb
        values = next(csv.reader(next_line, **kwargs))
    StopIteration
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "main.py", line 7, in <module>
        arr = convert.records2array(records, result["types"])
      File "/usr/local/lib/python3.7/site-packages/meza/convert.py", line 710, in records2array
        data = [tuple(r.get(id_) for id_ in ids) for r in records]
      File "/usr/local/lib/python3.7/site-packages/meza/convert.py", line 710, in <listcomp>
        data = [tuple(r.get(id_) for id_ in ids) for r in records]
    RuntimeError: generator raised StopIteration
    

    To fix this issue, when StopIteration is raised in read_mdb(), catch the exception and return None instead of having StopIteration transformed into a RuntimeError.
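
    A sketch of the PEP 479-compliant pattern (illustrative, not the actual read_mdb source): inside a generator, catch StopIteration from the inner next() call and return instead of letting it propagate.

    def read_rows(reader):
        # hypothetical generator showing the fix: `return` ends the
        # generator cleanly instead of raising RuntimeError under PEP 479
        while True:
            try:
                row = next(reader)  # raises StopIteration when exhausted
            except StopIteration:
                return
            yield row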

    opened by christian-ensodata 5
  • Datetimes with all-zero time component (exact midnight) are detected as dates

    Datetime values with all-zero time components are detected as dates by process.detect_types. This is because typetools.is_datetime explicitly checks that the time component is not '00:00:00' https://github.com/reubano/meza/blob/a56b927ba5264e450c8358a1cf5218402a98b6e8/meza/typetools.py#L295 . I'd argue that if a time value is present at all then the value should be treated as a datetime as this likely better represents the intent of the source. For example, a database report may include a datetime column but all the values in a particular output happened to occur at midnight. The down-casting behaviour is similar to that of floats https://github.com/reubano/meza/issues/34.

    Example Test Case

    diff --git a/tests/test_process.py b/tests/test_process.py
    index cc538a2..f90556d 100644
    --- a/tests/test_process.py
    +++ b/tests/test_process.py
    @@ -76,6 +76,12 @@ class Test:
             nt.assert_equal(Decimal('0.87'), result['confidence'])
             nt.assert_false(result['accurate'])
     
    +    def test_detect_types_datetimes_midnight(self):
    +        records = it.repeat({"foo": "2000-01-01 00:00:00"})
    +        records, result = pr.detect_types(records)
    +
    +        nt.assert_equal(result["types"], [{"id": "foo", "type": "datetime"}])
    +
         def test_fillempty(self):
             records = [
                 {'a': '1', 'b': '27', 'c': ''},
    

    Fails with:

    AssertionError: Lists differ: [{'id': 'foo', 'type': 'date'}] != [{'id': 'foo', 'type': 'datetime'}]
    
    First differing element 0:
    {'id': 'foo', 'type': 'date'}
    {'id': 'foo', 'type': 'datetime'}
    
    - [{'id': 'foo', 'type': 'date'}]
    + [{'id': 'foo', 'type': 'datetime'}]
    ?                             ++++
    

    Making the time part non-zero will pass the test.

    Potential Solutions

    • Prefer stricter type inference for datetimes by default; e.g. if it has both a date and a time field, it's a datetime.
    • Allow stricter type inference for datetimes as an option e.g. a kwarg to detect_types that is passed down to is_datetime to change the behaviour from "can this only be parsed as a datetime" to "this is a datetime"
    • Any other ideas of course!

    I am happy to implement the changes required after a decision is made on the correct behaviour :smile:
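
    A rough sketch of the first option (purely illustrative, not meza code): treat any value that parses with an explicit time field as a datetime, even at exact midnight.

    from datetime import datetime

    def is_strict_datetime(value, fmt='%Y-%m-%d %H:%M:%S'):
        # if the value carries an explicit time component, call it a
        # datetime -- even when that time happens to be 00:00:00
        try:
            datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            return False
        return True

    assert is_strict_datetime('2000-01-01 00:00:00')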

    bug 
    opened by SteadBytes 5
  • type casting assumes 'month first' for ambiguous dates

    Right now, type detection infers date, datetime, or time types without taking into account that 01/02/2002 can be either February the 1st or January the 2nd, depending on whether the date format is DD/MM/YYYY or MM/DD/YYYY.

    This might be undecidable in some rare cases, but in general it's possible given enough values to decide between both formats.

    One possible way to handle this in meza is to replace the current string representation of a field's type with a higher-level datatype that can carry extra information. For instance:

    datetime_type = namedtuple('DateTimeType', ['format'])
    

    Basically, use a representation that takes optional extra information about the type.
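
    Under that scheme, a detected column type could carry its format along with it, e.g. (illustrative only):

    from collections import namedtuple

    DateTimeType = namedtuple('DateTimeType', ['format'])

    # the format disambiguates values like 01/02/2002
    col_type = DateTimeType(format='DD/MM/YYYY')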

    bug 
    opened by amirouche 3
  • requests >=2.10.0?

    Hello. I'm trying to install meza with pip-tools, but am getting a dependency constraint resolution error between this package and the awsebcli package. meza calls for requests >= 2.10.0, and awsebcli calls for <=2.9.1.

    Is there a specific reason for meza to require >= 2.10.0? Could this requirement be loosened at all, since requests is pretty stable?

    Thanks!

    opened by theunraveler 3
  • Config file for pyup.io

    Hi there and thanks for using pyup.io!

    Since you are using a non-default config I've created one for you.

    There are a lot of things you can configure on top of that, so make sure to check out the docs to see what I can do for you.

    opened by pyup-bot 3
  • Fix setup fails

    Fixes #19. This commit fixes a setup.py error: "error in meza setup command: 'tests_require' must be a string or list of strings containing valid project/version requirement specifiers; Unordered types are not allowed".

    opened by adhaamehab 2
  • In py3.7+, read_mdb should not use StopIteration

    Here is a simple example:

    from meza.io import read_mdb
    import pandas as pd
    process_data = read_mdb(file_path, "ProcessData")
    process_data_df = pd.DataFrame(process_data)
    process_data.close()
    

    Before py3.7, this works OK. On py3.7+, it throws "RuntimeError: generator raised StopIteration". This is due to PEP 479 (https://peps.python.org/pep-0479/), which changed how StopIteration is handled inside generators.

    I tried this with v 0.45.5, which should be py3.7 compatible, and still see the error.

    opened by agileminor 1
  • ValueError converting zero-value currencies

    Type detection raises a ValueError for currencies with a value of zero, e.g. '$0' or '0$'.

    >>> import itertools as it
    >>> from meza import process as pr
    >>> 
    >>> records = it.repeat({"money": "$0"})
    >>> records, result = pr.detect_types(records)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/meza/meza/process.py", line 333, in detect_types
        for t in tt.guess_type_by_value(record):
      File "/meza/meza/typetools.py", line 172, in guess_type_by_value
        result = type_test(g['func'], g['type'], key, value)
      File "/meza/meza/typetools.py", line 33, in type_test
        passed = test(value)
      File "//meza/meza/fntools.py", line 509, in is_int
        passed = is_numeric(content, thousand_sep, decimal_sep)
      File "/meza/meza/fntools.py", line 489, in is_numeric
        passed = int(content) == 0
    ValueError: invalid literal for int() with base 10: '$0'
    

    This is caused by is_numeric casting the original, unstripped content to an int: https://github.com/reubano/meza/blob/110f855fa95bcd9665018358059d9df25de1dedf/meza/fntools.py#L489

    As far as I can tell, this should only be an issue when the value starts with 0 and the only non-numeric characters are currency symbols. Here is a failing test case for this:

    diff --git a/tests/test_fntools.py b/tests/test_fntools.py
    index 922bc17..f8cdc75 100644
    --- a/tests/test_fntools.py
    +++ b/tests/test_fntools.py
    @@ -45,6 +45,11 @@ class TestIterStringIO:
             nt.assert_false(ft.is_numeric(None))
             nt.assert_false(ft.is_numeric(''))
     
    +    def test_is_numeric_0_currency(self):
    +        for sym in ft.CURRENCIES:
    +            nt.assert_true(ft.is_numeric(f'0{sym}'))
    +            nt.assert_true(ft.is_numeric(f'{sym}0'))
    +
         def test_is_int(self):
             nt.assert_false(ft.is_int('5/4/82'))
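
    One way the check could be fixed (a sketch under the assumption that stripping currency symbols is enough; not an actual patch) is to remove the symbols before the integer cast:

    def is_zero_numeric(content, currencies='$£€¥'):
        # hypothetical helper: drop leading/trailing currency symbols
        # before attempting the `int(content) == 0` comparison
        stripped = content.strip().strip(currencies)
        try:
            return int(stripped) == 0
        except ValueError:
            return False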
    
    opened by SteadBytes 1
  • Unexpected warnings from iterators

    Hello guys,

    I'm getting a plethora of warnings from meza; they look like this:

    /home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:342: DeprecationWarning: generator 'gen_types' raised StopIteration
      types = list(gen_types(tally))
    /home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:335: DeprecationWarning: generator 'guess_type_by_value' raised StopIteration
      for t in tt.guess_type_by_value(record):

    The version is 0.41.1. I checked the code and it doesn't actually raise StopIteration, so I have no idea where the warning comes from. Any insight?

    can't reproduce 
    opened by chobeat 1
  • Writing using different dialects

    The documentation provides no examples on how to do this and I cannot find tests that cover this feature. From the code I understand more or less how it should work but I'm not sure. Is there some example somewhere?

    opened by chobeat 1
  • Added quiet bool param for mdb_open to suppress output of table names

    The first execution of the next(mdb) function generates output to STDOUT as a side effect of the subprocess.check_call() call. I changed this to a subprocess.check_output() call and just discard the output. I think we're really only looking for the error status.

    I had to reorder some of the other lines to support the quiet boolean parameter.

    opened by petonic 1
  • test failure in test_excel_html_export with io.read_html

    Testing meza-0.46.0, I get this error (py38-py310):

    Test for reading an html table exported from excel ... FAIL
    
    ======================================================================
    FAIL: Test for reading an html table exported from excel
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/sw/lib/python3.9/site-packages/nose/case.py", line 197, in runTest
        self.test(*self.arg)
      File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/tests/test_io.py", line 354, in test_excel_html_export
        nt.assert_equal(expected, next(records))
    AssertionError: {'sparse_data': 'Iñtërnâtiônàližætiøn', 'so[61 chars]dam'} != {'13_width_75_some_date': '13 class=xl24 al[123 chars]dam'}
    - {'some_date': '05/04/82',
    + {'13_width_75_some_date': '13 class=xl24 align=right>05/04/82',
    +  '2_width_150_unicode_test': 'Ādam',
    -  'some_value': '234',
    +  '75_some_value': 'right>234',
    ?   +++              ++++++
    
    -  'sparse_data': 'Iñtërnâtiônàližætiøn',
    ?                                       ^
    
    +  '75_sparse_data': 'Iñtërnâtiônàližætiøn'}
    ?   +++                                    ^
    
    -  'unicode_test': 'Ādam'}
    
    ----------------------------------------------------------------------
    

    The output in the AssertionError line seems all mangled with the attributes from the different html table elements sprinkled in. If I remove the html attributes for the table in data/test/test.htm, then the test passes. I notice that io.read_html uses BeautifulSoup. I have beautifulsoup-4.10.0 and soupsieve-2.3.1 installed.

    opened by nieder 0
  • io.py incompatible with PyYAML-6.0

    When building meza, this test failure happens:

    Doctest: meza.io.read_yaml ... FAIL
    
    ======================================================================
    FAIL: Doctest: meza.io.read_yaml
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/sw/lib/python3.9/doctest.py", line 2205, in runTest
        raise self.failureException(self.format_failure(new.getvalue()))
    AssertionError: Failed doctest test for meza.io.read_yaml
      File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 1256, in read_yaml
    
    ----------------------------------------------------------------------
    File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 1279, in meza.io.read_yaml
    Failed example:
        next(records) == {
            'text': 'Chicago Reader',
            'float': 1.0,
            'datetime': dt(1971, 1, 1, 4, 14),
            'boolean': True,
            'time': '04:14:00',
            'date': date(1971, 1, 1),
            'integer': 40}
    Exception raised:
        Traceback (most recent call last):
          File "/sw/lib/python3.9/doctest.py", line 1334, in __run
            exec(compile(example.source, filename, "single",
          File "<doctest meza.io.read_yaml[3]>", line 1, in <module>
            next(records) == {
          File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 551, in read_any
            for line in _read_any(f, reader, args, **kwargs):
          File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 470, in _read_any
            for num, line in enumerate(reader(f, *args, **kwargs)):
        TypeError: load() missing 1 required positional argument: 'Loader'
    

    In https://github.com/reubano/meza/blob/370c292e1e07e738006c2f57a9ff7399e775df44/meza/io.py#L1289 , changing yaml.load to yaml.safe_load takes care of the problem (not sure what version yaml.safe_load was introduced). See yaml/pyyaml#576
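
    The suggested change in context (a sketch of the relevant call, not the full read_yaml source):

    import yaml

    with open('file.yaml') as f:
        # yaml.load(f) fails on PyYAML 6.0, which made the Loader
        # argument required; safe_load supplies SafeLoader and works
        # on both old and new PyYAML
        data = yaml.safe_load(f)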

    opened by nieder 0
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position... while reading Microsoft Access .mdb file

    I am trying to read a Microsoft Access [.mdb] file (created by ChemFinder on Windows), but I am getting a UnicodeDecodeError: 'utf-8' codec... error, despite specifying the encoding recovered by meza.io.get_encoding(), which is TIS-620.

    I would appreciate any suggestions...

    Details below:

    import meza
    fn = 'test.mdb'
    enc = meza.io.get_encoding(fn)
    print(enc)  # TIS-620
    records = meza.io.read_mdb(fn, encoding=enc)
    z = list(records)
    ~/anaconda3/lib/python3.8/site-packages/meza/io.py in read_mdb(filepath, table, **kwargs)
        636     # https://stackoverflow.com/a/17698359/408556
        637     with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
    --> 638         first_line = StringIO(str(pipe.readline()))
        639         names = next(csv.reader(first_line, **kwargs))
        640         uscored = ft.underscorify(names) if sanitize else names
    
    ~/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
        320         # decode input (taking the buffer into account)
        321         data = self.buffer + input
    --> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
        323         # keep undecoded input until the next call
        324         self.buffer = data[consumed:]
    
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 70: invalid start byte
    

    I tried to pass encoding to pkwargs used in Popen (in meza/io.py)

        pkwargs = {'stdout': PIPE, 'bufsize': 1, 'universal_newlines': True}
    --> pkwargs['encoding'] = kwargs.get('encoding', None)
    
        # https://stackoverflow.com/a/2813530/408556
        # https://stackoverflow.com/a/17698359/408556
        with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
    

    but it does not resolve the issue. With this modification I am getting:

    UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 327: character maps to <undefined>
    
    bug can't reproduce 
    opened by ryszard314159 1
  • Getting error with large mdb file

    When I execute the following in Python 3 on Ubuntu (it works fine with a small MDB file containing one table):

    records = io.read_mdb(db_file_path) # only file path, no file objects
    next(records)
    
    read: Is a directory
    Couldn't read first page.
    Couldn't open database.
    
    
    bug can't reproduce 
    opened by hemanth-sp 1
  • Allow `process.detect_types` to match last type instead of the first

    Float values with a zero fractional component, e.g. '0.0', '1.0', '1.00', are detected as int instead of float. This is because they can be parsed as int according to fntools.is_int. Although the data could be interpreted as integers, given that the source has a decimal place I would argue that detect_types should not down-cast to an integer. For example, database reports may include data from float/decimal columns which, just by chance, have no fractional component; this doesn't mean they should not be treated as floats.

    Example Test Case

    diff --git a/tests/test_process.py b/tests/test_process.py
    index cc538a2..9b720e5 100644
    --- a/tests/test_process.py
    +++ b/tests/test_process.py
    @@ -76,6 +76,12 @@ class Test:
             nt.assert_equal(Decimal('0.87'), result['confidence'])
             nt.assert_false(result['accurate'])
     
    +    def test_detect_types_floats_zero_fractional_component(self):
    +        records = it.cycle([{"foo": '0.0'}, {"foo": "1.0"}, {"foo": "10.00"}])
    +        records, result = pr.detect_types(records)
    +
    +        nt.assert_equal(result["types"], [{"id": "foo", "type": "float"}])
    +
         def test_fillempty(self):
             records = [
                 {'a': '1', 'b': '27', 'c': ''},
    

    Fails with:

    AssertionError: Lists differ: [{'id': 'foo', 'type': 'int'}] != [{'id': 'foo', 'type': 'float'}]
    
    First differing element 0:
    {'id': 'foo', 'type': 'int'}
    {'id': 'foo', 'type': 'float'}
    
    - [{'id': 'foo', 'type': 'int'}]
    ?                         ^^
    
    + [{'id': 'foo', 'type': 'float'}]
    ?                         ^^^^
    

    Potential Solutions

    • Prefer stricter type inference for floats by default; e.g. if it has decimal places, it's a float.
    • Allow stricter type inference for floats via an option e.g. a kwarg to detect_types that is passed down to is_int to change the behaviour from "can this be parsed as an int" to "this is definitely an int"
    • Any other ideas of course!

    I am happy to implement the changes required after a decision is made on the correct behaviour :smile:
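
    A rough sketch of the first option (purely illustrative, not meza code): if the raw value contains a decimal separator, treat it as a float regardless of the fractional part.

    def is_strict_float(value, decimal_sep='.'):
        # a decimal separator in the source means float, even when the
        # fractional part is zero ('1.00' -> float; '100' -> not a float)
        try:
            float(value)
        except (TypeError, ValueError):
            return False
        return decimal_sep in str(value)

    assert is_strict_float('0.0') and not is_strict_float('10')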

    enhancement help wanted 
    opened by SteadBytes 2
  • Missing dependency "future" on Python 2.7

    Hi,

    When installing Meza v0.41.1 on Python 2.7, the dependency "future" is not installed (but required).

    To reproduce (for instance on Windows but the problem is the same on Linux):

    D:\Laurent\Projets\virtualenv>C:\Python27\python.exe -m virtualenv meza
    New python executable in D:\Laurent\Projets\virtualenv\meza\Scripts\python.exe
    Installing setuptools, pip, wheel...done.
    
    D:\Laurent\Projets\virtualenv>meza\Scripts\activate
    
    (meza) D:\Laurent\Projets\virtualenv>pip --version
    pip 19.0.1 from d:\laurent\projets\virtualenv\meza\lib\site-packages\pip (python 2.7)
    
    (meza) D:\Laurent\Projets\virtualenv>pip install meza==0.41.1
    [...]
    
    (meza) D:\Laurent\Projets\virtualenv>pip list
    Package                       Version
    ----------------------------- ----------
    backports.functools-lru-cache 1.5
    beautifulsoup4                4.7.1
    certifi                       2018.11.29
    chardet                       3.0.4
    dbfread                       2.0.4
    idna                          2.8
    ijson                         2.3
    meza                          0.41.1
    pip                           19.0.1
    pygogo                        0.12.0
    python-dateutil               2.7.5
    python-slugify                1.2.6
    PyYAML                        3.13
    requests                      2.21.0
    setuptools                    40.7.1
    six                           1.12.0
    soupsieve                     1.7.3
    Unidecode                     1.0.23
    urllib3                       1.24.1
    wheel                         0.32.3
    xlrd                          1.2.0
    

    As you can see, future is missing.

    The problem occurs because the Wheel meta info is not valid.

    If you want to install "future" only for Python 2.7, your requirements should be:

        'future>=0.16.0,<1.0.0; python_version < "3"'
    

    See:

    • The documentation: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
    • StackOverFlow: https://stackoverflow.com/a/32643122/1513933
    bug help wanted py2 
    opened by laurent-laporte-pro 1