meza: A Python toolkit for processing tabular data

Index

Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installation | Project Structure | Design Principles | Scripts | Contributing | Credits | More Info | License

Introduction

meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types.

With meza, you can

  • Read csv/xls/xlsx/mdb/dbf files, and more!
  • Type cast records (date, float, text...)
  • Process Uñicôdë text
  • Lazily stream files by default
  • and much more...

Requirements

meza has been tested and is known to work on Python 3.6, 3.7, and 3.8; and PyPy 3.6.

Optional Dependencies

Function                      Dependency    Installation                   File type / extension
meza.io.read_mdb              mdbtools      sudo port install mdbtools     Microsoft Access / mdb
meza.io.read_html             lxml [1]      pip install lxml               HTML / html
meza.convert.records2array    NumPy [2]     pip install numpy              n/a
meza.convert.records2df       pandas        pip install pandas             n/a

Notes

[1] If lxml isn't present, read_html will fall back to the built-in Python HTML parser.
[2] records2array can be used without NumPy by passing native=True in the function call. This will convert the records into a list of native array.array objects.

Motivation

Why I built meza

pandas is great, but installing it isn't exactly a walk in the park, and it doesn't play nice with PyPy. I designed meza to be a lightweight, easy to install, less featureful alternative to pandas. I also optimized meza for low memory usage, PyPy compatibility, and functional programming best practices.

Why you should use meza

meza provides a number of benefits over, and differences from, similar libraries such as pandas. For a detailed comparison, please check out the FAQ.

Hello World

A simple data processing example is shown below:

First create a simple csv file (in bash)

printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv

Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.

>>> from meza import io, process as pr, convert as cv

>>> # Load the csv file
>>> records = io.read_csv('data.csv')

>>> # `records` are iterators over the rows
>>> row = next(records)
>>> row
{'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}

>>> # Let's replace the first row so as not to lose any data
>>> records = pr.prepend(records, row)

# Guess column types. Note: `detect_types` returns a new `records`
# generator since it consumes rows during type detection
>>> records, result = pr.detect_types(records)
>>> {t['id']: t['type'] for t in result['types']}
{'col1': 'text', 'col2': 'date', 'col3': 'int'}

# Now type cast the records. Note: most `meza.process` functions return
# generators, so let's wrap the result in a list to view the data
>>> casted = list(pr.type_cast(records, result['types']))
>>> casted[0]
{'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}

# Cut out the first column of data and merge the rows to get the max value
# of the remaining columns. Note: since `merge` (by definition) will always
# contain just one row, it is returned as is (not wrapped in a generator)
>>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
>>> merged = pr.merge(cut_recs, pred=bool, op=max)
>>> merged
{'col2': datetime.date(2015, 1, 1), 'col3': 3}

# Now write merged data back to a new csv file.
>>> io.write('out.csv', cv.records2csv(merged))

# View the result
>>> with open('out.csv', encoding='utf-8') as f:
...     f.read()
'col2,col3\n2015-01-01,3\n'

Usage

meza is intended to be used directly as a Python library.

Usage Index

Reading data | Processing data | Writing data | Cookbook

Reading data

meza can read both filepaths and file-like objects. Additionally, all readers return equivalent records iterators, i.e., a generator of dictionaries with keys corresponding to the column names.

>>> from io import StringIO
>>> from meza import io

"""Read a filepath"""
>>> records = io.read_json('path/to/file.json')

"""Read a file like object and de-duplicate the header"""
>>> f = StringIO('col,col\nhello,world\n')
>>> records = io.read_csv(f, dedupe=True)

"""View the first row"""
>>> next(records)
{'col': 'hello', 'col_2': 'world'}

"""Read the 1st sheet of an xls file object opened in text mode."""
# Also, sanitize the header names by converting them to lowercase and
# replacing whitespace and invalid characters with `_`.
>>> with open('path/to/file.xls', encoding='utf-8') as f:
...     for row in io.read_xls(f, sanitize=True):
...         # do something with the `row`
...         pass

"""Read the 2nd sheet of an xlsx file object opened in binary mode"""
# Note: sheets are zero indexed
>>> with open('path/to/file.xlsx', 'rb') as f:
...     records = io.read_xls(f, encoding='utf-8', sheet=1)
...     first_row = next(records)
...     # do something with the `first_row`

"""Read any recognized file"""
>>> records = io.read('path/to/file.geojson')
>>> f.seek(0)
>>> records = io.read(f, ext='csv', dedupe=True)

Please see readers for a complete list of available readers and recognized file types.
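
For example, reading a table from a Microsoft Access database looks just like reading a csv. A minimal sketch (it assumes mdbtools is installed, per the Optional Dependencies table; the path and table name below are placeholders):

>>> from meza import io

>>> # `read_mdb` takes the filepath plus an optional table name
>>> records = io.read_mdb('path/to/file.mdb', table='table_name')
>>> # like every other reader, it returns a lazy records generator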

Processing data

Numerical analysis (à la pandas) [3]

In the following example, pandas equivalent methods are preceded by -->.

>>> import itertools as it
>>> import random

>>> from io import StringIO
>>> from meza import io, process as pr, convert as cv, stats

# Create some data in the same structure as what the various `read...`
# functions output
>>> header = ['A', 'B', 'C', 'D']
>>> data = [(random.random() for _ in range(4)) for x in range(7)]
>>> df = [dict(zip(header, d)) for d in data]
>>> df[0]
{'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}

"""Sort records by the value of column `B` --> df.sort_values(by='B')"""
>>> next(pr.sort(df, 'B'))
{'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}

"""Select column `A` --> df['A']"""
>>> next(pr.cut(df, ['A']))
{'A': 0.53908170489952006}

"""Select the first three rows of data --> df[0:3]"""
>>> len(list(it.islice(df, 3)))
3

"""Select all data whose value for column `A` is less than 0.5
--> df[df.A < 0.5]
"""
>>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
{'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}

# Note: since `aggregate` and `merge` (by definition) return just one row,
# they return them as is (not wrapped in a generator).
"""Calculate the mean of column `A` across all data --> df.mean()['A']"""
>>> pr.aggregate(df, 'A', stats.mean)['A']
0.5410437473067938

"""Calculate the sum of each column across all data --> df.sum()"""
>>> pr.merge(df, pred=bool, op=sum)
{'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}

Text processing (à la csvkit) [4]

In the following example, csvkit equivalent commands are preceded by -->.

First create a few simple csv files (in bash)

printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv

Now we can read the files, manipulate the data, convert the manipulated data to json, and write the json back to a new file. Also, note that since all readers return equivalent records iterators, you can use them interchangeably (in place of read_csv) to open any supported file. E.g., read_xls, read_sqlite, etc.

>>> import itertools as it

>>> from meza import io, process as pr, convert as cv

"""Combine the files into one iterator
--> csvstack file1.csv file2.csv
"""
>>> records = io.join('file1.csv', 'file2.csv')
>>> next(records)
{'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
>>> next(it.islice(records, 4, None))
{'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}

# Now let's create a persistent records list
>>> records = list(io.read_csv('file1.csv'))

"""Sort records by the value of column `col_2`
--> csvsort -c col_2 file1.csv
"""
>>> next(pr.sort(records, 'col_2'))
{'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}

"""Select column `col_2` --> csvcut -c col_2 file1.csv"""
>>> next(pr.cut(records, ['col_2']))
{'col_2': 'dill'}

"""Select all data whose value for column `col_2` contains `jan`
--> csvgrep -c col_2 -m jan file1.csv
"""
>>> next(pr.grep(records, [{'pattern': 'jan'}], ['col_2']))
{'col_1': '3', 'col_2': 'jane', 'col_3': 'female'}

"""Convert a csv file to json --> csvjson -i 4 file1.csv"""
>>> io.write('file.json', cv.records2json(records))

# View the result
>>> with open('file.json', encoding='utf-8') as f:
...     f.read()
'[{"col_1": "1", "col_2": "dill", "col_3": "male"}, {"col_1": "2",
"col_2": "bob", "col_3": "male"}, {"col_1": "3", "col_2": "jane",
"col_3": "female"}]'

Geo processing (à la mapbox) [5]

In the following example, mapbox equivalent commands are preceded by -->.

First create a geojson file (in bash)

echo '{"type": "FeatureCollection","features": [' > file.geojson
echo '{"type": "Feature", "id": 11, "geometry": {"type": "Point", "coordinates": [10, 20]}},' >> file.geojson
echo '{"type": "Feature", "id": 12, "geometry": {"type": "Point", "coordinates": [5, 15]}}]}' >> file.geojson

Now we can open the file, split the data by id, and finally convert the split data to a new geojson file-like object.

>>> from meza import io, process as pr, convert as cv

# Load the geojson file and peek at the results
>>> records, peek = pr.peek(io.read_geojson('file.geojson'))
>>> peek[0]
{'lat': 20, 'type': 'Point', 'lon': 10, 'id': 11}

"""Split the records by feature ``id`` and select the first feature
--> geojsplit -k id file.geojson
"""
>>> splits = pr.split(records, 'id')
>>> feature_records, name = next(splits)
>>> name
11

"""Convert the feature records into a GeoJSON file-like object"""
>>> geojson = cv.records2geojson(feature_records)
>>> geojson.readline()
'{"type": "FeatureCollection", "bbox": [10, 20, 10, 20], "features": '
'[{"type": "Feature", "id": 11, "geometry": {"type": "Point", '
'"coordinates": [10, 20]}, "properties": {"id": 11}}], "crs": {"type": '
'"name", "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}}}'

# Note: you can also write back to a file as shown previously
# io.write('file.geojson', geojson)

Writing data

meza can persist records to disk via the following functions:

  • meza.convert.records2csv
  • meza.convert.records2json
  • meza.convert.records2geojson

Each function returns a file-like object that you can write to disk via meza.io.write('/path/to/file', result).

>>> from meza import io, convert as cv
>>> from io import StringIO

# First let's create a simple tsv file like object
>>> f = StringIO('col1\tcol2\nhello\tworld\n')
>>> f.seek(0)

# Next create a records list so we can reuse it
>>> records = list(io.read_tsv(f))
>>> records[0]
{'col1': 'hello', 'col2': 'world'}

# Now we're ready to write the records data to file

"""Create a csv file like object"""
>>> cv.records2csv(records).readline()
'col1,col2\n'

"""Create a json file like object"""
>>> cv.records2json(records).readline()
'[{"col1": "hello", "col2": "world"}]'

"""Write back csv to a filepath"""
>>> io.write('file.csv', cv.records2csv(records))
>>> with open('file.csv', encoding='utf-8') as f_in:
...     f_in.read()
'col1,col2\nhello,world\n'

"""Write back json to a filepath"""
>>> io.write('file.json', cv.records2json(records))
>>> with open('file.json', encoding='utf-8') as f_in:
...     f_in.readline()
'[{"col1": "hello", "col2": "world"}]'

Cookbook

Please see the cookbook or ipython notebook for more examples.

Notes

[3] http://pandas.pydata.org/pandas-docs/stable/10min.html#min
[4] https://csvkit.readthedocs.org/en/0.9.1/cli.html#processing
[5] https://github.com/mapbox?utf8=%E2%9C%93&query=geojson

Interoperability

meza plays nicely with NumPy and friends out of the box

Setup

>>> from meza import process as pr

# First create some records and types. Also, convert the records to a list
# so we can reuse them.
>>> records = [{'a': 'one', 'b': 2}, {'a': 'five', 'b': 10, 'c': 20.1}]
>>> records, result = pr.detect_types(records)
>>> records, types = list(records), result['types']
>>> types
[{'type': 'text', 'id': 'a'},
 {'type': 'int', 'id': 'b'},
 {'type': 'float', 'id': 'c'}]

from records to pandas.DataFrame to records

>>> import pandas as pd
>>> from meza import convert as cv

"""Convert the records to a DataFrame"""
>>> df = cv.records2df(records, types)
>>> df
        a   b   c
0   one   2   NaN
1  five  10  20.1
# Alternatively, you can do `pd.DataFrame(records)`

"""Convert the DataFrame back to records"""
>>> next(cv.df2records(df))
{'a': 'one', 'b': 2, 'c': nan}

from records to arrays to records

>>> import numpy as np

>>> from array import array
>>> from meza import convert as cv

"""Convert records to a structured array"""
>>> recarray = cv.records2array(records, types)
>>> recarray
rec.array([('one', 2, nan), ('five', 10, 20.100000381469727)],
          dtype=[('a', 'O'), ('b', '<i4'), ('c', '<f4')])
>>> recarray.b
array([ 2, 10], dtype=int32)

"""Convert records to a native array"""
>>> narray = cv.records2array(records, types, native=True)
>>> narray
[[array('u', 'a'), array('u', 'b'), array('u', 'c')],
[array('u', 'one'), array('u', 'five')],
array('i', [2, 10]),
array('f', [0.0, 20.100000381469727])]

"""Convert a 2-D NumPy array to a records generator"""
>>> data = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
>>> data
array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)
>>> next(cv.array2records(data))
{'column_1': 1, 'column_2': 2, 'column_3': 3}

"""Convert the structured array back to a records generator"""
>>> next(cv.array2records(recarray))
{'a': 'one', 'b': 2, 'c': nan}

"""Convert the native array back to records generator"""
>>> next(cv.array2records(narray, native=True))
{'a': 'one', 'b': 2, 'c': 0.0}

Installation

(You are using a virtualenv, right?)

At the command line, install meza using either pip (recommended)

pip install meza

or easy_install

easy_install meza

Please see the installation doc for more details.

Project Structure

├── CONTRIBUTING.rst
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.rst
├── data
│   ├── converted/*
│   └── test/*
├── dev-requirements.txt
├── docs
│   ├── AUTHORS.rst
│   ├── CHANGES.rst
│   ├── COOKBOOK.rst
│   ├── FAQ.rst
│   ├── INSTALLATION.rst
│   └── TODO.rst
├── examples
│   ├── usage.ipynb
│   └── usage.py
├── helpers/*
├── manage.py
├── meza
│   ├── __init__.py
│   ├── convert.py
│   ├── dbf.py
│   ├── fntools.py
│   ├── io.py
│   ├── process.py
│   ├── stats.py
│   ├── typetools.py
│   └── unicsv.py
├── optional-requirements.txt
├── py2-requirements.txt
├── requirements.txt
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   ├── standard.rc
│   ├── test_fntools.py
│   ├── test_io.py
│   └── test_process.py
└── tox.ini

Design Principles

  • prefer functions over objects
  • provide enough functionality out of the box to easily implement the most common data analysis use cases
  • make conversion between records, arrays, and DataFrames dead simple
  • whenever possible, lazily read objects and stream the result [6]

[6] Notable exceptions are meza.process.group, meza.process.sort, meza.io.read_dbf, meza.io.read_yaml, and meza.io.read_html. These functions read the entire contents into memory up front.
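
As a quick illustration of the last point, a reader doesn't touch the file until you actually pull rows from it (a minimal sketch):

>>> from meza import io

>>> records = io.read_csv('data.csv')  # nothing is read yet
>>> row = next(records)                # the file is only read on demand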

Scripts

meza comes with a built-in task manager, manage.py

Setup

pip install -r dev-requirements.txt

Examples

Run the Python linter and nose tests

manage lint
manage test

Contributing

Please mimic the coding style/conventions used in this repo. If you add new classes or functions, please add the appropriate doc blocks with examples. Also, make sure the python linter and nose tests pass.

Please see the contributing doc for more details.

Credits

Shoutouts to csvkit, messytables, and pandas for heavily inspiring meza.

More Info

License

meza is distributed under the MIT License.

Comments
  • At the end of the generator, catch StopIteration and return None

    When looping through the generator to get the rows/records, or when converting the records to an array/dataframe, I am getting RuntimeError: generator raised StopIteration. This seems to be a Python 3.7 issue related to PEP 479.

    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/meza/io.py", line 664, in read_mdb
        values = next(csv.reader(next_line, **kwargs))
    StopIteration
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "main.py", line 7, in <module>
        arr = convert.records2array(records, result["types"])
      File "/usr/local/lib/python3.7/site-packages/meza/convert.py", line 710, in records2array
        data = [tuple(r.get(id_) for id_ in ids) for r in records]
      File "/usr/local/lib/python3.7/site-packages/meza/convert.py", line 710, in <listcomp>
        data = [tuple(r.get(id_) for id_ in ids) for r in records]
    RuntimeError: generator raised StopIteration
    

    To fix this issue, when StopIteration is raised in read_mdb(), catch the exception and return None instead of having StopIteration transformed into a RuntimeError.
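
    A sketch of the PEP 479-compliant pattern (illustrative, not the actual read_mdb source): inside a generator, catch StopIteration from the inner next() call and return instead of letting it propagate.

    def read_rows(reader):
        # hypothetical generator showing the fix: `return` ends the
        # generator cleanly instead of raising RuntimeError under PEP 479
        while True:
            try:
                row = next(reader)  # raises StopIteration when exhausted
            except StopIteration:
                return
            yield row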

    opened by christian-ensodata 5
  • Datetimes with all-zero time component (exact midnight) are detected as dates

    Datetime values with all-zero time components are detected as dates by process.detect_types. This is because typetools.is_datetime explicitly checks that the time component is not '00:00:00' https://github.com/reubano/meza/blob/a56b927ba5264e450c8358a1cf5218402a98b6e8/meza/typetools.py#L295 . I'd argue that if a time value is present at all then the value should be treated as a datetime as this likely better represents the intent of the source. For example, a database report may include a datetime column but all the values in a particular output happened to occur at midnight. The down-casting behaviour is similar to that of floats https://github.com/reubano/meza/issues/34.

    Example Test Case

    diff --git a/tests/test_process.py b/tests/test_process.py
    index cc538a2..f90556d 100644
    --- a/tests/test_process.py
    +++ b/tests/test_process.py
    @@ -76,6 +76,12 @@ class Test:
             nt.assert_equal(Decimal('0.87'), result['confidence'])
             nt.assert_false(result['accurate'])
     
    +    def test_detect_types_datetimes_midnight(self):
    +        records = it.repeat({"foo": "2000-01-01 00:00:00"})
    +        records, result = pr.detect_types(records)
    +
    +        nt.assert_equal(result["types"], [{"id": "foo", "type": "datetime"}])
    +
         def test_fillempty(self):
             records = [
                 {'a': '1', 'b': '27', 'c': ''},
    

    Fails with:

    AssertionError: Lists differ: [{'id': 'foo', 'type': 'date'}] != [{'id': 'foo', 'type': 'datetime'}]
    
    First differing element 0:
    {'id': 'foo', 'type': 'date'}
    {'id': 'foo', 'type': 'datetime'}
    
    - [{'id': 'foo', 'type': 'date'}]
    + [{'id': 'foo', 'type': 'datetime'}]
    ?                             ++++
    

    Making the time part non-zero will pass the test.

    Potential Solutions

    • Prefer stricter type inference for datetimes by default; e.g. if it has both a date and a time field, it's a datetime.
    • Allow stricter type inference for datetimes as an option e.g. a kwarg to detect_types that is passed down to is_datetime to change the behaviour from "can this only be parsed as a datetime" to "this is a datetime"
    • Any other ideas of course!

    I am happy to implement the changes required after a decision is made on the correct behaviour :smile:
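
    A rough sketch of the first option (purely illustrative, not meza code): treat any value that parses with an explicit time field as a datetime, even at exact midnight.

    from datetime import datetime

    def is_strict_datetime(value, fmt='%Y-%m-%d %H:%M:%S'):
        # if the value carries an explicit time component, call it a
        # datetime -- even when that time happens to be 00:00:00
        try:
            datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            return False
        return True

    assert is_strict_datetime('2000-01-01 00:00:00')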

    bug 
    opened by SteadBytes 5
  • type casting assumes 'month first' for ambiguous dates

    Right now, type detection infers date, datetime, or time types without taking into account that 01/02/2002 can be either February the 1st or January the 2nd, depending on whether the date format is DD/MM/YYYY or MM/DD/YYYY.

    This might be undecidable in some rare cases, but in general it's possible given enough values to decide between both formats.

    One possible way to handle this in meza is to replace the current string representation of a field's type with a higher-level datatype that can carry extra information. For instance:

    datetime_type = namedtuple('DateTimeType', ['format'])
    

    Basically, use a representation that takes optional extra information about the type.
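
    Under that scheme, a detected column type could carry its format along with it, e.g. (illustrative only):

    from collections import namedtuple

    DateTimeType = namedtuple('DateTimeType', ['format'])

    # the format disambiguates values like 01/02/2002
    col_type = DateTimeType(format='DD/MM/YYYY')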

    bug 
    opened by amirouche 3
  • requests >=2.10.0?

    Hello. I'm trying to install meza with pip-tools, but am getting a dependency constraint resolution error between this package and the awsebcli package. meza calls for requests >= 2.10.0, and awsebcli calls for <=2.9.1.

    Is there a specific reason for meza to require >= 2.10.0? Could this requirement be loosened at all, since requests is pretty stable?

    Thanks!

    opened by theunraveler 3
  • Config file for pyup.io

    Hi there and thanks for using pyup.io!

    Since you are using a non-default config I've created one for you.

    There are a lot of things you can configure on top of that, so make sure to check out the docs to see what I can do for you.

    opened by pyup-bot 3
  • Fix setup fails

    Fixes #19. This commit fixes a setup.py error: "error in meza setup command: 'tests_require' must be a string or list of strings containing valid project/version requirement specifiers; Unordered types are not allowed".

    opened by adhaamehab 2
  • In py3.7+, read_mdb should not use StopIteration

    Here is a simple example:

    from meza.io import read_mdb
    import pandas as pd
    process_data = read_mdb(file_path, "ProcessData")
    process_data_df = pd.DataFrame(process_data)
    process_data.close()
    

    Before py3.7, this works OK. On py3.7+, it throws "RuntimeError: generator raised StopIteration". This is due to PEP 479 (https://peps.python.org/pep-0479/), which changed how StopIteration is handled inside generators.

    I tried this with v 0.45.5, which should be py3.7 compatible, and still see the error.

    opened by agileminor 1
  • ValueError converting zero-value currencies

    Type detection raises a ValueError for currencies with a value of zero, e.g. '$0' or '0$'.

    >>> import itertools as it
    >>> from meza import process as pr
    >>> 
    >>> records = it.repeat({"money": "$0"})
    >>> records, result = pr.detect_types(records)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/meza/meza/process.py", line 333, in detect_types
        for t in tt.guess_type_by_value(record):
      File "/meza/meza/typetools.py", line 172, in guess_type_by_value
        result = type_test(g['func'], g['type'], key, value)
      File "/meza/meza/typetools.py", line 33, in type_test
        passed = test(value)
      File "//meza/meza/fntools.py", line 509, in is_int
        passed = is_numeric(content, thousand_sep, decimal_sep)
      File "/meza/meza/fntools.py", line 489, in is_numeric
        passed = int(content) == 0
    ValueError: invalid literal for int() with base 10: '$0'
    

    This is caused by is_numeric casting the original, unstripped content to an int: https://github.com/reubano/meza/blob/110f855fa95bcd9665018358059d9df25de1dedf/meza/fntools.py#L489

    As far as I can tell, this should only be an issue when the value starts with 0 and the only non-numeric characters are currency symbols. Here is a failing test case for this:

    diff --git a/tests/test_fntools.py b/tests/test_fntools.py
    index 922bc17..f8cdc75 100644
    --- a/tests/test_fntools.py
    +++ b/tests/test_fntools.py
    @@ -45,6 +45,11 @@ class TestIterStringIO:
             nt.assert_false(ft.is_numeric(None))
             nt.assert_false(ft.is_numeric(''))
     
    +    def test_is_numeric_0_currency(self):
    +        for sym in ft.CURRENCIES:
    +            nt.assert_true(ft.is_numeric(f'0{sym}'))
    +            nt.assert_true(ft.is_numeric(f'{sym}0'))
    +
         def test_is_int(self):
             nt.assert_false(ft.is_int('5/4/82'))
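
    One way the check could be fixed (a sketch under the assumption that stripping currency symbols is enough; not an actual patch) is to remove the symbols before the integer cast:

    def is_zero_numeric(content, currencies='$£€¥'):
        # hypothetical helper: drop leading/trailing currency symbols
        # before attempting the `int(content) == 0` comparison
        stripped = content.strip().strip(currencies)
        try:
            return int(stripped) == 0
        except ValueError:
            return False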
    
    opened by SteadBytes 1
  • Unexpected warnings from iterators

    Hello guys,

    I'm getting a plethora of warnings from meza; they look like this:

    /home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:342: DeprecationWarning: generator 'gen_types' raised StopIteration
      types = list(gen_types(tally))
    /home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:335: DeprecationWarning: generator 'guess_type_by_value' raised StopIteration
      for t in tt.guess_type_by_value(record):

    The version is 0.41.1. I checked the code and it doesn't actually raise StopIteration, so I have no idea where the warning comes from. Any insight?

    can't reproduce 
    opened by chobeat 1
  • Writing using different dialects

    The documentation provides no examples on how to do this and I cannot find tests that cover this feature. From the code I understand more or less how it should work but I'm not sure. Is there some example somewhere?

    opened by chobeat 1
  • Added quiet bool param for mdb_open to suppress output of table names

    The first execution of the next(mdb) function generates output to STDOUT as a side effect of the subprocess.check_call() call. I changed this to a subprocess.check_output() call and just discard the output. I think we're really only looking for the error status.

    I had to reorder some of the other lines to support the quiet boolean parameter.

    opened by petonic 1
  • test failure in test_excel_html_export with io.read_html

    Testing meza-0.46.0, I get this error (py38-py310):

    Test for reading an html table exported from excel ... FAIL
    
    ======================================================================
    FAIL: Test for reading an html table exported from excel
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/sw/lib/python3.9/site-packages/nose/case.py", line 197, in runTest
        self.test(*self.arg)
      File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/tests/test_io.py", line 354, in test_excel_html_export
        nt.assert_equal(expected, next(records))
    AssertionError: {'sparse_data': 'Iñtërnâtiônàližætiøn', 'so[61 chars]dam'} != {'13_width_75_some_date': '13 class=xl24 al[123 chars]dam'}
    - {'some_date': '05/04/82',
    + {'13_width_75_some_date': '13 class=xl24 align=right>05/04/82',
    +  '2_width_150_unicode_test': 'Ādam',
    -  'some_value': '234',
    +  '75_some_value': 'right>234',
    ?   +++              ++++++
    
    -  'sparse_data': 'Iñtërnâtiônàližætiøn',
    ?                                       ^
    
    +  '75_sparse_data': 'Iñtërnâtiônàližætiøn'}
    ?   +++                                    ^
    
    -  'unicode_test': 'Ādam'}
    
    ----------------------------------------------------------------------
    

    The output in the AssertionError line seems all mangled with the attributes from the different html table elements sprinkled in. If I remove the html attributes for the table in data/test/test.htm, then the test passes. I notice that io.read_html uses BeautifulSoup. I have beautifulsoup-4.10.0 and soupsieve-2.3.1 installed.

    opened by nieder 0
  • io.py incompatible with PyYAML-6.0

    When building meza, this test failure happens:

    Doctest: meza.io.read_yaml ... FAIL
    
    ======================================================================
    FAIL: Doctest: meza.io.read_yaml
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/sw/lib/python3.9/doctest.py", line 2205, in runTest
        raise self.failureException(self.format_failure(new.getvalue()))
    AssertionError: Failed doctest test for meza.io.read_yaml
      File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 1256, in read_yaml
    
    ----------------------------------------------------------------------
    File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 1279, in meza.io.read_yaml
    Failed example:
        next(records) == {
            'text': 'Chicago Reader',
            'float': 1.0,
            'datetime': dt(1971, 1, 1, 4, 14),
            'boolean': True,
            'time': '04:14:00',
            'date': date(1971, 1, 1),
            'integer': 40}
    Exception raised:
        Traceback (most recent call last):
          File "/sw/lib/python3.9/doctest.py", line 1334, in __run
            exec(compile(example.source, filename, "single",
          File "<doctest meza.io.read_yaml[3]>", line 1, in <module>
            next(records) == {
          File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 551, in read_any
            for line in _read_any(f, reader, args, **kwargs):
          File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 470, in _read_any
            for num, line in enumerate(reader(f, *args, **kwargs)):
        TypeError: load() missing 1 required positional argument: 'Loader'
    

    In https://github.com/reubano/meza/blob/370c292e1e07e738006c2f57a9ff7399e775df44/meza/io.py#L1289 , changing yaml.load to yaml.safe_load takes care of the problem (not sure what version yaml.safe_load was introduced). See yaml/pyyaml#576
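
    The suggested change in context (a sketch of the relevant call, not the full read_yaml source):

    import yaml

    with open('file.yaml') as f:
        # yaml.load(f) fails on PyYAML 6.0, which made the Loader
        # argument required; safe_load supplies SafeLoader and works
        # on both old and new PyYAML
        data = yaml.safe_load(f)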

    opened by nieder 0
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position... while reading Microsoft Access .mdb file

    I am trying to read a Microsoft Access [.mdb] file (created by ChemFinder on Windows), but I am getting a UnicodeDecodeError: 'utf-8' codec... error, despite specifying the encoding recovered by meza.io.get_encoding(), which is TIS-620.

    I would appreciate any suggestions...

    Details below:

    import meza
    fn = 'test.mdb'
    enc = meza.io.get_encoding(fn)
    print(enc)  # TIS-620
    records = meza.io.read_mdb(fn, encoding=enc)
    z = list(records)
    ~/anaconda3/lib/python3.8/site-packages/meza/io.py in read_mdb(filepath, table, **kwargs)
        636     # https://stackoverflow.com/a/17698359/408556
        637     with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
    --> 638         first_line = StringIO(str(pipe.readline()))
        639         names = next(csv.reader(first_line, **kwargs))
        640         uscored = ft.underscorify(names) if sanitize else names
    
    ~/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
        320         # decode input (taking the buffer into account)
        321         data = self.buffer + input
    --> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
        323         # keep undecoded input until the next call
        324         self.buffer = data[consumed:]
    
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 70: invalid start byte
    

    I tried to pass encoding to pkwargs used in Popen (in meza/io.py)

        pkwargs = {'stdout': PIPE, 'bufsize': 1, 'universal_newlines': True}
    --> pkwargs['encoding'] = kwargs.get('encoding', None)
    
        # https://stackoverflow.com/a/2813530/408556
        # https://stackoverflow.com/a/17698359/408556
        with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
    

    but it does not resolve the issue. With this modification I am getting:

    UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 327: character maps to <undefined>
    
    bug can't reproduce 
    opened by ryszard314159 1
  • Getting error with large mdb file

    When I execute the following in Python 3 on Ubuntu (it works fine with a small MDB file containing one table):

    records = io.read_mdb(db_file_path) # only file path, no file objects
    next(records)
    
    read: Is a directory
    Couldn't read first page.
    Couldn't open database.
    
    
    bug can't reproduce 
    opened by hemanth-sp 1
  • Allow `process.detect_types` to match last type instead of the first

    Float values with a zero fractional component, e.g. '0.0', '1.0', '1.00', are detected as int instead of float. This is because they can be parsed as int according to fntools.is_int. Although the data could be interpreted as integers, given that the source has a decimal place I would argue that detect_types should not down-cast to an integer. For example, database reports may include data from float/decimal columns which, just by chance, have no fractional component; this doesn't mean they should not be treated as floats.

    Example Test Case

    diff --git a/tests/test_process.py b/tests/test_process.py
    index cc538a2..9b720e5 100644
    --- a/tests/test_process.py
    +++ b/tests/test_process.py
    @@ -76,6 +76,12 @@ class Test:
             nt.assert_equal(Decimal('0.87'), result['confidence'])
             nt.assert_false(result['accurate'])
     
    +    def test_detect_types_floats_zero_fractional_component(self):
    +        records = it.cycle([{"foo": '0.0'}, {"foo": "1.0"}, {"foo": "10.00"}])
    +        records, result = pr.detect_types(records)
    +
    +        nt.assert_equal(result["types"], [{"id": "foo", "type": "float"}])
    +
         def test_fillempty(self):
             records = [
                 {'a': '1', 'b': '27', 'c': ''},
    

    Fails with:

    AssertionError: Lists differ: [{'id': 'foo', 'type': 'int'}] != [{'id': 'foo', 'type': 'float'}]
    
    First differing element 0:
    {'id': 'foo', 'type': 'int'}
    {'id': 'foo', 'type': 'float'}
    
    - [{'id': 'foo', 'type': 'int'}]
    ?                         ^^
    
    + [{'id': 'foo', 'type': 'float'}]
    ?                         ^^^^
    

    Potential Solutions

    • Prefer stricter type inference for floats by default; e.g. if it has decimal places, it's a float.
    • Allow stricter type inference for floats via an option e.g. a kwarg to detect_types that is passed down to is_int to change the behaviour from "can this be parsed as an int" to "this is definitely an int"
    • Any other ideas of course!

    I am happy to implement the changes required after a decision is made on the correct behaviour :smile:
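
    A rough sketch of the first option (purely illustrative, not meza code): if the raw value contains a decimal separator, treat it as a float regardless of the fractional part.

    def is_strict_float(value, decimal_sep='.'):
        # a decimal separator in the source means float, even when the
        # fractional part is zero ('1.00' -> float; '100' -> not a float)
        try:
            float(value)
        except (TypeError, ValueError):
            return False
        return decimal_sep in str(value)

    assert is_strict_float('0.0') and not is_strict_float('10')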

    enhancement help wanted 
    opened by SteadBytes 2
  • Missing dependency "future" on Python 2.7

    Hi,

    When installing Meza v0.41.1 on Python 2.7, the dependency "future" is not installed (but required).

    To reproduce (for instance on Windows but the problem is the same on Linux):

    D:\Laurent\Projets\virtualenv>C:\Python27\python.exe -m virtualenv meza
    New python executable in D:\Laurent\Projets\virtualenv\meza\Scripts\python.exe
    Installing setuptools, pip, wheel...done.
    
    D:\Laurent\Projets\virtualenv>meza\Scripts\activate
    
    (meza) D:\Laurent\Projets\virtualenv>pip --version
    pip 19.0.1 from d:\laurent\projets\virtualenv\meza\lib\site-packages\pip (python 2.7)
    
    (meza) D:\Laurent\Projets\virtualenv>pip install meza==0.41.1
    [...]
    
    (meza) D:\Laurent\Projets\virtualenv>pip list
    Package                       Version
    ----------------------------- ----------
    backports.functools-lru-cache 1.5
    beautifulsoup4                4.7.1
    certifi                       2018.11.29
    chardet                       3.0.4
    dbfread                       2.0.4
    idna                          2.8
    ijson                         2.3
    meza                          0.41.1
    pip                           19.0.1
    pygogo                        0.12.0
    python-dateutil               2.7.5
    python-slugify                1.2.6
    PyYAML                        3.13
    requests                      2.21.0
    setuptools                    40.7.1
    six                           1.12.0
    soupsieve                     1.7.3
    Unidecode                     1.0.23
    urllib3                       1.24.1
    wheel                         0.32.3
    xlrd                          1.2.0
    

    As you can see, future is missing.

    The problem occurs because the Wheel meta info is not valid.

    If you want to install "future" only for Python 2.7, your requirements should be:

        'future>=0.16.0,<1.0.0; python_version < "3"'
    

    See:

    • The documentation: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
    • StackOverFlow: https://stackoverflow.com/a/32643122/1513933
    bug help wanted py2 
    opened by laurent-laporte-pro 1