Pandas and Dask test helper methods with beautiful error messages.

Matthew Powers

Last update: Nov 28, 2022


Overview

beavis


test helpers

These test helper methods are meant to be used in test suites. They provide descriptive error messages to allow for a seamless development workflow.

The test helpers are inspired by chispa and spark-fast-tests, popular test helper libraries for the Spark ecosystem.

Pandas also ships built-in testing methods, but their error messages are harder to parse. The following sections compare the built-in Pandas output with the Beavis output, so you can choose for yourself.

Column comparisons

The built-in assert_series_equal method does not make it easy to decipher the rows that are equal and the rows that are different, so quickly fixing your tests and maintaining flow is hard.

Here's the built-in error message when comparing series that are not equal.

import pandas as pd

df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])

>   ???
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [index]: [0, 1, 2, 3]
E   [left]:  [1042, 2, 9, 6]
E   [right]: [5, 2, 7, 6]

Here's the beavis error message that aligns rows and highlights the mismatches in red.

import beavis

beavis.assert_pd_column_equality(df, "col1", "col2")

You can also compare columns in a Dask DataFrame.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
beavis.assert_dd_column_equality(ddf, "col1", "col2")

The assert_dd_column_equality error message is similarly descriptive.
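To make the idea of a row-aligned comparison concrete, here is a minimal sketch written with plain Pandas. It is not the beavis implementation: `find_column_mismatches` and `assert_columns_equal` are hypothetical helper names invented for this illustration, and the message format is an assumption.

```python
import pandas as pd


def find_column_mismatches(df: pd.DataFrame, col1: str, col2: str) -> pd.DataFrame:
    """Return only the rows where df[col1] and df[col2] differ (hypothetical helper)."""
    left, right = df[col1], df[col2]
    # Treat two missing values in the same row as equal; a plain != would flag them.
    differs = (left != right) & ~(left.isna() & right.isna())
    return df.loc[differs, [col1, col2]]


def assert_columns_equal(df: pd.DataFrame, col1: str, col2: str) -> None:
    """Raise an AssertionError that lists only the mismatched rows (hypothetical helper)."""
    mismatches = find_column_mismatches(df, col1, col2)
    if not mismatches.empty:
        raise AssertionError(
            f"{len(mismatches)} of {len(df)} rows differ between "
            f"{col1!r} and {col2!r}:\n{mismatches.to_string()}"
        )


df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
mismatches = find_column_mismatches(df, "col1", "col2")
print(mismatches)
```

Reporting only the differing rows, with both values side by side, is what makes a failure quick to read. The sketch does not attempt the colored highlighting described above.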

DataFrame comparisons

The built-in pandas.testing.assert_frame_equal method doesn't output an error message that's easy to understand. See this example.

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
pd.testing.assert_frame_equal(df1, df2)

E   AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
E
E   DataFrame.iloc[:, 0] (column name="col1") values are different (50.0 %)
E   [index]: [0, 1]
E   [left]:  [1, 2]
E   [right]: [5, 2]

beavis provides a nicer error message.

beavis.assert_pd_equality(df1, df2)

DataFrame comparison options:

check_index (default True)
check_dtype (default True)
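The sketch below illustrates what relaxing a dtype check or an index check means in practice. It uses only Pandas' own `pd.testing.assert_frame_equal`, not beavis. The assumption is that the beavis options behave analogously, which the text above does not spell out, and `reset_index(drop=True)` stands in here for ignoring the index.

```python
import pandas as pd

ints = pd.DataFrame({"col1": [1, 2]})
floats = pd.DataFrame({"col1": [1.0, 2.0]})

# Same values, different dtypes (int64 vs float64): the strict check fails.
try:
    pd.testing.assert_frame_equal(ints, floats)
    strict_dtype_passed = True
except AssertionError:
    strict_dtype_passed = False

# Relaxing the dtype check lets the same comparison pass.
pd.testing.assert_frame_equal(ints, floats, check_dtype=False)

# Same values, different index labels: compare by position after resetting the index.
shifted = pd.DataFrame({"col1": [1, 2]}, index=[10, 11])
pd.testing.assert_frame_equal(
    ints.reset_index(drop=True), shifted.reset_index(drop=True)
)
```

Strict defaults catch accidental dtype or index drift. Relaxing them is useful when a test only cares about the values.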

Let's convert the Pandas DataFrames to Dask DataFrames and use the assert_dd_equality function to check they're equal.

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
beavis.assert_dd_equality(ddf1, ddf2)

These DataFrames aren't equal, so we'll get a good error message that's easy to debug.

Development

Install Poetry and run poetry install to create a virtual environment with all the Beavis dependencies on your machine.

Other useful commands:

poetry run pytest tests runs the test suite
poetry run black . formats the code
poetry build packages the library in a wheel file
poetry publish releases the library to PyPI (requires the correct credentials)

Comments

Add a license file

So that other developers and companies can use this nice tool in a compliant way, it would be great if you could add a license to this repository.

How to add a license to a project on GitHub: https://docs.github.com/en/communities/setting-up-your-project-for-healthy-contributions/adding-a-license-to-a-repository

Helper to choose the license for your use-case: https://choosealicense.com/

opened by StegSchreck 3

Remove the Dask dependency from this project

I'd like to make this a pandas-specific project. The Dask dependency & functionality can be abstracted to a separate lib. Let me know if there are any objections.

opened by MrPowers 0

Support for older versions of python 3

Hi, really nice project :) Is it possible to add support for older versions of Python, e.g. 3.7 (PyPI package)? Btw, I didn't find information on which versions of pandas are supported.

opened by mglowacki100 0



