Pandas and Spark DataFrame comparison for humans

Capital One

Last update: Dec 24, 2022

Overview

DataComPy

DataComPy is a package to compare two Pandas DataFrames. It started as something of a replacement for SAS's PROC COMPARE for Pandas DataFrames, with more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame): it prints out some stats and lets you tweak how accurate matches have to be. It was later extended to carry that functionality over to Spark DataFrames.

Quick Installation

pip install datacompy

Pandas Detail

DataComPy will try to join two dataframes either on a list of join columns, or on indexes. If the two dataframes have duplicates based on join values, the match process sorts by the remaining fields and joins based on that row number.

Column-wise comparisons attempt to match values even when dtypes don't match. So if, for example, you have a column with decimal.Decimal values in one dataframe and an identically-named column with float64 dtype in another, it will tell you that the dtypes are different but will still try to compare the values.
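
The duplicate handling described above can be pictured with plain pandas. This is only a sketch of the idea (sort by the non-join fields, number the rows within each join key, then join on key plus row number), not DataComPy's actual implementation. The helper name add_order_column and the _order column are hypothetical names chosen for illustration.

```python
import pandas as pd


def add_order_column(df, join_columns, order_name="_order"):
    """Number rows within each join-key group after sorting by the other columns."""
    other_columns = [c for c in df.columns if c not in join_columns]
    ordered = df.sort_values(by=other_columns).copy()
    ordered[order_name] = ordered.groupby(join_columns).cumcount()
    return ordered


# df1 has a duplicated join key ("foo"); df2 does not.
df1 = pd.DataFrame({"key": ["foo", "bar", "foo"], "value": [5, 2, 1]})
df2 = pd.DataFrame({"key": ["foo", "bar"], "value": [5, 6]})

left = add_order_column(df1, ["key"])
right = add_order_column(df2, ["key"])

# Join on the key plus the per-key row number, keeping track of where rows came from.
merged = left.merge(
    right,
    how="outer",
    on=["key", "_order"],
    suffixes=("_df1", "_df2"),
    indicator=True,
)
both = merged[merged["_merge"] == "both"]
only_df1 = merged[merged["_merge"] == "left_only"]
```

Under this scheme the second "foo" row of df1 has no partner with the same row number in df2, so it lands in only_df1 instead of being matched a second time.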

Basic Usage

from io import StringIO
import pandas as pd
import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  #You can also specify a list of columns
    abs_tol=0, #Optional, defaults to 0
    rel_tol=0, #Optional, defaults to 0
    df1_name='Original', #Optional, defaults to 'df1'
    df2_name='New' #Optional, defaults to 'df2'
    )
compare.matches(ignore_extra_columns=False)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())

See docs for more detailed usage instructions and an example of the report output.
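
To get a feel for what abs_tol and rel_tol change in the example above, here is a minimal sketch of a tolerance check. It assumes a numpy.isclose-style rule, where a difference is accepted when it is no larger than abs_tol plus rel_tol times the second value; the helper name values_match is hypothetical and is not part of DataComPy.

```python
import numpy as np


def values_match(col_1, col_2, abs_tol=0.0, rel_tol=0.0):
    """Element-wise closeness check; values missing on both sides count as equal."""
    return np.isclose(
        np.asarray(col_1, dtype=float),
        np.asarray(col_2, dtype=float),
        rtol=rel_tol,
        atol=abs_tol,
        equal_nan=True,
    )


# dollar_amt values taken from the first three rows of the sample data above.
original = [123.45, 0.45, 1345.0]
new = [123.4, 0.45, 1345.0]

exact = values_match(original, new)               # default tolerances of 0
loose = values_match(original, new, abs_tol=0.1)  # accept differences up to 0.1
```

With the default tolerances the first pair (123.45 vs 123.4) counts as a mismatch; with abs_tol=0.1 it is treated as a match.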

Things that are happening behind the scenes

- You pass in two dataframes (df1, df2) to datacompy.Compare and a column to join on (or list of columns) to join_columns. By default the comparison needs to match values exactly, but you can pass in abs_tol and/or rel_tol to apply absolute and/or relative tolerances for numeric columns.
  - You can pass in on_index=True instead of join_columns to join on the index instead.
- The class validates that you passed dataframes, that they contain all of the columns in join_columns, and that their column names are otherwise unique. The class also lowercases all column names to disambiguate.
- On initialization the class validates inputs and runs the comparison.
- Compare.matches() will return True if the dataframes match, False otherwise.
  - You can pass in ignore_extra_columns=True to not return False just because there are non-overlapping column names (it will still check the overlapping columns).
  - NOTE: if you only want to validate whether a dataframe matches exactly or not, you should look at pandas.testing.assert_frame_equal. The main use case for datacompy is when you need to interpret the difference between two dataframes.
- Compare also has some shortcuts like
  - intersect_rows, df1_unq_rows, df2_unq_rows for getting intersection, just df1 and just df2 records (DataFrames)
  - intersect_columns(), df1_unq_columns(), df2_unq_columns() for getting intersection, just df1 and just df2 columns (Sets)
- You can turn on logging to see more detailed logs.

Spark Detail

DataComPy's SparkCompare class will join two dataframes on a list of join columns. It has the capability to map column names that may be different in each dataframe, including in the join columns. You are responsible for creating the dataframes from any source which Spark can handle and specifying a unique join key. If there are duplicates in either dataframe by join key, the match process will remove the duplicates before joining (and tell you how many duplicates were found).

As with the Pandas-based Compare class, comparisons will be attempted even if dtypes don't match. Any schema differences will be reported in the output as well as in any mismatch reports, so that you can assess whether a type mismatch is a problem.

The main reasons why you would choose to use SparkCompare over Compare are that your data is too large to fit into memory, or you're comparing data that works well in a Spark environment, like partitioned Parquet, CSV, or JSON files, or Cerebro tables.

Performance Implications

Spark scales incredibly well, so you can use SparkCompare to compare billions of rows of data, provided you spin up a big enough cluster. Still, joining billions of rows of data is an inherently large task, so there are a couple of things you may want to take into consideration when getting into the cliched realm of "big data":

- SparkCompare will compare all columns in common in the dataframes and report on the rest. If there are columns in the data that you don't care to compare, use a select statement/method on the dataframe(s) to filter those out. Particularly when reading from wide Parquet files, this can make a huge difference when the columns you don't care about don't have to be read into memory and included in the joined dataframe.
- For large datasets, adding cache_intermediates=True to the SparkCompare call can help optimize performance by caching certain intermediate dataframes in memory, like the de-duped version of each input dataset, or the joined dataframe. Otherwise, Spark's lazy evaluation will recompute those each time it needs the data in a report or as you access instance attributes. This may be fine for smaller dataframes, but will be costly for larger ones. You do need to ensure that you have enough free cache memory before you do this, so this parameter is set to False by default.
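
Both suggestions can be combined in a small helper. This is a sketch under stated assumptions: the helper name compare_selected_columns is hypothetical, the join columns are assumed to be plain column names shared by both dataframes (no name mapping), and the SparkCompare call follows the signature used elsewhere in this README.

```python
def compare_selected_columns(spark, base_df, compare_df, join_columns, columns_to_compare):
    """Compare only the listed columns and cache the intermediate dataframes."""
    # Imported lazily so the helper can be defined before Spark is set up.
    import datacompy

    # Keep the join columns plus the columns of interest, without repeating any.
    keep = list(join_columns) + [c for c in columns_to_compare if c not in join_columns]

    return datacompy.SparkCompare(
        spark,
        base_df.select(*keep),      # unselected columns never enter the join
        compare_df.select(*keep),
        join_columns=list(join_columns),
        cache_intermediates=True,   # needs enough free cache memory on the cluster
    )
```

For example, compare_selected_columns(spark, base_df, compare_df, ['acct_id'], ['dollar_amt', 'name']) would compare only those two columns for each account.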

Basic Usage

import datetime
import datacompy
from pyspark.sql import Row

# This example assumes you have a SparkSession named "spark" in your environment, as you
# do when running `pyspark` from the terminal or in a Databricks notebook (Spark v2.0 and higher)

data1 = [
    Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis', float_fld=14530.1555,
        date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=1.0,
        date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=None,
        date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw', float_fld=345.12,
        date_fld=datetime.date(2017, 1, 1)),
    Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth', float_fld=None,
        date_fld=datetime.date(2017, 1, 1))
]

data2 = [
    Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155),
    Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None),
    Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=1.0),
    Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12),
    Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0)
]

base_df = spark.createDataFrame(data1)
compare_df = spark.createDataFrame(data2)

comparison = datacompy.SparkCompare(spark, base_df, compare_df, join_columns=['acct_id'])

# This prints out a human-readable report summarizing differences
comparison.report()

Using SparkCompare on EMR or standalone Spark

1. Set proxy variables
2. Create a virtual environment, if desired (virtualenv venv; source venv/bin/activate)
3. Pip install datacompy and requirements
4. Ensure your SPARK_HOME environment variable is set (this is probably /usr/lib/spark but may differ based on your installation)
5. Augment your PYTHONPATH environment variable with export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$SPARK_HOME/python:$PYTHONPATH (note that your version of py4j may differ depending on the version of Spark you're using)

Using SparkCompare on Databricks

1. Clone this repository locally
2. Create a datacompy egg by running python setup.py bdist_egg from the repo root directory.
3. From the Databricks front page, click the "Library" link under the "New" section.
4. On the New library page:
   1. Change source to "Upload Python Egg or PyPi"
   2. Under "Upload Egg", Library Name should be "datacompy"
   3. Drag the egg file in datacompy/dist/ to the "Drop library egg here to upload" box
   4. Click the "Create Library" button
5. Once the library has been created, from the library page (which you can find in your /Users/{login} workspace), you can choose clusters to attach the library to.
6. import datacompy in a notebook attached to the cluster that the library is attached to and enjoy!

Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.

Roadmap

Roadmap details can be found here

Comments

compare.df1_unq_rows and compare.df2_unq_rows don't provide the correct index of unique rows from respective dataFrames

Row indexes are not right when trying to find the unique rows that are missing from the respective dataframes. compare.df1_unq_rows and compare.df2_unq_rows return the unique rows that are missing from the respective dataframes, but their indexes are not returned correctly. Please have a look at this issue. It would be really great if the right indexes were returned when using the mentioned functions.

opened by NiteshOO7 11

All Mismatches: Add options to exclude and include column types

This PR adds the following three parameters to all_mismatched():

        """All rows with any columns that have a mismatch. Returns all df1 and df2 versions of the columns and join
        columns.

        Parameters
        ----------
        exclude_matched_columns : bool, optional
            When set to True, columns where all rows are matched are excluded.
        simplify_matched : bool, optional
            When set to True, only a single column will be included for
            columns that had no changes between the two dataframes.
        include_match_col : bool, optional
            Whether to include a `xxx_match` column that indicates whether
            the column data has changed for the row.

        Returns
        -------
        Pandas.DataFrame
            All rows of the intersection dataframe, containing any columns, that don't match.
        """

Please let me know if you'd like me to make any changes before it gets merged and thanks for creating this great library.

opened by infused-kim 8

Doesn't take into account join columns when doing strip() of white spaces

Join columns can have whitespace as well, so ideally the section below should be enhanced to check ignore_spaces and, if it is true, strip() string columns. This currently causes a complete mismatch of the two dataframes when one of the join columns contains a space:

if on_index and join_columns is not None:
    raise Exception("Only provide on_index or join_columns")
elif on_index:
    self.on_index = True
    self.join_columns = []
elif isinstance(join_columns, str):
    self.join_columns = [join_columns.lower()]
    self.on_index = False
else:
    self.join_columns = [col.lower() for col in join_columns]
    self.on_index = False
help wanted good first issue question

opened by PABNY 7
Reporting more rows that are unique in source and not in target. Currently it's limited to 10

How do I get more rows shown in the report, for example for rows available in source but not in target? If I want to see maybe 100 rows, how do I do that?

I tried using print(compare.df1_unq_rows(sample_count=50),file=tgtuniqrows) but it gave me the error TypeError: 'DataFrame' object is not callable

Can you please help here?
question
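
The TypeError in this question is consistent with df1_unq_rows being a DataFrame attribute rather than a method, as the shortcut list in the README describes it. A minimal sketch of one way to write more rows to a file follows; the frame below is a hypothetical stand-in for compare.df1_unq_rows, and an in-memory buffer stands in for the tgtuniqrows file handle.

```python
import io

import pandas as pd

# Hypothetical stand-in for compare.df1_unq_rows (a plain DataFrame).
unique_rows = pd.DataFrame({"acct_id": range(100), "name": ["x"] * 100})

# Calling a DataFrame as if it were a method reproduces the reported error.
try:
    unique_rows(sample_count=50)
    call_failed = False
except TypeError:
    call_failed = True

# Slice the frame instead; a file opened with open(...) works the same way as this buffer.
out = io.StringIO()
print(unique_rows.head(50).to_string(index=False), file=out)
line_count = len(out.getvalue().splitlines())  # one header line plus 50 rows
```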

opened by sujitsatapathy84 6
Project CLA

@ryanshiroma Can you please sign the project CLA for the datacompy project? If you plan to submit a PR in the near future, the CLA-Assistant bot will prompt you to sign. If you aren't planning to submit any new PRs to the project, please use this link to sign: https://docs.google.com/forms/d/e/1FAIpQLSfwtl1s6KmpLhCY6CjiY8nFZshDwf_wrmNYx1ahpsNFXXmHKw/viewform

We appreciate your contributions to the project and your understanding on the contribution requirement for CLA.

opened by tmbjmu 6
No. of rows shown in comparison are limited to 10.

'Sample Rows with Unequal Values' shows just 10 rows which are different. I want to see what changed in all the rows in the 2 dataframes, not just 10 rows.

Is there a way I can see all the unequal rows?

Please advise.
help wanted question

opened by ZlatGod 6
Drop Matching Rows and Columns in Mismatch Output for Readability.

Currently the csv that's being written out has all the columns written out on a failed comparison making it harder to read with a larger data-set. This is more of a feature request but the ability to only have columns that didn't match written to the mismatch csv would really help and speed up the QA process on our end.
enhancement help wanted good first issue

opened by alegarpa 6
Use less restrictive dependency version constraints.

Changes

We use datacompy as a library in the project using poetry to lock dependencies. And using the strict upper bounds for minor versions makes it hard to update our dependencies to include bug and security fixes. This PR changes the version constraints to be more permissive.

opened by kleschenko 5
Use OrderedSet to maintain DataFrame column order through set operations

@fdosani @KrishanBhasin

Fixes #81. Iteration on #107. This aims to maintain the original order when printing out the set of common columns as well as unique columns, both which are based on set operations.

opened by gandhis1 5
Understanding Intersect_rows when df1 and df2 have one-to-many relationships
If I am trying to compare the intersect rows for the following two dataframes:

df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'value': [5, 6, 7]})
compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['key'],  #You can also specify a list of columns
    abs_tol=0,
    rel_tol=0,
    df1_name='df1',
    df2_name='df2')

compare.intersect_rows only returns the following rows:

key | value_df1 | value_df2 | _merge | value_match
-- | -- | -- | -- | --
foo | 1 | 5 | both | False
bar | 2 | 6 | both | False
baz | 3 | 7 | both | False

While I am expecting an additional row:

key | value_df1 | value_df2 | _merge | value_match
-- | -- | -- | -- | --
foo | 5 | 5 | both | True

In fact, when you use pandas merge on df1 and df2, you do see the last row in df1:

df1.merge(df2, how="outer", suffixes=("_df1", "_df2"), indicator=True, on='key')

key | value_df1 | value_df2 | _merge
-- | -- | -- | --
foo | 1 | 5 | both
foo | 5 | 5 | both
bar | 2 | 6 | both
baz | 3 | 7 | both

Why is the last row from df1 missing from intersect_row dataframe?
opened by mangoLily 5
Use type hinting & support Python 3.5+
Since we target Python 3.5+ in our CI testing, we can make this explicit and commit to only supporting that version and higher. This will enable us to use type hinting in our code, which would have a few benefits:

IDE support for types while developing datacompy

IDE support for types for users using datacompy

Ability to do static type checking in our CI pipeline

Automatic type inference during documentation generation

This should probably best be done in concert with something like #13 as long as we're messing with the code, but since this will involve committing to supporting certain versions of Python, it should be a conscious choice. Python 3.4 reached EOL in March 2019 so I don't see a problem with explicitly targeting 3.5+.
opened by dan-coates 5

Replace called_with with assert_called_with

called_with is an incorrect method that just returns a Mock object that evaluates to True and doesn't perform appropriate assertion. This method should be replaced with assert_called_with . Related CPython issue : https://github.com/python/cpython/issues/100690 . Changes cause below test failures.

pytest tests/test_core.py           
==================================================================== test session starts =====================================================================
platform linux -- Python 3.11.1, pytest-7.2.0, pluggy-1.0.0
rootdir: /root/checked_repos_clone_3900_4000/datacompy, configfile: pytest.ini
plugins: cov-2.12.1
collected 81 items                                                                                                                                           

tests/test_core.py .............................FF..................................................                                                   [100%]

========================================================================== FAILURES ==========================================================================
________________________________________________________________________ test_subset _________________________________________________________________________

mock_debug = <MagicMock name='debug' id='140448511508688'>

    @mock.patch("datacompy.logging.debug")
    def test_subset(mock_debug):
        df1 = pd.DataFrame([{"a": 1, "b": 2, "c": "hi"}, {"a": 2, "b": 2, "c": "yo"}])
        df2 = pd.DataFrame([{"a": 1, "c": "hi"}])
        comp = datacompy.Compare(df1, df2, ["a"])
        assert comp.subset()
>       mock_debug.assert_called_with("Checking equality")

tests/test_core.py:529: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <MagicMock name='debug' id='140448511508688'>, args = ('Checking equality',), kwargs = {}, expected = "debug('Checking equality')"
actual = 'not called.', error_message = "expected call not found.\nExpected: debug('Checking equality')\nActual: not called."

    def assert_called_with(self, /, *args, **kwargs):
        """assert that the last call was made with the specified arguments.
    
        Raises an AssertionError if the args and keyword args passed in are
        different to the last call to the mock."""
        if self.call_args is None:
            expected = self._format_mock_call_signature(args, kwargs)
            actual = 'not called.'
            error_message = ('expected call not found.\nExpected: %s\nActual: %s'
                    % (expected, actual))
>           raise AssertionError(error_message)
E           AssertionError: expected call not found.
E           Expected: debug('Checking equality')
E           Actual: not called.

/usr/lib/python3.11/unittest/mock.py:924: AssertionError
______________________________________________________________________ test_not_subset _______________________________________________________________________

mock_info = <MagicMock name='info' id='140448491660176'>

    @mock.patch("datacompy.logging.info")
    def test_not_subset(mock_info):
        df1 = pd.DataFrame([{"a": 1, "b": 2, "c": "hi"}, {"a": 2, "b": 2, "c": "yo"}])
        df2 = pd.DataFrame([{"a": 1, "b": 2, "c": "hi"}, {"a": 2, "b": 2, "c": "great"}])
        comp = datacompy.Compare(df1, df2, ["a"])
        assert not comp.subset()
>       mock_info.assert_called_with("Sample c mismatch: a: 2, df1: yo, df2: great")

tests/test_core.py:538: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <MagicMock name='info' id='140448491660176'>, args = ('Sample c mismatch: a: 2, df1: yo, df2: great',), kwargs = {}
expected = "info('Sample c mismatch: a: 2, df1: yo, df2: great')", actual = 'not called.'
error_message = "expected call not found.\nExpected: info('Sample c mismatch: a: 2, df1: yo, df2: great')\nActual: not called."

    def assert_called_with(self, /, *args, **kwargs):
        """assert that the last call was made with the specified arguments.
    
        Raises an AssertionError if the args and keyword args passed in are
        different to the last call to the mock."""
        if self.call_args is None:
            expected = self._format_mock_call_signature(args, kwargs)
            actual = 'not called.'
            error_message = ('expected call not found.\nExpected: %s\nActual: %s'
                    % (expected, actual))
>           raise AssertionError(error_message)
E           AssertionError: expected call not found.
E           Expected: info('Sample c mismatch: a: 2, df1: yo, df2: great')
E           Actual: not called.

/usr/lib/python3.11/unittest/mock.py:924: AssertionError
====================================================================== warnings summary ======================================================================
../../checked_repos/taskcat/.env/lib/python3.11/site-packages/_pytest/config/__init__.py:1294
  /root/checked_repos/taskcat/.env/lib/python3.11/site-packages/_pytest/config/__init__.py:1294: PytestConfigWarning: Unknown config option: spark_options
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

tests/test_core.py:29
  /root/checked_repos_clone_3900_4000/datacompy/tests/test_core.py:29: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
    from pandas.util.testing import assert_series_equal

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================== short test summary info ===================================================================
FAILED tests/test_core.py::test_subset - AssertionError: expected call not found.
FAILED tests/test_core.py::test_not_subset - AssertionError: expected call not found.
========================================================== 2 failed, 79 passed, 2 warnings in 1.69s ==========================================================
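
The difference between called_with and assert_called_with described in this PR can be reproduced with the standard library alone. This is a small sketch, not code from the DataComPy test suite. The mock below stands in for a patched logging function that is never called, and the AttributeError branch assumes that newer Python versions reject the misspelled name, as the linked CPython issue proposes.

```python
from unittest import mock

logger = mock.MagicMock()  # stands in for a patched logging function; never called

# Misspelled "assertion": on older Pythons this silently creates a child mock,
# calls it, and returns another mock that is truthy, so nothing is checked.
try:
    typo_result = logger.called_with("Checking equality")
    typo_looks_like_pass = bool(typo_result)
except AttributeError:
    # Assumption: newer Pythons refuse the misspelled attribute name.
    typo_looks_like_pass = False

# Real assertion: raises because the mock itself was never called.
try:
    logger.assert_called_with("Checking equality")
    real_assertion_failed = False
except AssertionError:
    real_assertion_failed = True
```

Only the second form can fail, which is why switching to it surfaces the two failures shown in the log above.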

opened by tirkarthi 0

bump up minimum python version to 3.8
There are some security vulns which keep coming up with Python 3.7 support. In the next release (0.9.x) we should drop Python 3.7 support. Should be an easy change and fix.

Following: https://github.com/capitalone/datacompy/blob/develop/setup.cfg#L14 should become >=3.8.0

Doc / README updates would be needed also

enhancement help wanted good first issue
opened by fdosani 0
Cannot interpret 'datetime64[ns, UTC]' as a data type

Any idea when support for datetime64[ns, UTC] will be implemented ? I get this error Cannot interpret 'datetime64[ns, UTC]' as a data type when using compare function
enhancement help wanted good first issue

opened by Rstar1998 2
Mismatch between Summary Report and Actual

Summary report shows there are records in base and compare dataset that did not have corresponding matches

****** Row Summary ******
Number of rows in common: 2971140
Number of rows in base but not compare: 41536
Number of rows in compare but not base: 41536
Number of duplicate rows found in base: 3721
Number of duplicate rows found in compare: 3721

but when i try to get the records in base and/or compare that do not have matches.. it returns 0...

Using latest Spark version to do comparison...

Any thoughts/suggestions on what might be the issue. I was hoping to see 41536 records for compare.rows_only_compare
bug question

opened by guptaat 8

Releases(v0.8.3)

v0.8.3(Nov 2, 2022)
What's Changed

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/168

updating download badge by @fdosani in https://github.com/capitalone/datacompy/pull/169

bump version to v0.8.3 by @fdosani in https://github.com/capitalone/datacompy/pull/171

Update ROADMAP.rst by @fdosani in https://github.com/capitalone/datacompy/pull/174

Release v0.8.3 by @fdosani in https://github.com/capitalone/datacompy/pull/172

Full Changelog: https://github.com/capitalone/datacompy/compare/v0.8.2...v0.8.3
v0.8.2(Oct 19, 2022)
What's Changed

WhiteSource Configuration Migration by @mend-for-github-com in https://github.com/capitalone/datacompy/pull/145

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/149

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/150

Branch updates and HTML Report by @fdosani in https://github.com/capitalone/datacompy/pull/152

Edgetest update by @fdosani in https://github.com/capitalone/datacompy/pull/153

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/156

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/158

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/161

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/162

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/164

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/165

To address issue #42, added a parameter to the all_mismatch method to control the addition of matching columns to the output. by @azadekhalaj in https://github.com/capitalone/datacompy/pull/166

Release v0.8.2 by @fdosani in https://github.com/capitalone/datacompy/pull/167

New Contributors

@mend-for-github-com made their first contribution in https://github.com/capitalone/datacompy/pull/145

@azadekhalaj made their first contribution in https://github.com/capitalone/datacompy/pull/166

Full Changelog: https://github.com/capitalone/datacompy/compare/v0.8.1...v0.8.2
v0.8.1(Apr 19, 2022)
What's Changed

Create publish-package.yml by @fdosani in https://github.com/capitalone/datacompy/pull/130

adding edgetest action by @fdosani in https://github.com/capitalone/datacompy/pull/131

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/132

Create whitesource.config by @fdosani in https://github.com/capitalone/datacompy/pull/138

Changes by run-edgetest action by @github-actions in https://github.com/capitalone/datacompy/pull/137

updating whitesource config by @fdosani in https://github.com/capitalone/datacompy/pull/139

bump version to 0.8.1 by @fdosani in https://github.com/capitalone/datacompy/pull/140

Release 0.8.1 by @fdosani in https://github.com/capitalone/datacompy/pull/141

New Contributors

@github-actions made their first contribution in https://github.com/capitalone/datacompy/pull/132

Full Changelog: https://github.com/capitalone/datacompy/compare/v0.8.0...v0.8.1
v0.8.0(Mar 2, 2022)
What's Changed

Adding edgetest, general house keeping with GitHub Actions, and dropping Python 3.6 testing/support

GitHub Action updates:

#119

#120

#126

Edgetest updates:

#125

Full Changelog: https://github.com/capitalone/datacompy/compare/v0.7.3...v0.8.0
v0.7.3(Oct 25, 2021)
Release v0.7.3:

Make column order in reports deterministic (alphabetical) (#107)

Use OrderedSet to maintain DataFrame column order throughout comparison set operations (#108)

General house keeping (#109, #110, #111, #113)

adding column count (#116)

v0.7.2(Feb 11, 2021)

Release v0.7.2:

Updates to CODEOWNERS (#80, #86)

Bumped min pandas version to 0.25 (#92)

Move from Travis to GH Actions (#90, #91, #96)

Doc cleanup (#98, #100)

Unit test cleanup (#99)

Suppress "Sample Rows" section in the output report (#102)

v0.7.1(Jul 21, 2020)
Release v0.7.1

Lower casing columns and casting as string (#69)

Adding whitesource config (#70)

v0.7.0(Jun 9, 2020)
dropping Python 2 support (3.5+ only)

Add datacompy[spark] pip install option (#54)

strip spaces in join columns (#62)

new function called all_mismatch. This will provide all rows which mismatch back as a dataframe so users can export, query, or analyze if there are a lot of them. (#64)

create MANIFEST.in (#66)

v0.6.0(Jan 25, 2019)

Small bug fix for Python 2.7 installations.
v0.5.2(Jan 23, 2019)
Changes since 0.5.1:

Added ignore_spaces and ignore_case flags for more flexible string comparisons

Fixed a bug (#35) with duplicate matching when nulls are present in the join columns

Added in pre-commit and black for code formatting

v0.5.1(May 19, 2018)

Adding in rel_tol, abs_tol, show_all_columns, and match_rates for SparkCompare
v0.5.0(Mar 27, 2018)

First release to public GitHub!

Owner

Capital One

We’re an open source-first organization — actively using, contributing to and managing open source software projects.

GitHub https://capitalone.github.io/datacompy/

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

5 Sep 28, 2022

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format.

2 Dec 1, 2021

Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 1, 2023

Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

1.1k Dec 28, 2022

Bearsql allows you to query pandas dataframe with sql syntax.

Bearsql adds sql syntax on pandas dataframe. It uses duckdb to speedup the pandas processing and as the sql engine

14 Jun 22, 2022

Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

249 Jan 8, 2023

Important dataframe statistics with a single command

quick_eda Receiving dataframe statistics with one command Project description A python package for Data Scientists, Students, ML Engineers and anyone

2 Dec 19, 2021

A data structure that extends pyspark.sql.DataFrame with metadata information.

MetaFrame A data structure that extends pyspark.sql.DataFrame with metadata info

8 Feb 15, 2022

Building house price data pipelines with Apache Beam and Spark on GCP

This project contains the process from building a web crawler to extract the raw data of house price to create ETL pipelines using Google Could Platform services.

1 Nov 22, 2021

Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Using Streaming Twitter Data with Kafka and Spark Reading streams of Twitter data, publishing them to Kafka topic, process message using Kafka Stream

1 Dec 6, 2021

This mini project showcase how to build and debug Apache Spark application using Python

Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark application on Spark container

1 Dec 29, 2021

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python This project is a good starting point for those who have little

2 Dec 4, 2021

The Spark Challenge Student Check-In/Out Tracking Script

The Spark Challenge Student Check-In/Out Tracking Script This Python Script uses the Student ID Database to match the entries with the ID Card Swipe a

1 Dec 9, 2021

PySpark project that performs joins on Spark DataFrames.

SPARK JOINS This project is to perform inner, all outer joins and semi joins. create_df.py: load_data.py : helps to put data into Spark data frames. d

1 Dec 14, 2021

BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

1 Jan 6, 2022

A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

102 Nov 10, 2022

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

3.3k Jan 4, 2023

NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

3.1k Jan 5, 2023

Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

weightedcalcs weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more. Features Plays we

98 Dec 31, 2022
