Monitor the stability of a pandas or spark dataframe ⚙︎

ING Bank

Last update: Dec 7, 2022

Related tags

Data Analysis python tracking data-science statistics spark monitoring jupyter ipython pandas data-analysis statistical-tests hacktoberfest data-profiling ing-bank statistical-process-control mlops data-distributions popmon population-monitoring

Overview

Population Shift Monitoring

popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

popmon creates histograms of features binned in time slices and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional, so it can also track correlations between any two features. Using monitoring business rules, popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, and changing correlations.
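To make the idea concrete, here is a minimal standard-library sketch of the approach described above: one histogram per time slice, each compared against a reference. It is not popmon's implementation. The helper names, the toy data, and the choice of the population stability index (PSI) as the comparison are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def histogram_per_slice(records, bin_width):
    """Group (time_slice, value) pairs into one binned histogram per slice."""
    hists = defaultdict(Counter)
    for time_slice, value in records:
        hists[time_slice][math.floor(value / bin_width)] += 1
    return dict(hists)


def psi(reference, current, eps=1e-6):
    """Population stability index between two bin-count histograms."""
    n_ref = sum(reference.values())
    n_cur = sum(current.values())
    total = 0.0
    for b in set(reference) | set(current):
        # eps avoids log(0) for bins that are empty in one histogram
        p = max(reference.get(b, 0) / n_ref, eps)
        q = max(current.get(b, 0) / n_cur, eps)
        total += (q - p) * math.log(q / p)
    return total


# Toy data: (week number, age). Week 2 is shifted towards older ages.
records = [(0, a) for a in (21, 25, 33, 37, 41, 45)]
records += [(1, a) for a in (22, 26, 31, 38, 42, 47)]
records += [(2, a) for a in (51, 55, 58, 62, 66, 69)]

hists = histogram_per_slice(records, bin_width=10)
reference = hists[0]  # use the first week as a static reference
scores = {week: psi(reference, hist) for week, hist in hists.items()}
for week, score in sorted(scores.items()):
    print(f"week {week}: psi={score:.3f}")
```

Weeks whose binned distribution matches the reference score zero, while the shifted week scores much higher. A monitoring rule would then flag scores above a chosen threshold.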

Announcements

Spark 3.0

With Spark 3.0, based on Scala 2.12, make sure to pick up the correct histogrammar jar files:

spark = SparkSession.builder.config(
    "spark.jars.packages",
    "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
).getOrCreate()

For Spark 2.X compiled against Scala 2.11, simply replace 2.12 with 2.11 in the string above.
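If you switch between Spark installations, a small helper can build the package string for you. The function name and its defaults below are hypothetical and not part of popmon; only the package coordinates are taken from the snippet above.

```python
def histogrammar_packages(scala_version="2.12", histogrammar_version="1.0.20"):
    """Build the value for the spark.jars.packages setting (hypothetical helper)."""
    artifacts = ("histogrammar", "histogrammar-sparksql")
    return ",".join(
        f"io.github.histogrammar:{name}_{scala_version}:{histogrammar_version}"
        for name in artifacts
    )


packages_spark3 = histogrammar_packages()        # Scala 2.12 (Spark 3.0)
packages_spark2 = histogrammar_packages("2.11")  # Scala 2.11 (Spark 2.X)
print(packages_spark3)

# Usage with pyspark (not executed here):
# spark = SparkSession.builder.config("spark.jars.packages", packages_spark2).getOrCreate()
```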

Examples

Documentation

The entire popmon documentation including tutorials can be found at read-the-docs.

Notebooks

Tutorial	Colab link
Basic tutorial
Detailed example (featuring configuration, Apache Spark and more)
Incremental datasets (online analysis)
Report interpretation (step-by-step guide)

Check it out

The popmon library requires Python 3.6+ and is pip friendly. To get started, simply do:

$ pip install popmon

or check out the code from our GitHub repository:

$ git clone https://github.com/ing-bank/popmon.git
$ pip install -e popmon

where in this example the code is installed in edit mode (option -e).

You can now use the package in Python with:

import popmon

Congratulations, you are now ready to use the popmon library!

Quick run

As a quick example, you can do:

import pandas as pd
import popmon
from popmon import resources

# open synthetic data
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()

# generate stability report using automatic binning of the selected features
# (importing popmon automatically adds this functionality to a dataframe)
report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])

# to show the output of the report in a Jupyter notebook you can simply run:
report

# or save the report to file
report.to_file("monitoring_report.html")

To specify your own binning and the features you want to report on, do:

# time-axis specifications alone; all other features are auto-binned.
report = df.pm_stability_report(
    time_axis="date", time_width="1w", time_offset="2020-1-6"
)

# histogram selections. Here 'date' is the first axis of each histogram.
features = [
    "date:isActive",
    "date:age",
    "date:eyeColor",
    "date:gender",
    "date:latitude",
    "date:longitude",
    "date:isActive:age",
]

# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs = {
    "longitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "latitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "age": {"bin_width": 10.0, "bin_offset": 0.0},
    "date": {
        "bin_width": pd.Timedelta("4w").value,
        "bin_offset": pd.Timestamp("2015-1-1").value,
    },
}

# generate stability report
report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)
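The sketch below shows how a bin_width and bin_offset pair can map a value to a bin, and why the time axis is expressed in nanoseconds. It assumes the common floor((value - offset) / width) convention for open-ended, fixed-width bins; the function names are hypothetical, and popmon's internals may differ in detail.

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bin_index(value, bin_width, bin_offset=0.0):
    """Index of the fixed-width bin that contains `value` (assumed convention)."""
    return math.floor((value - bin_offset) / bin_width)


def to_nanoseconds(delta):
    """Convert a timedelta to integer nanoseconds."""
    return (delta // timedelta(microseconds=1)) * 1000


# Numeric features: bins of width 10 (age) and 5 (longitude), starting at 0.
age_bin = bin_index(37, bin_width=10.0)          # 30 <= 37 < 40
longitude_bin = bin_index(-7.5, bin_width=5.0)   # -10 <= -7.5 < -5

# Time axis: four-week bins counted from 2015-01-01, all in nanoseconds.
four_weeks_ns = to_nanoseconds(timedelta(weeks=4))
offset_ns = to_nanoseconds(datetime(2015, 1, 1, tzinfo=timezone.utc) - EPOCH)
event_ns = to_nanoseconds(datetime(2015, 2, 15, tzinfo=timezone.utc) - EPOCH)
date_bin = bin_index(event_ns, four_weeks_ns, offset_ns)

print(age_bin, longitude_bin, date_bin)
```

Because the bins are open-ended, values below the offset simply receive negative indices; no range has to be declared up front.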

These examples also work with spark dataframes. You can see the output of such example notebook code here. For all available examples, please see the tutorials at read-the-docs.

Pipelines for monitoring dataset shift

Advanced users can leverage popmon's modular data pipeline to customize their workflow. Visualization of the pipeline can be useful when debugging, or for didactic purposes. There is a script included with the package that you can use. The plotting is configurable, and depending on the options you will obtain a result that can be used for understanding the data flow, the high-level components and the (re)use of datasets.

Figure: example pipeline visualization.

Resources

Presentations

Title	Host	Date	Speaker
Popmon - population monitoring made easy	Big Data Technology Warsaw Summit 2021	February 25, 2021	Simon Brugman
Popmon - population monitoring made easy	Data Lunch @ Eneco	October 29, 2020	Max Baak, Simon Brugman
Popmon - population monitoring made easy	Data Science Summit 2020	October 16, 2020	Max Baak
Population Shift Monitoring Made Easy: the popmon package	Online Data Science Meetup @ ING WBAA	July 8, 2020	Tomas Sostak
Popmon: Population Shift Monitoring Made Easy	PyData Fest Amsterdam 2020	June 16, 2020	Tomas Sostak
Popmon: Population Shift Monitoring Made Easy	Amundsen Community Meetup	June 4, 2020	Max Baak

Articles

Title	Date	Author
Population Shift Analysis: Monitoring Data Quality with Popmon	May 21, 2021	Vito Gentile
Popmon Open Source Package — Population Shift Monitoring Made Easy	May 20, 2020	Nicole Mpozika

Project contributors

This package was authored by ING Wholesale Banking Advanced Analytics. Special thanks to the following people who have contributed to the development of this package: Ahmet Erdem, Fabian Jansen, Nanne Aben, Mathieu Grimal.

Contact and support

Issues & Ideas & Support: https://github.com/ing-bank/popmon/issues

Please note that ING WBAA provides support only on a best-effort basis.

License

Comments

feat: hist_juxtaposition

For now, the last_n is by default set to 2. Therefore, only two dates would appear in the dropdown. For the airline dataset if the last_n is set to max, popmon runs into the issue (for DEPARTURE feature) raised by Tomek https://github.com/ing-bank/popmon/issues/244.

closes ing-bank/popmon#230
enhancement

opened by pradyot-09 7
Error when stitching histograms

Discussed in https://github.com/ing-bank/popmon/discussions/142

Originally posted by jeaninejuliettes on September 29, 2021:

Hello,

I'm receiving an error when using stitch_histogram and I'm not sure what I'm doing wrong, hope anyone can help me. The error I get is: ValueError: Request to insert delta hists but time_bin_idx not set. Please do.

The steps I take:

I start with creating a histogrammar object of the original dataframe

hists = df.pm_make_histograms()
bin_specs = popmon.get_bin_specs(hists)

later on I receive a new batch of data, which I add to my existing histograms

new_hists = [new_df.pm_make_histograms(bin_specs=bin_specs)]
hists_2 = popmon.stitch_histograms(hists_basis=hists, hists_delta=new_hists, time_axis="batch")

so far so good, but when I try to repeat these steps with yet another new batch of data, I receive the error

new_hists_2 = [new_df_2.pm_make_histograms(bin_specs=bin_specs)]
hists_3 = popmon.stitch_histograms(hists_basis=hists_2, hists_delta=new_hists_2, time_axis="batch")

Is it not possible to stitch another histogram again? If not, I've found a somewhat cumbersome way to decide on a good value for my time_bin_idx. It works so far, but I expect it to fail with other data (or not to work as expected). The way I define the time_bin_idx value is: int(np.ceil(max(hists_2[next(iter(hists_2))].bin_centers()) + 1))

Hopefully you can point me in the right direction. Thanks!
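The idea behind the workaround in this post, picking the next free index on the "batch" axis explicitly, can be sketched with plain dictionaries. This is not popmon's stitch_histograms; the data layout (batch index mapped to bin counts) and both function names are assumptions for illustration only.

```python
from collections import Counter


def next_batch_index(stitched):
    """Return the first unused integer index on the 'batch' axis."""
    return max(stitched, default=-1) + 1


def stitch(stitched, delta, batch_index=None):
    """Return a copy of `stitched` with the `delta` counts added at `batch_index`."""
    if batch_index is None:
        batch_index = next_batch_index(stitched)
    result = {idx: Counter(hist) for idx, hist in stitched.items()}
    result.setdefault(batch_index, Counter()).update(delta)
    return result


# Three consecutive batches, each appended at the next free index.
hists_1 = stitch({}, Counter({"blue": 2, "brown": 1}))
hists_2 = stitch(hists_1, Counter({"blue": 1, "green": 1}))
hists_3 = stitch(hists_2, Counter({"brown": 3}))
print(sorted(hists_3))
```

Deriving the index from the keys already present avoids depending on bin centers, which is the part of the workaround the post expects to be fragile.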

opened by mbaak 6
Error: cannot import name 'Report' from 'popmon.config'

Code:

import popmon
from popmon import resources
from popmon.config import Report

Got error:

ImportError                               Traceback (most recent call last)
/tmp/ipykernel_707/1841834346.py in
      3 import popmon
      4 from popmon import resources
----> 5 from popmon.config import Report, Setting

ImportError: cannot import name 'Report' from 'popmon.config' (/home/user/.local/lib/python3.7/site-packages/popmon/config.py)

opened by lcheng61 4
Error with pydantic when using some custom settings in the report generation

With version 1.0.0, when using custom settings in df.pm_stability_report() like show_stats, I get an error stating such option is not allowed:

ValidationError: 2 validation errors for Settings

I couldn't reproduce it when using popmon==0.9.0.

opened by gus-morales 3
DataProfiler - A Scalable Data Profiling Library
Howdy!

I'm reaching out as a maintainer of the DataProfiler library.

I think it might be useful to your project so I'm reaching out! Would love to collaborate and see how we can help popmon.

We effectively wrote a library to improve upon the objectives of pandas-profiling with some neat added functionality:

Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL data = Data("your_filepath_or_url.csv")

Profile data: calculating statistics and doing entity detection (for PII) profile = Profiler(data)

Merge profiles: profile3 = profile1 + profile2; enabling distributed profile generation

Compare profiles: profile_diff = profile1.diff(profile2)

Generate reports: readable_report = profile.report(report_options={"output_format": "compact"})

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")  # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL
print(data.data.head(5))  # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data)  # Calculate Statistics, Entity Recognition, etc
readable_report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(readable_report, indent=4))
opened by lettergram 3
Library doesn't run in Spark 3.0+: Replace the dependency of histogrammar with native Spark functionality

Currently, the dependency on histogrammar, which hasn't been developed further since 2016, ties popmon to Scala 2.11. This prevents it from running on Spark 3.0, which is built only on Scala 2.12. I think I could replace the functionality with the Bucketizer in native Spark.

opened by kedemdor 3
Imports Optimized

isort helps you sort your import list. It simply tidies the script and increases readability.

There is no big change in the concept. Algorithms are still working as well.

I'm contributing for Hacktoberfest. I will appreciate it if you add the "Hacktoberfest" label to this PR. :) Thanks.

opened by lnxpy 3
missing tutorial datasets
Hi, awesome tool!

Advanced tutorial datasets are not in test_data dir, but still in notebooks dir, as far as I can see. Hence the advanced tutorial notebooks don't run out of the box, at least for me. I don't have permissions to push to a develop branch.

Changes to be committed:
(use "git reset HEAD ..." to unstage)

renamed: popmon/notebooks/flight_delays.csv.gz -> popmon/test_data/flight_delays.csv.gz
renamed: popmon/notebooks/flight_delays_reference.csv.gz -> popmon/test_data/flight_delays_reference.csv.gz

Cheers - Alex
opened by AlexKoutsman 3
build(deps): update docutils requirement from <0.17 to <0.20
Updates the requirements on docutils to permit the latest version.

dependencies
opened by dependabot[bot] 2
Feat/plotly express
The histograms, heatmaps and comparisons have been replaced with interactive Plotly graphs. Plotly.js is used to build the graphs on the go from json.

Initial tests show that Plotly reports are smaller than matplotlib reports and take far less time to generate.

Use the parameter 'online_report' to load plotly.js from a CDN server and use the report online. Otherwise, plotly.js is embedded in the report, which can then also be used offline.

closes ing-bank/popmon#164
enhancement reporting
opened by pradyot-09 2
Feature/heatmap time series plotting

Added heatmap feature to visualize features over time for EDA.

The user can set the heatmap color map by passing the 'cmap' argument to pm_stability_report(). The top_n argument can be set to deal with high cardinality. A specific heatmap can be disabled by giving its name in the disable_heatmap[] argument of pm_stability_report().

closes ing-bank/popmon#185 closes ing-bank/popmon#199

opened by pradyot-09 2
Rolling reference comparisons

A wide variety of references is provided by popmon out-of-the-box. A reference may be static (a fixed training set, or the current dataset itself for exploratory data analysis) or dynamic (sliding or growing as more data becomes available). The reference is compared against batches, and they can be sequential (batched) or sliding (rolling).

Popmon should enable all combinations, and currently lacks external reference + rolling comparison.

Type	Reference	Compare to	Implemented
Self-reference	Static	Self (batched)	✓
External reference	Static	Batched	✓
Rolling reference	Rolling	Rolling/sliding	✓
Expanding reference	Expanding	Rolling/sliding	✓
External reference	Static	Rolling/sliding	✗
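The difference between the reference types can be sketched as a choice of which batches form the reference for a given batch. The function below is a hypothetical illustration, not popmon's API, and popmon's actual window and shift semantics may differ.

```python
def reference_batches(kind, t, n_batches, window=2):
    """Indices of the batches that form the reference for batch `t`.

    An external reference is a separate dataset and is therefore not
    expressed in terms of batch indices here.
    """
    if kind == "self":       # static: the full dataset is its own reference
        return list(range(n_batches))
    if kind == "rolling":    # dynamic: the `window` batches just before t
        return list(range(max(0, t - window), t))
    if kind == "expanding":  # dynamic: every batch before t
        return list(range(t))
    raise ValueError(f"unknown reference kind: {kind!r}")


self_ref = reference_batches("self", t=5, n_batches=8)
rolling_ref = reference_batches("rolling", t=5, n_batches=8, window=3)
expanding_ref = reference_batches("expanding", t=3, n_batches=8)
print(self_ref, rolling_ref, expanding_ref)
```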

Thanks to @LorenaPoenaru!
enhancement

opened by sbrugman 0
code coverage of 100%

The risk of breaking functionality on introducing new features could be reduced by ensuring that each line of code is covered by the tests and that this is enforced at test time. Other repos, such as this also use this.

For that, we can add pytest-cov to the test dependencies and increase the test coverage until it passes (see this answer).
good first issue help wanted CI internal improvement

opened by sbrugman 0
Traffic light boundaries for count variables

The traffic light bounds provided by the pull/Z-score calculation are symmetrical. For count variables this can lead to bounds outside the constraints (below zero).
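A small numeric sketch of the problem: symmetric bounds of the form mean ± z·std can fall below zero for a low-count variable. Clipping the lower bound at zero is shown as one possible remedy. It is an assumption for illustration, not what popmon implements, and the function names and z value are hypothetical.

```python
import statistics


def z_score_bounds(reference, z):
    """Symmetric bounds mean - z*std and mean + z*std from reference values."""
    mean = statistics.fmean(reference)
    std = statistics.pstdev(reference)
    return mean - z * std, mean + z * std


def clipped_bounds(reference, z, lower_limit=0.0):
    """Same bounds, with the lower one clipped at a physical limit."""
    low, high = z_score_bounds(reference, z)
    return max(low, lower_limit), high


counts = [0, 1, 0, 2, 1, 0, 3, 1]  # e.g. number of events per time slice
low, high = z_score_bounds(counts, z=2.0)
clipped_low, clipped_high = clipped_bounds(counts, z=2.0)
print(low, high, clipped_low, clipped_high)
```

Clipping keeps the bounds within the valid range but leaves them asymmetric around the mean; bounds derived from a count distribution would be another option.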
enhancement statistics

opened by sbrugman 0
Reject unsupported column types

Running popmon on a DataFrame with columns containing mutable sequences, tuples or sets generates cryptic errors. popmon should return an error message.
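An up-front check along these lines could replace the cryptic errors with a clear message. The sketch works on a plain mapping of column names to values so that it stays dependency-free. The function name and the exact set of rejected types are assumptions, not popmon's behaviour.

```python
UNSUPPORTED_TYPES = (list, tuple, set, frozenset, dict, bytearray)


def check_columns(columns):
    """Raise TypeError if any column holds container values.

    `columns` maps a column name to an iterable of its values.
    """
    problems = {}
    for name, values in columns.items():
        kinds = sorted(
            {type(v).__name__ for v in values if isinstance(v, UNSUPPORTED_TYPES)}
        )
        if kinds:
            problems[name] = kinds
    if problems:
        details = "; ".join(
            f"{name!r} contains {', '.join(kinds)}" for name, kinds in problems.items()
        )
        raise TypeError(f"Unsupported column types: {details}")


# Scalar columns pass silently.
check_columns({"age": [31, 42], "gender": ["f", "m"]})

# Columns with lists or tuples are rejected with a readable message.
try:
    check_columns({"age": [31], "tags": [["a", "b"], ("c",)]})
    message = ""
except TypeError as err:
    message = str(err)
print(message)
```

With pandas, the same function could presumably be called as check_columns({c: df[c] for c in df.columns}), since iterating over a Series yields its values.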
enhancement good first issue API

opened by sbrugman 0

Releases(v1.4.0)

v1.4.0(Oct 19, 2022)
Feature

report: Summary table (a5b9a30)

v1.3.0(Sep 9, 2022)
Feature

Remove skip_empty_plots (bd3ea29)

Fix

comparisons: Unknown labels correct label axis (e0bbf04)

report: Show zero value in bar plots (6a350d4)

Documentation

readme: Fix table formatting (d6958c6)

readme: Include scipy presentation (695ee17)

v1.2.0(Sep 1, 2022)
Feature

report: Histogram inspector (5e78f98)

Remove time of day from label when midnight or noon (1615bed)

config: Deprecate skip_empty_plots (#249) (372ef85)

Fix

Show heatmap descriptions (64a4952)

Documentation

readme: Include histogram inspector in readme (3e68508)

v1.1.0(Aug 19, 2022)
Feature

Extension functionality + diptest implementation @RUrlus (8487991)

Documentation

config: Extend settings documentation (fa4d2fc)

api: Restructure api documentation for clarity (affdd75)

profiles: Profile extension (e859530)

comparisons: Comparison extensions section (581e63c)

Extensions in index (6c87fcb)

readme: Add section about profile integrations and diptest (aa860a7)

readme: Include citation information (#241) (3a29135)

v1.0.0(Jul 8, 2022)
Feature

report: Group comparisons by reference key (#237) (e813534)

Entropy profile (96dd7d1)

Configurable title and colors (37fcd0e)

Online report CDN for bootstrap, jquery (72df86d)

Plotly express (2c2395c)

Introduce self start reference type (ca22268)

String representation for base classes (bd480a6)

Keep section when changing features (7b12d06)

config: Settings required parameter (47d6b17)

registry: Generalize registry (d01c68a)

registry: Add ks, pearson, chi2 to registry (4f8126d)

config: Structured config using pydantic (bc52aeb)

Fix

Prevent plot name collision that stops rendering (#238) (32c7ef4)

Set time_width in synthetic data example (fbf7e41)

Guaranteed ordering of traffic light metrics (88c3f4a)

plot: Plot_heatmap_b64 top argument is now supported (8be8115)

Breaking

matplotlib-related config is removed (2c2395c)

Configuration of time_axis, features etc. is moved to the Settings class. (ca22268)

new configuration syntax (bc52aeb)

the plot_metrics and plot_overview settings are no longer available for Tl/alerts (1c4f072)

the worst entry will no long be present in the datastore (e2b9ef7)

Documentation

profiles: Add entropy (8e8ac00)

Registry snippet to list available profiles/comparisons (7760047)

comparisons: Overview in table (cde48c4)

Profiles in table (65459af)

Description for mean trend score (b5b7626)

Update api structure (c6c53ba)

Documentation for reference types (9cf8117)

Update api documentation (74eb223)

config: Configuration parameter docstrings (9eec883)

registry: Instructions on comparison parameter setting (2e3b134)

config: Update configuration examples (daccc36)

Performance

Reduce file size of reports (329564f)

v0.10.2(Jun 21, 2022)
Fix

Patched histogrammar bin_edges call for Bin histograms (590d266)

v0.10.1(Jun 15, 2022)
Fix

Patched histogrammar num_bins call for Bin histograms (b34fe70)

v0.10.0(Jun 14, 2022)
Feature

profiles: Custom profiles via registry pattern (d0eb98b)

Fix

Protection against outliers in sparse histograms (#215) (10c3449)

report: Traffic light flexbox on small screens (2faa7da)

Documentation

synthetic: Update synthetic examples (#212) (84a9331)

readme: Link profiles and comparisons page (17ac6d8)

profiles: List the available profiles (#173) (15f78ec)

v0.9.0(May 27, 2022)
Feature

report: Enable the overview section by default (22b9cb6)

report: Overview section for quickly navigating reports (f5736d4)

report: Allow section without feature toggle (2484569)

Fix

report: Consistent use of color red (453f3fe)

report: Text contrast and consistent yellow traffic light (5d5c43c)

Documentation

readme: Replace report image (8d363d5)

synthetic: Add dataset overview table (8654347)

datasets: Reference implementations for widely-used publicly available synthetic data streams (9988a13)

v0.8.0(May 20, 2022)
Feature

report: Heatmap time-series for categoricals (#194) (21c4ad1)

Nd histogram comparisons and profiles (d572f7f)

Dashboarding integration for Kibana (83b8869)

Dashboarding integration for Kibana (4a9284f)

config: Global configuration for ing_matplotlib_theme (c81e28f)

Fix

Import histogrammar specialized (d70ab80)

Documentation

config: Global configuration for ing_matplotlib_theme (6f4f20d)

Performance

Directly use bin keys (9440897)

Disable parallel processing by default (85d4407)

Chi2 max residual using numpy (8596387)

Performant data structure (ff72d6e)

Short circuit any/all (63c2704)

Optimize pull computation (ddf2e35)

Postpone formatting (expensive for DataFrames) (7feaaae)

Compute metrics without report (254564c)

v0.7.0(May 9, 2022)
Feature

Global configuration of joblib Parallel backend (3431cad)

Fix

Prevent numpy warnings (9ec3b66)

Documentation

config: Document global configuration (e546994)

readme: Extend articles section (0ec0273)

v0.6.1(Apr 30, 2022)
Fix

plot: Fixing memory leak in matplotlib multithreading (cc6c4e1)

Documentation

Include link to kedro-popmon (aff68b7)

v0.6.0(Dec 13, 2021)
Feature

comparisons: Introduce psi and jsd (c6a1ca7)

comparisons: Introduce comparison registry (031c146)

Documentation

comparisons: Add comparisons page (60967c9)

Fix broken link (2b38fc6)

rtd: Install popmon (e9c4610)

v0.5.0(Nov 24, 2021)
Feature

Improve pipeline visualization (bb09d73)

Fix

Ensure uniqueness of apply_funcs_key (ba98c97)

Documentation

rtd: Migrate config to v2 (a8d9f76)

Refresh notebooks (#151) (0bccc7e)

Pipeline visualizations in docs and notebooks (913bfb0)

Changelog md syntax (b187d36)

Specify requirements (e3f6b0a)

v0.4.4(Oct 22, 2021)

Release notes are available here.
v0.4.3(Oct 4, 2021)

Release notes are available here.
v0.4.2(Aug 26, 2021)

Release notes are available here.
v0.4.1(Jun 23, 2021)

Release notes are available here.
v0.4.0(Apr 16, 2021)

Release notes available here.
v0.3.14(Feb 8, 2021)
Pin histogrammar version for backwards compatibility

v0.3.13(Feb 4, 2021)
Spark 3.0 support (histogrammar update) (#87)

Improved documentation

Few minor package improvements

v0.3.12(Jan 21, 2021)
Add proper check of matrix invertibility of covariance matrix in stats/numpy.py

Add support for the Spark date type

Install Spark on Github Actions to be able to include spark tests in our CI/CD pipeline

Upgrade linting to use pre-commit (including pyupgrade for python3.6 syntax upgrades)

Add documentation on how to run popmon using spark on Google Colab (minimal example from scratch)

v0.3.11(Dec 7, 2020)
Features:

Traffic light overview (#62)

Documentation:

Downloads badge readme

List talks and articles in readme (#66)

Add image to README.rst (#64)

Other improvements:

Change notebook testing to pytest-notebook (previously these tests were skipped in CI). Add try-except ImportError for pyspark code. (#67)

Fix a few typos

Suppress the verbose "matplotlib backend" warning

Clicking on "popmon report" also scrolls to the top

Update HTML reports using Github Actions (#63)

Bugfix in hist.py that broke the advanced tutorial.

Notebooks:

Add %%capture to pip install inside of notebooks.

Make package install in notebooks work with paths with spaces.

Pickle doesn't work with tests (not really a popmon-specific feature anyway). Changed the notebook to fix the issue, left the code for reference.

v0.3.10(Oct 28, 2020)
Traffic light overview

Add image to README

Add building of examples to Github Actions CI

Format notebooks (nbqa)

Remove matplotlib backend warning

Fix navigation in title of report

v0.3.9(Sep 29, 2020)
Fix: refactor Bin creation and fix scipy version for pytest

Fix: dataset links in tutorial

Lint: isort 5, latest black version

Internal: simplification of weighted mean/std computation

v0.3.8(Jul 9, 2020)
Fixing automated PyPi deployment.

Removing enabling of unnecessary notebook extensions.

v0.3.7(Jul 8, 2020)
Using ING's matplotlib style for the report plots (orange plots).

Add popmon installation command at the beginning of example notebooks (seamless running).

v0.3.6(Jul 2, 2020)
Extending make.bat and Makefile to support make install (on all platforms).

Add a snippet on how to use popmon with Spark dataframes to the docs.

Update tutorial badges in the documentation.

Migrate to standard MIT license header.

v0.3.5(Jun 16, 2020)
Extended the make commands (added make install and make lint check=1 for check only).

Add license headers to source files.

v0.3.4(Jun 10, 2020)
Several improvements aimed at new users, such as updated landing page, documentation and notebooks

Performance improvement for weighted quantiles

Consistent code formatting using black and isort

Automatic release to PyPi on Github Release

Platform agnostic file handling

More informative exception messages


Owner

ING Bank

ING Open-source projects

GitHub https://popmon.readthedocs.io/
