Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

Jeremy Singer-Vine

Last update: Dec 31, 2022

weightedcalcs

weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more.

Features

Plays well with pandas.
Support for weighted means, medians, quantiles, standard deviations, and distributions.
Support for grouped calculations, using DataFrameGroupBy objects.
Raises an error when your data contains null values.
Full test coverage.

Installation

pip install weightedcalcs

Usage

Getting started

Every weighted calculation in weightedcalcs begins with an instance of the weightedcalcs.Calculator class. Calculator takes one argument: the name of your weighting variable. So if you're analyzing a survey where the weighting variable is called "resp_weight", you'd do this:

import weightedcalcs as wc
calc = wc.Calculator("resp_weight")

Types of calculations

Currently, weightedcalcs.Calculator supports the following calculations:

calc.mean(my_data, value_var): The weighted arithmetic average of value_var.
calc.quantile(my_data, value_var, q): The weighted quantile of value_var, where q is between 0 and 1.
calc.median(my_data, value_var): The weighted median of value_var, equivalent to .quantile(...) where q=0.5.
calc.std(my_data, value_var): The weighted standard deviation of value_var.
calc.distribution(my_data, value_var): The weighted proportions of value_var, interpreting value_var as categories.
calc.count(my_data): The weighted count of all observations, i.e., the total weight.
calc.sum(my_data, value_var): The weighted sum of value_var.

The my_data parameter above should be one of the following:

A pandas DataFrame object
A pandas DataFrameGroupBy object (the result of DataFrame.groupby)
A plain Python dictionary where the keys are column names and the values are equal-length lists.
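
To make the definitions above concrete, here is a minimal pure-Python sketch of what the weighted count, sum, mean, and distribution compute, using the dictionary-of-lists shape described above. The helper names and the toy data are illustrative assumptions. This is not the weightedcalcs implementation, and the library remains the better choice for real analysis.

```python
from collections import defaultdict

# Toy data in the "dictionary of equal-length lists" shape.
# Column names and values are made up for illustration.
my_data = {
    "resp_weight": [1.0, 3.0, 2.0, 4.0],
    "income": [10.0, 20.0, 30.0, 40.0],
    "status": ["a", "b", "a", "b"],
}


def weighted_count(data, weight_var):
    """Total weight across all observations."""
    return sum(data[weight_var])


def weighted_sum(data, weight_var, value_var):
    """Sum of each value multiplied by its weight."""
    return sum(w * v for w, v in zip(data[weight_var], data[value_var]))


def weighted_mean(data, weight_var, value_var):
    """Weighted sum divided by total weight."""
    return weighted_sum(data, weight_var, value_var) / weighted_count(data, weight_var)


def weighted_distribution(data, weight_var, value_var):
    """Share of total weight that falls in each category of value_var."""
    totals = defaultdict(float)
    for w, v in zip(data[weight_var], data[value_var]):
        totals[v] += w
    total = weighted_count(data, weight_var)
    return {category: t / total for category, t in totals.items()}


count = weighted_count(my_data, "resp_weight")
total_income = weighted_sum(my_data, "resp_weight", "income")
mean_income = weighted_mean(my_data, "resp_weight", "income")
status_shares = weighted_distribution(my_data, "resp_weight", "status")
```

With the toy weights above, the total weight is 10, the weighted sum of income is 290, and the weighted mean is therefore 29.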

Basic example

Below is a basic example of using weightedcalcs to find what percentage of Wyoming residents are married, divorced, et cetera:

import pandas as pd
import weightedcalcs as wc

# Load the 2015 American Community Survey person-level responses for Wyoming
responses = pd.read_csv("examples/data/acs-2015-pums-wy-simple.csv")

# `PWGTP` is the weighting variable used in the ACS's person-level data
calc = wc.Calculator("PWGTP")

# Get the distribution of marriage-status responses
calc.distribution(responses, "marriage_status").round(3).sort_values(ascending=False)

# -- Output --
# marriage_status
# Married                                0.425
# Never married or under 15 years old    0.421
# Divorced                               0.097
# Widowed                                0.046
# Separated                              0.012
# Name: PWGTP, dtype: float64

More examples

See the example notebook for other calculations, including grouped calculations.

Max Ghenis has created a version of the example notebook that can be run directly in your browser, via Google Colab.

Weightedcalcs in the wild

Other Python weighted-calculation libraries

Comments

Create MANIFEST.in

Hey-lo,

I'm building a version of weightedcalcs using conda for conda-forge. When possible, we try to include a link to the license file in the meta.yaml specification for the build; doing so requires the license be indexed in an explicit MANIFEST.in file so that it gets included in the source distribution.

This pull should add a MANIFEST.in that guarantees the license gets included, along with the change log and the Readme.

opened by pmlandwehr 7
FYI: Reproducible example notebook

I wanted to run through your example notebook, so I made this version of it, which instead uses the 2016 Survey of Consumer Finances. This has the advantage of being hosted directly by the government, so anyone can run the full thing as-is (it downloads the data), e.g. in Google Colab as the above link does.

This version also removes the adult limitation, which isn't relevant at the household level (all SCF household heads are 18+).

Feel free to link it if you'd like.

opened by MaxGhenis 1
Support list of quantiles (e.g. 25th and 75th percentiles)
Just found this package through pandas-dev/pandas#10030, really great stuff. I've put functions for working with weighted data in my microdf package, but I'm looking to remove them in favor of this and hopefully at some point native pandas support (MaxGhenis/microdf#55).

One feature I've found useful is the ability to pass a list of quantiles to quantile. Currently:

calc.quantile(adults, "income", [0.25, 0.75])

produces:

TypeError: '<' not supported between instances of 'list' and 'int'

This isn't that hard to do outside the library, e.g. with a list comprehension, so it might not be a high priority, but it could save some code if it's common:

[calc.quantile(adults, "income", x) for x in [0.25, 0.75]]
opened by MaxGhenis 1
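
For readers curious what a list-accepting quantile could look like, here is a minimal pure-Python sketch. It assumes one simple definition of a weighted quantile: the smallest value whose cumulative weight reaches q times the total weight. weightedcalcs may handle ties and interpolation differently, and the function names weighted_quantile and weighted_quantiles are hypothetical, not part of the library.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches q * total weight.

    This is one simple definition, chosen for illustration only.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    # Sort observations by value, carrying each weight along.
    pairs = sorted(zip(values, weights))
    threshold = q * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    # Fallback for floating-point rounding at q close to 1.
    return pairs[-1][0]


def weighted_quantiles(values, weights, qs):
    """Apply weighted_quantile to each q in a list of quantiles."""
    return [weighted_quantile(values, weights, q) for q in qs]


# Toy data, made up for illustration.
incomes = [10, 20, 30, 40]
weights = [1, 1, 1, 5]

quartiles = weighted_quantiles(incomes, weights, [0.25, 0.75])
median = weighted_quantile(incomes, weights, 0.5)
```

Because the last observation carries five of the eight units of weight, both the median and the 75th percentile land on 40 in this toy data, while the 25th percentile is 20.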
module 'pandas' has no attribute 'indexes'

Hi, I run into problems when I follow the example notebook.

The section on grouped weighted calculations fails:

calc.mean(grp_marriage_sex, "income").round().astype(int)

AttributeError                            Traceback (most recent call last)
 in ()
----> 1 calc.mean(grp_marriage_sex, "income").round().astype(int)

P:\Anaconda3\envs\charite\lib\site-packages\weightedcalcs\core.py in func_wrapper(self, thing, *args, **kwargs)
     20             agg = thing.apply(lambda x: func(self, x, *args, **kwargs))
     21             is_series = isinstance(agg, pd.core.series.Series)
---> 22             has_multiindex = isinstance(agg.index, pd.indexes.multi.MultiIndex)
     23             if is_series and has_multiindex:
     24                 return agg.unstack()

AttributeError: module 'pandas' has no attribute 'indexes'

I am using pandas '0.20.1' and weightedcalcs '0.1.1'.
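
The traceback above fails on pd.indexes.multi.MultiIndex, an internal module path that the pandas version in the report does not expose. One possible fix, sketched here on the assumption that the check only needs to detect a MultiIndex, is to use the public pd.MultiIndex and pd.Series names instead. The function name unstack_if_multiindex is hypothetical and is not part of weightedcalcs.

```python
import pandas as pd


def unstack_if_multiindex(agg):
    """Unstack a Series with a MultiIndex; return anything else unchanged.

    Uses the public pd.Series and pd.MultiIndex names rather than
    internal module paths, which can move between pandas releases.
    """
    is_series = isinstance(agg, pd.Series)
    has_multiindex = isinstance(agg.index, pd.MultiIndex)
    if is_series and has_multiindex:
        return agg.unstack()
    return agg


# Toy grouped result, made up for illustration.
index = pd.MultiIndex.from_tuples(
    [("Married", "F"), ("Married", "M"), ("Divorced", "F"), ("Divorced", "M")],
    names=["marriage_status", "sex"],
)
agg = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

wide = unstack_if_multiindex(agg)   # DataFrame: rows by status, columns by sex
flat = pd.Series([1.0, 2.0])
same = unstack_if_multiindex(flat)  # returned unchanged
```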

opened by eotp 1
Weighted distribution across multiple variables
Could be a useful addition to your library. As an example, I'm interested in getting stats on race and gender in a group over time. Something like:

data_by_year = data.groupby(['year'])
race_gender_demographics = calc.distribution(data_by_year, ['race', 'gender']).round(3)
opened by soooh 4
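
Until something like this exists in the library, the request above can be approximated in plain pandas. The sketch below sums weights per combination of grouping and value columns, then divides by each group's total weight. The function name grouped_joint_distribution, the column names, and the toy data are all assumptions made for illustration; none of this is weightedcalcs API.

```python
import pandas as pd

# Toy data, made up for illustration.
data = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016],
    "race": ["A", "A", "B", "A", "B"],
    "gender": ["F", "M", "F", "F", "M"],
    "weight": [1.0, 1.0, 2.0, 3.0, 1.0],
})


def grouped_joint_distribution(df, group_vars, value_vars, weight_var):
    """Weighted share of each value_vars combination within each group."""
    # Total weight in every (group, value combination) cell.
    cell_weights = df.groupby(group_vars + value_vars)[weight_var].sum()
    # Total weight of the group each cell belongs to, aligned to the cells.
    group_totals = cell_weights.groupby(level=group_vars).transform("sum")
    return cell_weights / group_totals


shares = grouped_joint_distribution(data, ["year"], ["race", "gender"], "weight")
```

The result is a Series indexed by year, race, and gender, and the shares within each year sum to 1.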
nice package!

we have had this open issue in main pandas for a while: https://github.com/pandas-dev/pandas/issues/10030

here is a prospective API (which is similar to what you did): https://github.com/pandas-dev/pandas/issues/15039

If you'd have a look, that would be great. I think what you did here would be a nice contribution if you are interested.

opened by jreback 1

Owner

Jeremy Singer-Vine

Human @ Internet • Data Editor @ BuzzFeed News • Newsletter-er @ data-is-plural.com

