Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

Overview

weightedcalcs

weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more.

Features

  • Plays well with pandas.
  • Support for weighted means, medians, quantiles, standard deviations, and distributions.
  • Support for grouped calculations, using DataFrameGroupBy objects.
  • Raises an error when your data contains null values.
  • Full test coverage.

Installation

pip install weightedcalcs

Usage

Getting started

Every weighted calculation in weightedcalcs begins with an instance of the weightedcalcs.Calculator class. Calculator takes one argument: the name of your weighting variable. So if you're analyzing a survey where the weighting variable is called "resp_weight", you'd do this:

import weightedcalcs as wc
calc = wc.Calculator("resp_weight")

Types of calculations

Currently, weightedcalcs.Calculator supports the following calculations:

  • calc.mean(my_data, value_var): The weighted arithmetic average of value_var.
  • calc.quantile(my_data, value_var, q): The weighted quantile of value_var, where q is between 0 and 1.
  • calc.median(my_data, value_var): The weighted median of value_var, equivalent to .quantile(...) where q=0.5.
  • calc.std(my_data, value_var): The weighted standard deviation of value_var.
  • calc.distribution(my_data, value_var): The weighted proportions of value_var, interpreting value_var as categories.
  • calc.count(my_data): The weighted count of all observations, i.e., the total weight.
  • calc.sum(my_data, value_var): The weighted sum of value_var.
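
To make the definitions above concrete, here is a minimal plain-Python sketch of the arithmetic behind the weighted count, sum, mean, and distribution. It is an illustration only, not weightedcalcs's implementation; the helper names and the toy data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: three respondents with survey weights.
values = [10.0, 20.0, 40.0]
weights = [1.0, 1.0, 2.0]
categories = ["a", "b", "a"]


def weighted_count(weights):
    """Weighted count of observations, i.e., the total weight."""
    return sum(weights)


def weighted_sum(values, weights):
    """Sum of each value multiplied by its weight."""
    return sum(v * w for v, w in zip(values, weights))


def weighted_mean(values, weights):
    """Weighted sum divided by the total weight."""
    return weighted_sum(values, weights) / weighted_count(weights)


def weighted_distribution(categories, weights):
    """Share of the total weight that falls in each category."""
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        totals[category] += weight
    total = weighted_count(weights)
    return {category: t / total for category, t in totals.items()}


print(weighted_count(weights))                     # 4.0
print(weighted_sum(values, weights))               # 110.0
print(weighted_mean(values, weights))              # 27.5
print(weighted_distribution(categories, weights))  # {'a': 0.75, 'b': 0.25}
```

Note how the respondent with weight 2 pulls the mean toward 40: the unweighted mean of the same values would be about 23.3.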

The my_data parameter above should be one of the following:

  • A pandas DataFrame object
  • A pandas DataFrameGroupBy object (the result of calling DataFrame.groupby)
  • A plain Python dictionary where the keys are column names and the values are equal-length lists.
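
The dictionary form is column-oriented: one key per column, with every list the same length. The sketch below uses that shape to illustrate one common definition of a weighted quantile (the smallest value whose cumulative share of the total weight reaches q). The function and data are hypothetical, and conventions for ties and interpolation vary, so weightedcalcs may return a different result in edge cases.

```python
# Hypothetical column-oriented data: keys are column names,
# values are equal-length lists.
my_data = {
    "income": [30, 10, 20, 40],
    "resp_weight": [1, 3, 1, 4],
}


def weighted_quantile(data, value_var, weight_var, q):
    """Return the smallest value whose cumulative weight share is >= q."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    # Pair each value with its weight and sort by value.
    pairs = sorted(zip(data[value_var], data[weight_var]))
    total = sum(weight for _, weight in pairs)
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total >= q:
            return value
    return pairs[-1][0]


# Sorted by income, the cumulative weight shares are 3/9, 4/9, 5/9, 9/9.
print(weighted_quantile(my_data, "income", "resp_weight", 0.25))  # 10
print(weighted_quantile(my_data, "income", "resp_weight", 0.5))   # 30
print(weighted_quantile(my_data, "income", "resp_weight", 0.75))  # 40
```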

Basic example

Below is a basic example of using weightedcalcs to find what percentage of Wyoming residents are married, divorced, et cetera:

import pandas as pd
import weightedcalcs as wc

# Load the 2015 American Community Survey person-level responses for Wyoming
responses = pd.read_csv("examples/data/acs-2015-pums-wy-simple.csv")

# `PWGTP` is the weighting variable used in the ACS's person-level data
calc = wc.Calculator("PWGTP")

# Get the distribution of marriage-status responses
calc.distribution(responses, "marriage_status").round(3).sort_values(ascending=False)

# -- Output --
# marriage_status
# Married                                0.425
# Never married or under 15 years old    0.421
# Divorced                               0.097
# Widowed                                0.046
# Separated                              0.012
# Name: PWGTP, dtype: float64

More examples

See the example notebook for other calculations, including grouped calculations.

Max Ghenis has created a version of the example notebook that can be run directly in your browser, via Google Colab.

Weightedcalcs in the wild

Other Python weighted-calculation libraries

Comments
  • Create MANIFEST.in

    Hey-lo,

    I'm building a version of weightedcalcs using conda for conda-forge. When possible, we try to include a link to the license file in the meta.yaml specification for the build; doing so requires the license be indexed in an explicit MANIFEST.in file so that it gets included in the source distribution.

    This pull should add a MANIFEST.in that guarantees the license gets included, along with the change log and the Readme.

    opened by pmlandwehr 7
  • FYI: Reproducible example notebook

    I wanted to run through your example notebook, so made this version of it which instead uses the 2016 Survey of Consumer Finances. This has the advantage of being hosted directly by the government, so anyone can run the full thing as-is (it downloads the data), e.g. in Google Colab as the above link does.

    This version also removes the adult limitation, which isn't relevant at the household level (all SCF household heads are 18+).

    Feel free to link it if you'd like.

    opened by MaxGhenis 1
  • Support list of quantiles (e.g. 25th and 75th percentiles)

    Just found this package through pandas-dev/pandas#10030, really great stuff. I've put functions for working with weighted data in my microdf package, but I'm looking to remove them in favor of this and, hopefully at some point, native pandas support (MaxGhenis/microdf#55).

    One feature I've found useful is the ability to pass a list of quantiles to quantile. Currently:

    calc.quantile(adults, "income", [0.25, 0.75])
    

    produces:

    TypeError: '<' not supported between instances of 'list' and 'int'

    This isn't that hard to do outside the library, e.g. with a list comprehension, so it might not be a high priority, but it could save some code if it's common:

    [calc.quantile(adults, "income", x) for x in [0.25, 0.75]]
    
    opened by MaxGhenis 1
  • module 'pandas' has no attribute 'indexes'

    Hi, I ran into problems when following the example notebook.

    The section on grouped weighted calculations fails:

    calc.mean(grp_marriage_sex, "income").round().astype(int)

    AttributeError Traceback (most recent call last)
    in ()
    ----> 1 calc.mean(grp_marriage_sex, "income").round().astype(int)

    P:\Anaconda3\envs\charite\lib\site-packages\weightedcalcs\core.py in func_wrapper(self, thing, *args, **kwargs)
         20 agg = thing.apply(lambda x: func(self, x, *args, **kwargs))
         21 is_series = isinstance(agg, pd.core.series.Series)
    ---> 22 has_multiindex = isinstance(agg.index, pd.indexes.multi.MultiIndex)
         23 if is_series and has_multiindex:
         24 return agg.unstack()

    AttributeError: module 'pandas' has no attribute 'indexes'

    I am using pandas '0.20.1' and weightedcalcs '0.1.1'.

    opened by eotp 1
  • Weighted distribution across multiple variables

    Could be a useful addition to your library. As an example, I'm interested in getting stats on race and gender in a group over time. Something like:

    data_by_year = data.groupby(['year'])
    race_gender_demographics = calc.distribution(data_by_year, ['race', 'gender']).round(3)
    
    opened by soooh 4
  • nice package!

    we have had this open issue in main pandas for a while: https://github.com/pandas-dev/pandas/issues/10030

    here is a prospective API (which is similar to what you did): https://github.com/pandas-dev/pandas/issues/15039

    If you could have a look, that would be great. I think what you did here would be a nice contribution if you are interested.

    opened by jreback 1
Owner
Jeremy Singer-Vine
Human @ Internet • Data Editor @ BuzzFeed News • Newsletter-er @ data-is-plural.com