An extension to the pandas DataFrame describe() function.

Overview

pandas_summary

The module contains a DataFrameSummary object that extends describe() with:

  • properties
    • dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
    • dfs.columns_types: a count of the types of columns
    • dfs[column]: a more in-depth summary of the column
  • function
    • summary(): extends the describe() function with the values from columns_stats
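
As a rough illustration of what columns_stats reports, the per-column counts can be approximated with plain pandas. This is only a sketch, not the library's implementation: the helper name basic_column_stats and the sample frame are made up for the example, and the type classification row is left out.

```python
import pandas as pd


def basic_column_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate counts, uniques, missing and missing_perc per column.

    Hypothetical helper for illustration; not part of pandas_summary.
    """
    counts = df.count()        # non-null values per column
    uniques = df.nunique()     # distinct non-null values per column
    missing = df.isna().sum()  # null values per column
    missing_perc = (missing / len(df) * 100).round(2)
    stats = pd.DataFrame(
        {
            "counts": counts,
            "uniques": uniques,
            "missing": missing,
            "missing_perc": missing_perc,
        }
    )
    # Transpose so that the original columns stay columns, one statistic per row.
    return stats.T


example = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["x", None, "x", "y"]})
stats = basic_column_stats(example)
print(stats)
```

The transposed layout mirrors the columns_stats table shown in the Usage section, with one statistic per row.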

Installation

The module can be easily installed with pip:

> pip install pandas-summary

This module depends on numpy and pandas. If matplotlib is installed, it can optionally produce some visualisations as well.

Tests

To run the tests, execute the command python setup.py test

Usage

The module contains one class:

DataFrameSummary

DataFrameSummary expects a pandas DataFrame to summarise.

from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)

Getting the column types

dfs.columns_types


numeric     9
bool        3
categorical 2
unique      1
date        1
constant    1
dtype: int64

Getting the column stats

dfs.columns_stats


                      A            B        C              D              E 
counts             5802         5794     5781           5781           4617   
uniques            5802            3     5771            128            121   
missing               0            8       21             21           1185   
missing_perc         0%        0.14%    0.36%          0.36%         20.42%   
types            unique  categorical  numeric        numeric        numeric 

Getting a single column summary, e.g. for a numerical column

# a column can also be accessed by position, e.g. dfs[1]
dfs['A']

std                                                                 0.2827146
max                                                                  1.072792
min                                                                         0
variance                                                           0.07992753
mean                                                                0.5548516
5%                                                                  0.1603367
25%                                                                 0.3199776
50%                                                                 0.4968588
75%                                                                 0.8274732
95%                                                                  1.011255
iqr                                                                 0.5074956
kurtosis                                                            -1.208469
skewness                                                            0.2679559
sum                                                                  3207.597
mad                                                                 0.2459508
cv                                                                  0.5095319
zeros_num                                                                  11
zeros_perc                                                               0,1%
deviating_of_mean                                                          21
deviating_of_mean_perc                                                  0.36%
deviating_of_median                                                        21
deviating_of_median_perc                                                0.36%
top_correlations                         {u'D': 0.702240243124, u'E': -0.663}
counts                                                                   5781
uniques                                                                  5771
missing                                                                    21
missing_perc                                                            0.36%
types                                                                 numeric
Name: A, dtype: object
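
Several of the derived rows can be checked by hand from the other rows. The sketch below assumes the usual definitions (iqr as the difference between the 75% and 25% quantiles, cv as std divided by mean, variance as std squared); the values are copied from the output above, and the definitions are an assumption rather than something taken from the library's source.

```python
import math

# Values copied from the column summary above.
std = 0.2827146
mean = 0.5548516
q25 = 0.3199776
q75 = 0.8274732

iqr = q75 - q25      # interquartile range
cv = std / mean      # coefficient of variation
variance = std ** 2  # variance as the square of the standard deviation

# Compare against the reported rows, allowing for rounding in the printout.
print(math.isclose(iqr, 0.5074956, rel_tol=1e-5))
print(math.isclose(cv, 0.5095319, rel_tol=1e-5))
print(math.isclose(variance, 0.07992753, rel_tol=1e-5))
```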

Future development

Summary analysis between columns, i.e. dfs[[1, 2]]

Comments
  • Python 3 support

    Right now it works only on Python 2 (at least due to print statements, I didn't check if there are any bigger differences). Do you plan to make it Python 3 compatible?

    opened by stared 4
  • API for get_type functions

    introduce @property methods for

    • get_constants
    • get_categoricals
    • get_numerics
    • get_uniques
    • get_bools

    import pandas as pd
    from pandas_summary import DataFrameSummary

    df = pd.DataFrame({
        'A': [1,2,3,4,5],
        'B': [0,1,1,1,0],
        'C': ['A', 'B', 'A', 'B', 'C'],
        'D': ['A', 'A', 'A', 'A', 'A'],
        'E': ['A', 'B', 'C', 'D', 'E']
    })
    
    dfs = DataFrameSummary(df)
    
    print(dfs.get_numerics)
    # Index(['A'], dtype='object')
    
    print(dfs.get_bools)
    # Index(['B'], dtype='object')
    
    print(dfs.get_categoricals)
    # Index(['C'], dtype='object')
    
    print(dfs.get_constants)
    # Index(['D'], dtype='object')
    
    print(dfs.get_uniques)
    # Index(['E'], dtype='object')
    
    opened by sizhky 3
  • pandas 0.23.0 change to types

    To address #10

    Added from pandas.api import types.

    Changed common.is_numeric_dtype and common.is_datetime64_dtype on lines 113 and 116 to the corresponding types functions to make things work with pandas 0.23.0.

    I am not sure whether from pandas.core import common is used anywhere else in your code, but a quick test with it removed doesn't seem to break anything.

    I didn't do extensive testing, but this does seem to get things working again.

    opened by notauni 3
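
The change described in this pull request relies on the public pandas.api.types module. A minimal sketch of those dtype checks follows; the sample frame and variable names are invented for illustration, and this is not the library's actual code.

```python
import pandas as pd
from pandas.api import types

# Hypothetical sample data with one numeric, one datetime and one text column.
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "when": pd.to_datetime(["2018-05-01", "2018-05-02", "2018-05-03"]),
        "label": ["a", "b", "c"],
    }
)

# Classify columns with the public dtype helpers.
numeric_cols = [c for c in df.columns if types.is_numeric_dtype(df[c])]
datetime_cols = [c for c in df.columns if types.is_datetime64_dtype(df[c])]

print(numeric_cols, datetime_cols)
```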
  • Merge this in pandas-profiling?

    Should we think about merging this with pandas-profiling? Effort would be fairly low, and I like the idea of creating a DataFrameSummary object that you can query and which returns information.

    opened by JosPolfliet 3
  • PyPI is considering 0.0.41 as latest version

    Hello,

    Thank you for all the work creating this project. It appears PyPI is considering 0.0.41 as the latest/default version instead of 0.0.5, therefore, 'pip install pandas-summary' is installing the old version.

    Ryan

    opened by rlshuhart 1
  • Add LICENSE to MANIFEST.in

    Could you please add the license to MANIFEST.in so that it will be included in sdists and other packages? This came up during packaging of pandas-summary in conda-forge.

    opened by proinsias 1
  • Add Mode to Numeric Statistic

    This pull request is to add mode to the numeric statistic summary for columns as it is useful to know the most frequent value for some numeric columns, such as Age, Price, etc.

    The Series implementation of mode() returns a series, hence the index reference - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mode.html#pandas.Series.mode

    Added mode to the testcase as well.

    Also updated a testcase where the summary was getting re-sorted when it was already sorted. All test cases now pass.

    opened by Ashton-Sidhu 0
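
The point about Series.mode() returning a Series can be seen in a small example. This is a sketch with made-up data, not the code from the pull request; taking the first entry is one possible way to reduce the result to a single value.

```python
import pandas as pd

# Hypothetical numeric column.
ages = pd.Series([31, 25, 31, 40, 25, 31])

# mode() returns a Series because several values can tie for most frequent;
# the result is sorted, so position 0 is the smallest of the tied values.
modes = ages.mode()
mode = modes.iloc[0] if not modes.empty else None

print(mode)
```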
  • Patch level went down?

    Hi, I am trying to get pandas-summary working, and had version 0.0.41 installed. I ran into the issue fixed by #11/#12, so I tried to upgrade. It seems like the version was changed to 0.0.5 with https://github.com/mouradmourafiq/pandas-summary/commit/42227d51d8d458ebce7090db971259565fb6ccdf

    When I try to upgrade to 0.0.5, pip picks up the 0.0.41 version since its patch level is considered greater than 0.0.5. I am somewhat new to the Python world, so I could be missing an easy way to do this, but wondering if the new version should be updated to 0.0.42 or 0.1.0 to better comply with semver.

    pip install pandas-summary==
    Collecting pandas-summary==
      Could not find a version that satisfies the requirement pandas-summary== (from versions: 0.0.3, 0.0.4, 0.0.5, 0.0.41)
    No matching distribution found for pandas-summary==
    

    For now my workaround is to use pip install pandas-summary==0.0.5 to install the exact version that I want to use.

    opened by panozzaj 0
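
The ordering reported above follows from comparing each dotted segment as an integer, so 41 ranks above 5. The sketch below is a simplified model of that comparison using only the standard library; the helper version_key is invented for the example, and pip's real ordering (PEP 440) also handles pre-releases and other suffixes that are ignored here.

```python
def version_key(version: str) -> tuple:
    """Split a dotted version such as '0.0.41' into a tuple of integers.

    Hypothetical helper; handles plain numeric release segments only.
    """
    return tuple(int(part) for part in version.split("."))


# Versions listed in the pip output above.
available = ["0.0.3", "0.0.4", "0.0.5", "0.0.41"]

# The highest version under numeric segment comparison.
latest = max(available, key=version_key)
print(latest)
```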
  • ValueError: supplied range of [0.0, inf] is not finite

    I get the error

    ValueError: supplied range of [0.0, inf] is not finite
    

    when doing

    DataFrameSummary(df)["OR"]
    

    on a column with inf values.

    Thanks for the library btw :)

    bug 
    opened by endrebak 0
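
One possible workaround, sketched here with plain pandas and numpy, is to treat infinite values as missing before building the summary. The sample data is made up, and whether DataFrameSummary then handles the column without error has not been verified here.

```python
import numpy as np
import pandas as pd

# Hypothetical column containing an infinite value.
df = pd.DataFrame({"OR": [0.5, 1.2, np.inf, 3.0]})

# Replace +inf and -inf with NaN so that downstream range computations
# only see finite values.
cleaned = df.replace([np.inf, -np.inf], np.nan)

print(cleaned)
```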
  • Use more than one thread

    I'd love to see these calculations split across multiple threads to reduce the wall-clock computation time, since I often deal with really large data sets.

    opened by proinsias 0
  • Show missing columns only

    Thanks for the awesome plugin!

    1. Is it possible to add colors to point out missing values, e.g. a light shade of red if missing is > 0?
    2. Is it possible to display only the columns with missing values? Sometimes a dataframe has a lot of columns and the user is mostly interested in the missing information.

    I am new to Python; if you can point me to where to look, I can create a pull request. Thank you.

    opened by upkarlidder 0
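
The second request can be approximated outside the library with plain pandas. The sketch below uses a made-up frame and is not an existing pandas_summary feature; it only shows the filtering idea.

```python
import pandas as pd

# Hypothetical frame where only some columns have missing values.
df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [1.0, None, 3.0],
        "c": ["x", None, None],
    }
)

# Count missing values per column, then keep only the columns that have any.
missing = df.isna().sum()
missing_only = missing[missing > 0]

print(missing_only)
```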
  • support for non-hashable objects

    If a column has non-hashable objects, creation of DataFrameSummary gives an error:

    TypeError: unhashable type: 'list'
    

    It would be great if it also worked for these entries (e.g. by looking at their length), or at the very least discarded these variables.

    opened by stared 2
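
Both suggestions from this issue can be sketched as a preprocessing step with plain pandas. The helper unhashable_columns and the sample data are invented for the example, and this is not part of pandas_summary. Note that the isinstance check is only a heuristic: a tuple that contains a list passes it even though hashing it would fail.

```python
from collections.abc import Hashable

import pandas as pd


def unhashable_columns(df: pd.DataFrame) -> list:
    """Return names of columns holding at least one unhashable value.

    Hypothetical helper for illustration only.
    """
    return [
        name
        for name in df.columns
        if not df[name].map(lambda value: isinstance(value, Hashable)).all()
    ]


# Hypothetical frame with a column of lists.
df = pd.DataFrame({"id": [1, 2, 3], "tags": [["a"], ["b", "c"], []]})

bad = unhashable_columns(df)

# Option 1: discard the offending columns.
dropped = df.drop(columns=bad)

# Option 2: replace each unhashable value by its length.
lengths = df.assign(**{name: df[name].map(len) for name in bad})

print(bad)
```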
Owner
Mourad
engineer, startup enthusiast, philosophy and music lover, coffeeholic... and more