Statistical package in Python based on Pandas

Raphael Vallat

Last update: Dec 31, 2022

Related tags

Data Analysis statistics statistical-methods pandas statistical-tests circular-statistics anova bayesian-statistics multiple-comparisons effect-size cohens-d correlations ttest

Overview

https://travis-ci.org/raphaelvallat/pingouin.svg?branch=master

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. For a full list of available functions, please refer to the API documentation.

ANOVAs: N-ways, repeated measures, mixed, ancova
Pairwise post-hocs tests (parametric and non-parametric) and pairwise correlations
Robust, partial, distance and repeated measures correlations
Linear/logistic regression and mediation analysis
Bayes Factors
Multivariate tests
Reliability and consistency
Effect sizes and power analysis
Parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient
Circular statistics
Chi-squared tests
Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation...

Pingouin is designed for users who want simple yet exhaustive statistical functions.

For example, the ttest_ind function of SciPy returns only the T-value and the p-value. By contrast, the ttest function of Pingouin returns the T-value, the p-value, the degrees of freedom, the effect size (Cohen's d), the 95% confidence intervals of the difference in means, the statistical power and the Bayes Factor (BF10) of the test.

Documentation

Link to documentation

Chat

If you have questions, please ask them in the public Gitter chat

Installation

Dependencies

The main dependencies of Pingouin are :

NumPy (>= 1.16.5)
SciPy (>= 1.3.0)
Pandas (>= 0.24)
Pandas-flavor (>= 0.1.2)
Matplotlib (>= 3.0.2)
Seaborn (>= 0.9.0)
Outdated

In addition, some functions require :

Pingouin is a Python 3 package and is currently tested for Python 3.6 and 3.7. Pingouin does not work with Python 2.7.

User installation

Pingouin can be easily installed using pip

pip install pingouin

or conda

conda install -c conda-forge pingouin

New releases are frequent so always make sure that you have the latest version:

pip install --upgrade pingouin

Quick start

Click on the link below and navigate to the notebooks/ folder to run a collection of interactive Jupyter notebooks showing the main functionalities of Pingouin. No need to install Pingouin beforehand, the notebooks run in a Binder environment.

10 minutes to Pingouin

1. T-test

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)

Output

T	dof	tail	p-val	CI95%	cohen-d	BF10	power
-3.401	58	two-sided	0.001	[-1.68 -0.43]	0.878	26.155	0.917

2. Pearson's correlation

pg.corr(x, y)

Output

n	r	CI95%	r2	adj_r2	p-val	BF10	power
30	0.595	[0.3 0.79]	0.354	0.306	0.001	69.723	0.95

3. Robust correlation

# Introduce an outlier
x[5] = 18
# Use the robust Shepherd's pi correlation
pg.corr(x, y, method="shepherd")

Output

n	outliers	r	CI95%	r2	adj_r2	p-val	power
30	1	0.561	[0.25 0.77]	0.315	0.264	0.002	0.917

4. Test the normality of the data

The pingouin.normality function works with lists, arrays, or pandas DataFrame in wide or long-format.

print(pg.normality(x))                                    # Univariate normality
print(pg.multivariate_normality(np.column_stack((x, y)))) # Multivariate normality

Output

W	pval	normal
0.615	0.000	False

(False, 0.00018)

5. One-way ANOVA using a pandas DataFrame

# Read an example dataset
df = pg.read_dataset('mixed_anova')

# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)

Output

Source	SS	DF	MS	F	p-unc	np2
Group	5.460	1	5.460	5.244	0.023	0.029
Within	185.343	178	1.041	nan	nan	nan

6. Repeated measures ANOVA

pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)

Output

Source	SS	DF	MS	F	p-unc	np2	eps
Time	7.628	2	3.814	3.913	0.023	0.062	0.999
Error	115.027	118	0.975	nan	nan	nan	nan

7. Post-hoc tests corrected for multiple-comparisons

# FDR-corrected post hocs with Hedges'g effect size
posthoc = pg.pairwise_ttests(data=df, dv='Scores', within='Time', subject='Subject',
                             parametric=True, padjust='fdr_bh', effsize='hedges')

# Pretty printing of table
pg.print_table(posthoc, floatfmt='.3f')

Output

Contrast	A	B	Paired	Parametric	T	dof	Tail	p-unc	p-corr	p-adjust	BF10	hedges
Time	August	January	True	True	-1.740	59.000	two-sided	0.087	0.131	fdr_bh	0.582	-0.328
Time	August	June	True	True	-2.743	59.000	two-sided	0.008	0.024	fdr_bh	4.232	-0.485
Time	January	June	True	True	-1.024	59.000	two-sided	0.310	0.310	fdr_bh	0.232	-0.170

8. Two-way mixed ANOVA

# Compute the two-way mixed ANOVA
aov = pg.mixed_anova(data=df, dv='Scores', between='Group', within='Time',
                     subject='Subject', correction=False, effsize="np2")
pg.print_table(aov)

Output

Source	SS	DF1	DF2	MS	F	p-unc	np2	eps
Group	5.460	1	58	5.460	5.052	0.028	0.080	nan
Time	7.628	2	116	3.814	4.027	0.020	0.065	0.999
Interaction	5.167	2	116	2.584	2.728	0.070	0.045	nan

9. Pairwise correlations between columns of a dataframe

import pandas as pd
np.random.seed(123)
z = np.random.normal(5, 1, 30)
data = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
pg.pairwise_corr(data, columns=['X', 'Y', 'Z'], method='pearson')

Output

X	Y	method	tail	n	r	CI95%	r2	adj_r2	z	p-unc	BF10	power
X	Y	pearson	two-sided	30	0.366	[0.01 0.64]	0.134	0.070	0.384	0.047	1.500	0.525
X	Z	pearson	two-sided	30	0.251	[-0.12 0.56]	0.063	-0.006	0.257	0.181	0.534	0.272
Y	Z	pearson	two-sided	30	0.020	[-0.34 0.38]	0.000	-0.074	0.020	0.916	0.228	0.051

10. Convert between effect sizes

# Convert from Cohen's d to Hedges' g
pg.convert_effsize(0.4, 'cohen', 'hedges', nx=10, ny=12)

0.384

11. Multiple linear regression

pg.linear_regression(data[['X', 'Z']], data['Y'])

Linear regression summary

names	coef	se	T	pval	r2	adj_r2	CI[2.5%]	CI[97.5%]
Intercept	4.650	0.841	5.530	0.000	0.139	0.076	2.925	6.376
X	0.143	0.068	2.089	0.046	0.139	0.076	0.003	0.283
Z	-0.069	0.167	-0.416	0.681	0.139	0.076	-0.412	0.273

12. Mediation analysis

pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)

Mediation summary

path	coef	se	pval	CI[2.5%]	CI[97.5%]	sig
Z ~ X	0.103	0.075	0.181	-0.051	0.256	No
Y ~ Z	0.018	0.171	0.916	-0.332	0.369	No
Total	0.136	0.065	0.047	0.002	0.269	Yes
Direct	0.143	0.068	0.046	0.003	0.283	Yes
Indirect	-0.007	0.025	0.898	-0.069	0.029	No

13. Contingency analysis

data = pg.read_dataset('chi2_independence')
expected, observed, stats = pg.chi2_independence(data, x='sex', y='target')
stats

Chi-squared tests summary

test	lambda	chi2	dof	cramer	power
pearson	1.000	22.717	1.000	0.274	0.997
cressie-read	0.667	22.931	1.000	0.275	0.998
log-likelihood	0.000	23.557	1.000	0.279	0.998
freeman-tukey	-0.500	24.220	1.000	0.283	0.998
mod-log-likelihood	-1.000	25.071	1.000	0.288	0.999
neyman	-2.000	27.458	1.000	0.301	0.999

Integration with Pandas

Several functions of Pingouin can be used directly as pandas DataFrame methods. Try for yourself with the code below:

import pingouin as pg

# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)

# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])

# Example 3 | Partial correlation matrix
data.pcorr()

The functions that are currently supported as pandas method are:

Development

Pingouin was created and is maintained by Raphael Vallat, mostly during his spare time. Contributions are more than welcome so feel free to contact me, open an issue or submit a pull request!

To see the code or report a bug, please visit the GitHub repository.

Note that this program is provided with NO WARRANTY OF ANY KIND. If you can, always double check the results with another statistical software.

Contributors

How to cite Pingouin?

If you want to cite Pingouin, please use the publication in JOSS:

Vallat, R. (2018). Pingouin: statistics in Python. Journal of Open Source Software, 3(31), 1026, https://doi.org/10.21105/joss.01026

Acknowledgement

Several functions of Pingouin were inspired from R or Matlab toolboxes, including:

Comments

ValueError with large dataset

I have a large dataset of 500k rows and 74 columns. Whenever I try to use a pairwise partial correlation I get the following error:

----> 1 data2.pairwise_corr(covar = 'SEX')

~/.local/lib/python3.7/site-packages/pandas_flavor/register.py in __call__(self, *args, **kwargs)
     27             @wraps(method)
     28             def __call__(self, *args, **kwargs):
---> 29                 return method(self._obj, *args, **kwargs)
     30 
     31         register_dataframe_accessor(method.__name__)(AccessorMethod)

~/.local/lib/python3.7/site-packages/pingouin/pairwise.py in pairwise_corr(data, columns, covar, tail, method, padjust, nan_policy)
   1229         else:
   1230             cor_st = partial_corr(data=data, x=col1, y=col2, covar=covar,
-> 1231                                   tail=tail, method=method)
   1232         cor_st_keys = cor_st.columns.tolist()
   1233 

~/.local/lib/python3.7/site-packages/pingouin/correlation.py in partial_corr(data, x, y, covar, x_covar, y_covar, tail, method, **kwargs)
    783         # PARTIAL CORRELATION
    784         cvar = np.atleast_2d(C[covar].to_numpy())
--> 785         beta_x = np.linalg.lstsq(cvar, C[x].to_numpy(), rcond=None)[0]
    786         beta_y = np.linalg.lstsq(cvar, C[y].to_numpy(), rcond=None)[0]
    787         res_x = C[x].to_numpy() - cvar @ beta_x

<__array_function__ internals> in lstsq(*args, **kwargs)

/curc/sw/anaconda3/2019.07/envs/jupyterlab2/lib/python3.7/site-packages/numpy/linalg/linalg.py in lstsq(a, b, rcond)
   2257         # lapack can't handle n_rhs = 0 - so allocate the array one larger in that axis
   2258         b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype)
-> 2259     x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
   2260     if m == 0:
   2261         x[...] = 0

ValueError: On entry to DLASCL parameter number 4 had an illegal value

This is even with 12 cores running, is there any way to resolve this or is the pairwise correlation with a covariate just not compatible with a large dataset? pcorr() and pairwise_corr() work as methods on the dataframe.

feature request :construction: invalid :triangular_flag_on_post: question :raising_hand:

opened by jackransomlovell 23

Unexpected results in semi-partial correlation with Spearman/Kendall
As a test, I generated correlated variables x and y. I then set x_test to x and ran a semi-partial correlation of xtest against y, controlling for the effect x on y. As one would expect, the pearson semi-partial correlation of this is zero. However, when applying Spearman and Kendall, the result is non-zero.

The Spearman and Kendall result was 0.14 and 0.11 respectively. Although it is low using the following dummy dataset, the data I am working with produces a very high semi-partial correlation around 0.8 which does not make sense.

df=pd.DataFrame({"y":[1,2,3,4,5,6,7,8,9,10],"x":[1,3,5,8,7,10,13,14,18,21]}) print(df.corr()) df['x_test']=df['x'] partial_corr(data=df,x='x_test',y='y',y_covar=['x'], method='spearman')['r'].values Out: y x y 1.000000 0.985318 x 0.985318 1.000000 array([0.13939394])
invalid :triangular_flag_on_post:
opened by nrcjea001 20
Shapiro-Wilk test should have option to return W statistic

Currently pingouin.normality() silently discards the W statistic returned by scipy.stats.shapiro(). I would love to have an option to retrieve these stats, as I need to present them together with the p values when drafting a scientific publication. I wonder if the function could not be adjusted to return a DataFrame with numerous information like most of the other pingouin functions?
feature request :construction:

opened by hoechenberger 16
Add axis title to plot_shift

The entire library is absolutely amazing; thank you for developing it. I came up with a minor improvement for the "Shift plot". No matter what data array we provide, this graphic shows X and Y. There may be disagreement about its use in academic papers. I believed that by defining X and Y, this conflict would be addressed. I created an edit for this where the user can configure the X and Y variables as well as the plot title. The labels being on the Y axis creates problems with the size of the figure, thus I located them as a legend in the plot.

I have ideas for other modules and would like to add them when I have the time. My initial contribution would be adding the odds ratio and CI-range values to the chi-sqr module.

Here my test code; plotting_test.zip

opened by turkalpmd 15

multiple regression: catch rank deficient design matrices and warn

closes #130

For now I opted to only return the coefficients in case a rank deficient design matrix is detected.

The basic error of pingouin is its inability to handle the residues return parameter of lstsq whenever the design matrix is rank deficient. In that case, residues is an empty array array([], dtype=float64) instead of a float.

One option apart from warning and simply returning the coefficients only could be to "revive" the currently commented line from line 413(old) / 422(new): # ss_res = (resid ** 2).sum()

Something like:

    # FIT (WEIGHTED) LEAST SQUARES REGRESSION USING SCIPY.LINALG.LSTST
    coef, ss_res, rank, _ = lstsq(Xw, yw)
    if coef_only:
        return coef
    calc_ss_res = False
    if rank < Xw.shape[1]:
        warnings.warn('Design matrix supplied with `X` parameter is rank '
                      f'deficient (rank {rank} with {Xw.shape[1]} columns. '
                      'That means that one or more of the columns in `X` '
                      'are a linear combination of one of more of the '
                      'other columns.'')
        calc_ss_res = True

    # Degrees of freedom
    df_model = rank - constant
    df_resid = n - p
    # Calculate predicted values and (weighted) residuals
    pred = Xw @ coef
    resid = yw - pred
    if calc_ss_res:
        ss_res = (resid ** 2).sum()

I played around with that, and it seems to be producing the same results as statsmodels. Try for yourself if you'd like:

import numpy as np 
import pingouin 
import statsmodels.api as sm 

# make up some random data and a design matrix for multiple linear regression 
n = 100  
y = np.random.randn(n) 
X = np.vstack([np.random.permutation([0, 1, 0, 0, 0]) for i in range(n)]) 

# try with pingouin (use patch from code example above)
results_pingouin = pingouin.linear_regression(X, y, add_intercept=True) 
print(results_pingouin)

# try with statsmodels
X_with_intercept = sm.add_constant(X) 
model = sm.OLS(endog=y, exog=X_with_intercept) 
results_sm = model.fit() 
print(results_sm.summary())

# compare residuals
np.allclose(results_pingouin.residuals_, results_sm.resid)  # True
np.allclose(results_pingouin.coef.to_numpy(), results_sm.params)  # True

Let me know what you think :-)

feature request :construction:

opened by sappelhoff 14

pairwise_corr doc updated and extended functionality to automatically control for all remaining covariates

I updated doc of pairwise_corr to reflect more clearly that a partial corr is only calculated, if covariates are specified.

Furthermore I implemented that combinations of columns, which are also present in covars, will not be dropped. Instead the x and y columns will be removed from covar only for this specific combination. This allows for automatically controlling for all other columns in the dataframe while calculating the pairwise partial correlation for all combinations.

If this means that there are not covars for a combination, an empty list will be returned for stats['covar'] and via partial_corr line 744 a standard non-partial corr will be returned. Is this clear enough for the user or should "empty" covar rows be dropped instead?

Closes #123

Some basic examples of the funcitonality (tests are included in the respective test file)

import numpy as np
import pandas as pd

import pingouin as pg

df = pd.DataFrame(
    data=np.random.rand(200, 4), columns=['a', 'b', 'c', 'd'])
df['e'] = df['a']**2
df['f'] = df['a'] * df['b']
df['g'] = df['c'] + np.random.randint(0, 5, 200)
df['h'] = np.sqrt(df['d'] + 5) - df['b']**3

# Get pairwise corr for all available covars per combination, also try nan policy
pwc = pg.pairwise_corr(data=df, covar=df.columns)
pwc_lwnan = pg.pairwise_corr(data=df, covar=df.columns, nan_policy='listwise')
# assert that both are equal
assert (
    pwc.drop(columns=['CI95%']) == pwc_lwnan.drop(columns=['CI95%'])
).all().all()

# Now the same with a MultiIndex:
mi = pd.MultiIndex.from_product((
    ('a', 'b', 'c', 'd'), ('u', 'v'),
))
df_mi = df.copy()
df_mi.columns = mi

pwc_mi = pg.pairwise_corr(data=df_mi, covar=df_mi.columns)
# Assert that the results are equal to non-multiindex tests
assert (
    pwc_mi.drop(
        columns=['X', 'Y', 'covar', 'CI95%']) == pwc.drop(
            columns=['X', 'Y', 'covar', 'CI95%'])
).all().all()
# and compare to nan policy listwise
pwc_mi_lwnan = pg.pairwise_corr(
    data=df_mi, covar=df_mi.columns, nan_policy='listwise'
)
assert (
    pwc_mi.drop(
        columns=['X', 'Y', 'covar', 'CI95%']) == pwc.drop(
            columns=['X', 'Y', 'covar', 'CI95%'])
).all().all()

And to show what I mean with "empty covars" (taken from test_pairwise.py):

from pingouin import pairwise_corr, read_dataset

data = read_dataset('pairwise_corr').iloc[:, 1:]
n = data.shape[0]
data['Age'] = np.random.randint(18, 65, n)
data['IQ'] = np.random.normal(105, 1, n)
data['One'] = 1
data['Gender'] = np.repeat(['M', 'F'], int(n / 2))

# see row 4 for empty covar
pwc_empty = pairwise_corr(data, columns='Neuroticism', covar='Age')
print(pwc_empty.covar)
# or fully empty covar
pwc_fully_empty = pairwise_corr(data, columns=['Neuroticism', 'Age'], covar='Age')

feature request :construction: docs/testing:book:

opened by JoElfner 13

Assertion failure (-1 <= r <= 1) in power.py when running rm_corr

Hi there, Thanks for providing this awesome library. I picked it up to run an rm_corr test on some data but I'm having issues.

I have 1000 individuals for each of whom I have 8 paired measurements. I have uploaded a pickled dataframe that causes the issue to GDrive (https://drive.google.com/open?id=1gkvGKBth2J8EhmhblsWRdK-eCDlk8zJk).

Here's a snapshot of what the dataframe looks like:

You can reproduce the issue by running:

df = pd.read_pickle('bad_df.pkl')
rm_corr(data=df, x='m2', y='m1', subject='id')

which yields the error

AssertionError                            Traceback (most recent call last)
<ipython-input-156-86b16d57ef4b> in <module>
----> 1 rm_corr(data=df, x='m2', y='m1', subject='id')

.../lib/python3.7/site-packages/pingouin/correlation.py in rm_corr(data, x, y, subject, tail)
    981     pval = pval * 0.5 if tail == 'one-sided' else pval
    982     ci = compute_esci(stat=rm, nx=n, eftype='pearson').tolist()
--> 983     pwr = power_corr(r=rm, n=n, tail=tail)
    984     # Convert to Dataframe
    985     stats = pd.DataFrame({"r": round(rm, 3), "dof": int(dof),

.../lib/python3.7/site-packages/pingouin/power.py in power_corr(r, n, power, alpha, tail)
    902     # Safety checks
    903     if r is not None:
--> 904         assert -1 <= r <= 1
    905         r = abs(r)
    906     if alpha is not None:

If I drop into a debugger I can see that r = 1.0000184067367273, perhaps this is due to numerical stability issues?

feature request :construction: invalid :triangular_flag_on_post:

opened by willprice 13

Partial correlations out of range when variance = 0
The following reproducer:

from pandas import DataFrame from random import seed, sample import pingouin as pg seed(0, 1) sample_size = 100 random_columns = 60 n = [0.0000001 for _ in range(sample_size)] df = DataFrame(dict( n1b=n, n2b=n, n3b=n, **{ f'x{i}': sample(range(2000), sample_size) for i in range(random_columns) } )) pg.pcorr(df).max().max()

gives a value of 6.754645467173336 with pingouin 0.4.0. I encountered this behaviour (giving out of range coefficients) in the wild when I accidentally forgot to remove variables with no variance.

To ensure that incorrect results like that don't get published, I would:

add a check on the data verifying that var() != 0 prior to running computations; not sure if this should raise or warn

add a check on the results; if any of the coefficients are out of range it would warn and provide name of the column those come from and a helpful message like "your data may have close-to-zero variance variables or be otherwise ill-conditioned".

It might be also worth investigating if there is a way to workaround these numerical issues, but the two points above should address the most urgent worries.
invalid :triangular_flag_on_post: URGENT :warning:
opened by krassowski 12
Skipped correlation gives different results than the Pernet implementation in Matlab
I think this line in correlation.py:

dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)

is functionally different compared to the Matlab implementation and the equation in Wilcox, R. (2004)
invalid :triangular_flag_on_post:
opened by adamnarai 11
adding option to return stats as float, rather than str

This PR is a work in progress.

I'm not sure whether there is a general guideline about returning statistics as formatted strings, rather than as float, but I find it very useful to use the raw float values, rather than get strings. This PR adds this functionality, at least to the bayesian module (not sure what the situation is in other modules).

This PR leaves the default behaviour of returning strings untouched. It could be argued that returning the raw float values should be default, and formatting left to the user in general (or some utility function perhaps). I'm interested to hear thoughts on this as well.

If you think in principle this PR is a good idea, I'll update and add proper documentation etc.
feature request :construction:

opened by Spaak 11

TestParametric::test_pandas fails with dtype mismatch on 32-bit platforms

I’m helping to update the python-pingouin package in Fedora Linux. One test, TestParametric::test_pandas, is failing on 32-bit platforms (i686 and armv7hl). I can’t easily tell if the actual problem here is in pingouin, pandas, or somewhere else (scipy, pandas-flavor, …).

=================================== FAILURES ===================================
__________________________ TestParametric.test_pandas __________________________
self = <pingouin.tests.test_pandas.TestParametric testMethod=test_pandas>
    def test_pandas(self):
        """Test pandas method.
        """
        # Test the ANOVA (Pandas)
        aov = df.anova(dv='Scores', between='Group', detailed=True)
        assert aov.equals(pg.anova(dv='Scores', between='Group', detailed=True,
                                   data=df))
        aov3_ss1 = df_aov3.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                 ss_type=1)
        aov3_ss2 = df_aov3.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                 ss_type=2)
        aov3_ss2_pg = pg.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                               data=df_aov3, ss_type=2)
        assert not aov3_ss1.equals(aov3_ss2)
        assert aov3_ss2.round(3).equals(aov3_ss2_pg.round(3))
    
        # Test the Welch ANOVA (Pandas)
        aov = df.welch_anova(dv='Scores', between='Group')
        assert aov.equals(pg.welch_anova(dv='Scores', between='Group',
                                         data=df))
    
        # Test the ANCOVA
        aov = df_anc.ancova(dv='Scores', covar='Income',
                            between='Method').round(3)
        assert aov.equals(pg.ancova(data=df_anc, dv='Scores', covar='Income',
                          between='Method').round(3))
    
        # Test the repeated measures ANOVA (Pandas)
        aov = df.rm_anova(dv='Scores', within='Time', subject='Subject',
                          detailed=True)
        assert aov.equals(pg.rm_anova(dv='Scores', within='Time',
                                      subject='Subject',
                                      detailed=True, data=df))
    
        # FDR-corrected post hocs with Hedges'g effect size
        ttests = df.pairwise_ttests(dv='Scores', within='Time',
                                    subject='Subject', padjust='fdr_bh',
                                    effsize='hedges')
        assert ttests.equals(pg.pairwise_ttests(dv='Scores', within='Time',
                                                subject='Subject',
                                                padjust='fdr_bh',
                                                effsize='hedges', data=df))
    
        # Pairwise Tukey
        tukey = df.pairwise_tukey(dv='Scores', between='Group')
        assert tukey.equals(pg.pairwise_tukey(data=df, dv='Scores',
                                              between='Group'))
    
        # Test two-way mixed ANOVA
        aov = df.mixed_anova(dv='Scores', between='Group', within='Time',
                             subject='Subject', correction=False)
        assert aov.equals(pg.mixed_anova(dv='Scores', between='Group',
                                         within='Time',
                                         subject='Subject', correction=False,
                                         data=df))
    
        # Test parwise correlations
        corrs = data.pairwise_corr(columns=['X', 'M', 'Y'], method='spearman')
        corrs2 = pg.pairwise_corr(data=data, columns=['X', 'M', 'Y'],
                                  method='spearman')
        assert corrs['r'].equals(corrs2['r'])
    
        # Test partial correlation
        corrs = data.partial_corr(x='X', y='Y', covar='M', method='spearman')
        corrs2 = pg.partial_corr(x='X', y='Y', covar='M', method='spearman',
                                 data=data)
        assert corrs['r'].equals(corrs2['r'])
    
        # Test partial correlation matrix (compare with the ppcor package)
        corrs = data.iloc[:, :5].pcorr().round(3)
        np.testing.assert_array_equal(corrs.iloc[0, :].to_numpy(),
                                      [1, 0.392, 0.06, -0.014, -0.149])
        # Now compare against Pingouin's own partial_corr function
        corrs = data[['X', 'Y', 'M']].pcorr()
        corrs2 = data.partial_corr(x='X', y='Y', covar='M')
        assert np.isclose(corrs.at['X', 'Y'], corrs2.at['pearson', 'r'])
    
        # Test rcorr (correlation matrix with p-values)
        # We compare against Pingouin pairwise_corr function
        corrs = df_corr.rcorr(padjust='holm', decimals=4)
        corrs2 = df_corr.pairwise_corr(padjust='holm').round(4)
        assert corrs.at['Neuroticism', 'Agreeableness'] == '*'
        assert (corrs.at['Agreeableness', 'Neuroticism'] ==
                str(corrs2.at[2, 'r']))
        corrs = df_corr.rcorr(padjust='holm', stars=False, decimals=4)
        assert (corrs.at['Neuroticism', 'Agreeableness'] ==
                str(corrs2.at[2, 'p-corr'].round(4)))
        corrs = df_corr.rcorr(upper='n', decimals=5)
        corrs2 = df_corr.pairwise_corr().round(5)
        assert corrs.at['Extraversion', 'Openness'] == corrs2.at[4, 'n']
        assert corrs.at['Openness', 'Extraversion'] == str(corrs2.at[4, 'r'])
        # Method = spearman does not work with Python 3.5 on Travis?
        # Instead it seems to return the Pearson correlation!
>       df_corr.rcorr(method='spearman')
aov        =         Source        SS  DF1  DF2  ...         F     p-unc       np2       eps
0        Group  5.459963    1   58  ......064929  0.998751
2  Interaction  5.167192    2  116  ...  2.727996  0.069545  0.044922       NaN
[3 rows x 9 columns]
aov3_ss1   =        Source         SS    DF        MS         F     p-unc       np2
0         Sex   3.362769   1.0  3.362769  3.461...62   2.0  0.561181  0.577636  0.564168  0.018007
3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
aov3_ss2   =        Source         SS    DF        MS         F     p-unc       np2
0         Sex   3.419789   1.0  3.419789  3.520...62   2.0  0.561181  0.577636  0.564168  0.018007
3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
aov3_ss2_pg =        Source         SS    DF        MS         F     p-unc       np2
0         Sex   3.419789   1.0  3.419789  3.520...62   2.0  0.561181  0.577636  0.564168  0.018007
3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
corrs      =                   Neuroticism Extraversion  ... Agreeableness Conscientiousness
Neuroticism                 -         ...              500
Conscientiousness    -0.36801      0.06459  ...       0.15867                 -
[5 rows x 5 columns]
corrs2     =                X                  Y   method  ...    p-unc       BF10    power
0    Neuroticism       Extraversion  pe...  0.059  0.06035
9  Agreeableness  Conscientiousness  pearson  ...  0.00037     31.243  0.94638
[10 rows x 10 columns]
self       = <pingouin.tests.test_pandas.TestParametric testMethod=test_pandas>
ttests     =   Contrast        A        B  Paired  ...    p-corr  p-adjust   BF10    hedges
0     Time   August  January    True  ....  4.232 -0.482547
2     Time  January     June    True  ...  0.310194    fdr_bh  0.232 -0.169520
[3 rows x 13 columns]
tukey      =          A           B   mean(A)  ...         T   p-tukey    hedges
0  Control  Meditation  5.567851  ... -2.289903  0.023202 -0.339918
[1 rows x 9 columns]
pingouin/tests/test_pandas.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.10/site-packages/pandas_flavor/register.py:29: in __call__
    return method(self._obj, *args, **kwargs)
        args       = ()
        kwargs     = {'method': 'spearman'}
        method     = <function rcorr at 0xbb3880b8>
        self       = <pandas_flavor.register.register_dataframe_method.<locals>.inner.<locals>.AccessorMethod object at 0xa8881f10>
pingouin/correlation.py:1050: in rcorr
    mat = self.corr(method=method).round(decimals)
        decimals   = 3
        ffp        = <function format_float_positional at 0xc3ab3028>
        method     = 'spearman'
        padjust    = None
        pearsonr   = <function pearsonr at 0xbbb59e38>
        pval_stars = {0.001: '***', 0.01: '**', 0.05: '*'}
        self       =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
0        2.47917       4.20833   3.93750   ...3            3.39583
499      2.54167       3.56250   3.14583        3.45833            2.89583
[500 rows x 5 columns]
        spearmanr  = <function spearmanr at 0xbbb5f388>
        stars      = True
        tif        = <function triu_indices_from at 0xc39efcd0>
        upper      = 'pval'
/usr/lib/python3.10/site-packages/pandas/core/frame.py:9376: in corr
    correl = libalgos.nancorr_spearman(mat, minp=min_periods)
        cols       = Index(['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
       'Conscientiousness'],
      dtype='object')
        idx        = Index(['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
       'Conscientiousness'],
      dtype='object')
        mat        = array([[2.47917, 4.20833, 3.9375 , 3.95833, 3.45833],
       [2.60417, 3.1875 , 3.95833, 3.39583, 3.22917],
       [2.... 3.     ],
       [2.08333, 3.66667, 3.58333, 3.45833, 3.39583],
       [2.54167, 3.5625 , 3.14583, 3.45833, 2.89583]])
        method     = 'spearman'
        min_periods = 1
        numeric_df =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
0        2.47917       4.20833   3.93750   ...3            3.39583
499      2.54167       3.56250   3.14583        3.45833            2.89583
[500 rows x 5 columns]
        self       =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
0        2.47917       4.20833   3.93750   ...3            3.39583
499      2.54167       3.56250   3.14583        3.45833            2.89583
[500 rows x 5 columns]
pandas/_libs/algos.pyx:415: in pandas._libs.algos.nancorr_spearman
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: Buffer dtype mismatch, expected 'const intp_t' but got 'long long'
pandas/_libs/algos.pyx:938: ValueError

On 32-bit Linux platforms, long long is normally 64-bit but intp_t is 32-bit, which explains what is happening but not why it is happening.

I can easily do any tests that are needed on these platforms to try to figure out where the root issue is, but some ideas regarding what to look for would be very helpful.

docs/testing:book:

opened by musicinmybrain 9

description in pg.pairwise_tests for non-parametrics

The pairwise_tests module has a return_desc option that is suitable for parametric data. However, if non-parametric data is being used, medians and IQRs can be returned by setting the parametric parameter to False. I am willing to implement this feature in the relevant file at the provided GitHub link if it is desired. code line:
feature request :construction:

opened by turkalpmd 2
Reporting Extra GG Corrected Values in rm_anova

I believe it would be a great addition to the package! Generally, when reporting GG corrected p-values, the corrected F value and ddofs are also reported. It is also a relatively easy addition. (...) if correction: (...)
corr_fval = f(corr_ddof1, corr_ddof2).ppf(1-p_corr)

then during the formation of dataframe: if not detailed: (....) aov["F"] = corr_fval aov["DF"] = [corr_ddof1,corr_ddof2] else: (...) aov["DF"] = [corr_ddof1,corr_ddof2] aov["F"] = [corr_fval,np.nan] (...)
feature request :construction:

opened by Esmerald1no 1
Add Z tests (proportions & means)
Hi,

1. Intro

As mentioned in discussion #296, it would be convenient to have support for both proportion and mean z tests within pingouin.

statsmodels already provides several methods around those tests but in a confused and dispersed way, so you often end up writing some wrapper. Moreover, those methods sometimes have strange signatures, making it not really straightforward to use. A single and richer pandas output would be far more helpful. We are talking about basics here: in my opinion it would strengthen pingouin's position as a user-friendly but powerful and complete statistical package.

2. Ressources

statsmodels

Proportions

z_stat and p_value: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportions_ztest.html

diff CI: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.confint_proportions_2indep.html

power: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.power_proportions_2indep.html

effect size: https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_effectsize.html

Means

z_stat and p_value: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html?highlight=ztest#statsmodels.stats.weightstats.ztest OR https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.ztest_ind.html#statsmodels.stats.weightstats.CompareMeans.ztest_ind

diff CI: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.zconfint.html?highlight=ztest OR https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.zconfint_diff.html#statsmodels.stats.weightstats.CompareMeans.zconfint_diff

R

Means

Paired Z test: https://rpubs.com/nguyenminhsang/paired_z-test

3. Feature

proportions_ztest()

parameters

x1: 2-column array_like, 1st column: number of trials, 2nd column: number of successes

x2: same or proportion value

alternative

paired

r

confidence

method: method for computing confidence interval, ‘newcomb’ (default), ‘wald’, ‘agresti-caffo’, ‘score’

returns

Z

alternative

p_val

CI95%: diff in proportion

cohen-d

BF10

power

means_ztest()

parameters

x1: array_like

x2: same or mean value

alternative

paired

r

confidence

returns

Z

alternative

p_val

CI95%: diff in means

cohen-d

BF10

power

Thanks! Aurélien
feature request :construction:
opened by aurel-p 1
Roadmap for release 0.6.0
The following issues should be addressed:

[x] #308 ❗

[ ] #153

[ ] #253

[ ] #208

[ ] #225

[ ] #118

The following PR should be merged:

[ ] #226

[x] #291

The following dependencies should be updated:

[x] Scikit-learn <1.1.2 (revert temporary fix in #278)

The following deprecation should be made:
help wanted :bell: IMPORTANT❗
opened by raphaelvallat 3
Support for Bayesian Credible Intervals
Hello, is there an implementation for Bayesian credible intervals as there is for frequentist confidence intervals?

https://en.wikipedia.org/wiki/Credible_interval

Helpful Literature:

https://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/

feature request :construction:
opened by filipmarkoski 1

Releases(v0.5.3)

v0.5.3(Dec 29, 2022)
This is a minor release with a few bugfixes, several improvements and one new function/pandas.DataFrame method. Read the changelog at https://pingouin-stats.org/changelog.html

What's Changed

Fix numerical stability issue in multivariate_normality by @gkanwar in https://github.com/raphaelvallat/pingouin/pull/292

Add new function for pairwise T-tests between columns of a dataframe (pingouin.ptests) by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/291

Handle single-sample comparsion in pairwise_test by @George3d6 in https://github.com/raphaelvallat/pingouin/pull/299

Change TestRegression class test methods to fix victim flakiness by @blazyy in https://github.com/raphaelvallat/pingouin/pull/303

Add aesthetic flexibility to plot_rm_corr by @remrama in https://github.com/raphaelvallat/pingouin/pull/312

Update distribution.py by @ALL-SPACE-Rob in https://github.com/raphaelvallat/pingouin/pull/310

Plotting seaborn.FacetGrid compatibility by @remrama in https://github.com/raphaelvallat/pingouin/pull/314

Use scikit-learn>=1.1.2 by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/300

Plot shift documentation PR by @turkalpmd in https://github.com/raphaelvallat/pingouin/pull/320

Fix pandas warning by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/323

Deal with small sample size in pingouin.normality when using long-format by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/324

Renamed 'r' with 'pointbiserialr' in convert_effsize by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/325

Exact calculation of effect sizes in pairwise_tukey and pairwise_gameshowell by @raphaelvallat in https://github.com/raphaelvallat/pingouin/pull/328

New Contributors

@gkanwar made their first contribution in https://github.com/raphaelvallat/pingouin/pull/292

@George3d6 made their first contribution in https://github.com/raphaelvallat/pingouin/pull/299

@blazyy made their first contribution in https://github.com/raphaelvallat/pingouin/pull/303

@remrama made their first contribution in https://github.com/raphaelvallat/pingouin/pull/312

@ALL-SPACE-Rob made their first contribution in https://github.com/raphaelvallat/pingouin/pull/310

@turkalpmd made their first contribution in https://github.com/raphaelvallat/pingouin/pull/320

Source code(tar.gz)
Source code(zip)
v0.5.2(Jun 24, 2022)

Bugfixes

a. The eta-squared (n2) effect size was not properly calculated in one-way and two-way repeated measures ANOVAs. Specifically, Pingouin followed the same behavior as JASP, i.e. the eta-squared was the same as the partial eta-squared. However, as explained in #251, this behavior is not valid. In one-way ANOVA design, the eta-squared should be equal to the generalized eta-squared. As of March 2022, this bug is also present in JASP. We have therefore updated the unit tests to use JAMOVI instead.

Please double check any effect sizes previously obtained with the pingouin.rm_anova function!

b. Fixed invalid resampling behavior for bivariate functions in pingouin.compute_bootci when x and y were not paired. #281 c. Fixed bug where confidence (previously ci) was ignored when calculating the bootstrapped confidence intervals in pingouin.plot_shift. #282

Enhancements

a. The pingouin.pairwise_ttests has been renamed to pingouin.pairwise_tests. Non-parametric tests are also supported in this function with the parametric=False argument, and thus the name "ttests" was misleading #209 b. Allow pingouin.bayesfactor_binom to take Beta alternative model. #252 c. Allow keyword arguments for logistic regression in pingouin.mediation_analysis. #245 d. Speed improvements for the Holm and FDR correction in pingouin.multicomp. #271 e. Speed improvements univariate functions in pingouin.compute_bootci (e.g. func="mean" is now vectorized). f. Rename eta to eta_squared in pingouin.power_anova andpingouin.power_rm_anova to avoid any confusion. #280 g. Add support for DataMatrix objects. #286 h. Use black for code formatting.
Source code(tar.gz)
Source code(zip)
v0.5.1(Feb 20, 2022)
Pingouin 0.5.1

This is a minor release, with several bugfixes and improvements. This release is compatible with SciPy 1.8 and Pandas 1.4.

Bugfixes

Added support for SciPy 1.8 and Pandas 1.4. https://github.com/raphaelvallat/pingouin/pull/234

Fixed bug where pingouin.rm_anova() and pingouin.mixed_anova() changed the dtypes of categorical columns in-place https://github.com/raphaelvallat/pingouin/issues/224

Enhancements

Faster implementation of pingouin.gzscore(), adding all options available in zscore: axis, ddof and nan_policy. Warning: this function is deprecated and will be removed in the next version of Pingouin (use scipy.stats.gzscore() instead). https://github.com/raphaelvallat/pingouin/pull/210.

Replace use of statsmodels’ studentized range distribution functions with more SciPy’s more accurate scipy.stats.studentized_range(). https://github.com/raphaelvallat/pingouin/pull/229.

Add support for optional keywords argument in the pingouin.homoscedasticity() function https://github.com/raphaelvallat/pingouin/issues/218

Add support for the Jarque-Bera test in pingouin.normality() https://github.com/raphaelvallat/pingouin/issues/216.

Lastly, we have also deprecated the Gitter forum in favor of GitHub Discussions. Please use Discussions to ask questions, share ideas / tips and engage with the Pingouin community!
Source code(tar.gz)
Source code(zip)
v0.5.0(Oct 28, 2021)

This is a major release with several important bugfixes. We recommend all users to upgrade to this new version.

See the full changelog at: https://pingouin-stats.org/changelog.html#v0-5-0-october-2021
Source code(tar.gz)
Source code(zip)
v0.4.0(Aug 13, 2021)

This is a major release with an important upgrade of the dependencies (requires Python 3.7+ and SciPy 1.7+), several enhancements in existing function and a new function to test the equality of covariance matrices (pingouin.box_m). We recommend all users to upgrade to the latest version of Pingouin.

See the full changelog at: https://pingouin-stats.org/changelog.html#v0-4-0-august-2021
Source code(tar.gz)
Source code(zip)
v0.3.12(May 27, 2021)

This release fixes a critical error in pingouin.partial_corr: the number of covariates was not taken into account when calculating the degrees of freedom of the partial correlation, thus leading to incorrect results (except for the correlation coefficient which remained unaffected). For more details, please see https://github.com/raphaelvallat/pingouin/issues/171.

For the full changelog, please see https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.11(Apr 14, 2021)

This is a minor release with several bug fixes in pingouin.corr. The full changelog can be found here.
Source code(tar.gz)
Source code(zip)
v0.3.10(Feb 16, 2021)

This release fixes an error in the calculation of the p-values in the pg.pairwise_tukey() and pg.pairwise_gameshowell() functions (https://github.com/raphaelvallat/pingouin/pull/156). Old versions of Pingouin used an incorrect algorithm for the studentized range approximation, which resulted in (slightly) incorrect p-values. In most cases, the error did not seem to affect the significance of the p-values. The new version of Pingouin uses statsmodels to estimate the p-values.
Source code(tar.gz)
Source code(zip)
v0.3.9(Jan 19, 2021)

See changelog at: https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.8(Sep 7, 2020)
Important bugfix in pingouin.ttest() in which the 95% confidence intervals for one-sample T-test with y != 0 were invalid.

Added an "options" module to control global rounding/display behavior.

Several enhancements / new features in existing functions.

See full changelog at: https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.7(Jul 29, 2020)

Hotfix release. See full changelog at: https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.6(Jul 2, 2020)

See full changelog at: https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.5(Jun 14, 2020)

Minor release. See full changelog at: https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.4(May 7, 2020)

See full changelog at https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.3.3(Feb 5, 2020)
Minor release:

Bugfixes

Fixed a bug in pingouin.pairwise_corr caused by the deprecation of pandas.core.index in the new version of Pandas (1.0). For now, both Pandas 0.25 and Pandas 1.0 are supported.

The standard deviation in pingouin.pairwise_ttests when using return_desc=True is now calculated with np.nanstd(ddof=1) to be consistent with Pingouin/Pandas default unbiased standard deviation.

New functions

Added the pingouin.plot_circmean function to plot the circular mean and circular vector length of a set of angles (in radians) on the unit circle. Note that this function is still in beta and some parameters may change without warnings in the next releases.

Source code(tar.gz)
Source code(zip)
v0.3.2(Jan 18, 2020)
Hotfix release to fix a critical issue with pingouin.pairwise_ttests() (see below). We strongly recommend that you update to the newest version of Pingouin and double-check your previous results if you’ve ever used the pairwise T-tests with more than one factor (e.g. mixed, factorial or 2-way repeated measures design).

Bugfixes

MAJOR: Fixed a bug in pingouin.pairwise_ttests() when using mixed or two-way repeated measures design. Specifically, the T-tests were performed without averaging over repeated measurements first (i.e. without calculating the marginal means). Note that for mixed design, this only impacts the between-subject T-test(s). Practically speaking, this led to higher degrees of freedom (because they were conflated with the number of repeated measurements) and ultimately incorrect T and p-values because the assumption of independence was violated. Pingouin now averages over repeated measurements in mixed and two-way repeated measures design, which is the same behavior as JASP or JAMOVI. As a consequence, and when the data has only two groups, the between-subject p-value of the pairwise T-test should be (almost) equal to the p-value of the same factor in the pingouin.mixed_anova() function. The old behavior of Pingouin can still be obtained using the marginal=False argument.

Minor: Added a check in pingouin.mixed_anova() to ensure that the subject variable has a unique set of values for each between-subject group defined in the between variable. For instance, the subject IDs for group1 are [1, 2, 3, 4, 5] and for group2 [6, 7, 8, 9, 10]. The function will throw an error if there are one or more overlapping subject IDs between groups (e.g. the subject IDs for group1 AND group2 are both [1, 2, 3, 4, 5]).

Minor: Fixed a bug which caused the pingouin.plot_rm_corr() and pingouin.ancova() (with >1 covariates) to throw an error if any of the input variables started with a number (because of statsmodels / Patsy formula formatting).

Enhancements

Upon loading, Pingouin will now use the outdated package to check and warn the user if a newer stable version is available.

Globally removed the export_filename parameter, which allowed to export the output table to a .csv file. This helps simplify the API and testing. As an alternative, one can simply use pandas.to_csv() to export the output dataframe generated by Pingouin.

Added the correction argument to pingouin.pairwise_ttests() to enable or disable Welch’s correction for independent T-tests.

Source code(tar.gz)
Source code(zip)
v0.3.1(Dec 3, 2019)
Minor release with some bugfixes

Fixed a bug in which missing values were removed from all columns in the dataframe in pingouin.kruskal(), even columns that were unrelated. See https://github.com/raphaelvallat/pingouin/issues/74.

The pingouin.power_corr() function now throws a warning and return a np.nan when the sample size is too low (and not an error like in previous version). This is to improve compatibility with the pingouin.pairwise_corr() function.

Fixed quantile direction in the pingouin.plot_shift() function. In v0.3.0, the quantile subplot was incorrectly labelled as Y - X, but it was in fact calculating X - Y. See https://github.com/raphaelvallat/pingouin/issues/73

Source code(tar.gz)
Source code(zip)
v0.3.0(Nov 14, 2019)
New functions

Added pingouin.plot_rm_corr() to plot a repeated measures correlation

Enhancements

Added the relimp argument to pingouin.linear_regression() to return the relative importance (= contribution) of each individual predictor to the R^2 of the full model.

Complete refactoring of pingouin.intraclass_corr() to closely match the R implementation in the psych package. Pingouin now returns the 6 types of ICC, together with F values, p-values, degrees of freedom and confidence intervals.

The pingouin.plot_shift() now 1) uses the Harrel-Davis robust quantile estimator in conjunction with a bias-corrected bootstrap confidence intervals, and 2) support paired samples.

Added the axis argument to pingouin.harrelldavis() to support 2D arrays.

Source code(tar.gz)
Source code(zip)
v0.2.9(Sep 4, 2019)

Minor release with mostly internal code refactoring.

See full changelog at: https://pingouin-stats.org/changelog.html#v0-2-9-september-2019
Source code(tar.gz)
Source code(zip)
v0.2.8(Jul 22, 2019)

See full changelog at: https://pingouin-stats.org/changelog.html#v0-2-8-july-2019
Source code(tar.gz)
Source code(zip)
v0.2.7(Jun 25, 2019)

This is a minor release, mainly to fix dependency issues between scipy and statsmodels.

Dependencies

a. Pingouin now requires statsmodels>=0.10.0 (latest release June 2019) and is compatible with SciPy 1.3.0.

Enhancements

a. Added support for long-format dataframe in pingouin.sphericity and pingouin.epsilon. b. Added support for two within-factors interaction in pingouin.sphericity and pingouin.epsilon (for the former, granted that at least one of them has no more than two levels.)

New functions

a. Added pingouin.power_rm_anova function.
Source code(tar.gz)
Source code(zip)
v0.2.6(Jun 3, 2019)
Bugfixes

Fixed ERROR in two-sided p-value for Wilcoxon test (pingouin.wilcoxon()), the p-values were accidentally squared, and therefore smaller. Make sure to always use the latest release of Pingouin.

pingouin.wilcoxon() now uses the continuity correction by default (the documentation was saying that the correction was applied but it was not applied in the code.)

The show_median argument of the pingouin.plot_shift() function was not working properly when the percentiles were different that the default parameters.

Dependencies

The current release of statsmodels (0.9.0) is not compatible with the newest release of Scipy (1.3.0). In order to avoid compatibility issues in the pingouin.ancova() and pingouin.anova() functions (which rely on statsmodels for certain cases), Pingouin will require SciPy < 1.3.0 until a new stable version of statsmodels is released.

New functions

Added pingouin.chi2_independence() tests.

Added pingouin.chi2_mcnemar() tests.

Added pingouin.power_chi2() function.

Added pingouin.bayesfactor_binom() function.

Enhancements

pingouin.linear_regression() now returns the residuals.

Completely rewrote pingouin.normality() function, which now support pandas DataFrame (wide & long format), multiple normality tests (scipy.stats.shapiro(), scipy.stats.normaltest()), and an automatic casewise removal of missing values.

Completely rewrote pingouin.homoscedasticity() function, which now support pandas DataFrame (wide & long format).

Faster and more accurate algorithm in pingouin.bayesfactor_pearson() (same algorithm as JASP).

Support for one-sided Bayes Factors in pingouin.bayesfactor_pearson().

Better handling of required parameters in pingouin.qqplot().

The epsilon value for the interaction term in pingouin.rm_anova() are now computed using the Greenhouse-Geisser method instead of the lower bound. A warning message has been added to the documentation to alert the user that the value might slightly differ than from R or JASP.

Contributors

Raphael Vallat

Arthur Paulino

Source code(tar.gz)
Source code(zip)
v0.2.5(Apr 29, 2019)
Major release with several bugfixes, new functions, and many internal improvements:

MAJOR BUG FIXES

Fixed error in p-values for one-sample one-sided T-test (pingouin.ttest()), the two-sided p-value was divided by 4 and not by 2, resulting in inaccurate (smaller) one-sided p-values.

Fixed global error for unbalanced two-way ANOVA (pingouin.anova()), the sums of squares were wrong, and as a consequence so were the F and p-values. In case of unbalanced design, Pingouin now computes a type II sums of squares via a call to the statsmodels package.

The epsilon factor for the interaction term in two-way repeated measures ANOVA (pingouin.rm_anova()) is now computed using the lower bound approach. This is more conservative than the Greenhouse-Geisser approach and therefore give (slightly) higher p-values. The reason for choosing this is that the Greenhouse-Geisser values for the interaction term differ than the ones returned by R and JASP. This will be hopefully fixed in future releases.

New functions

Added pingouin.multivariate_ttest() (Hotelling T-squared) test.

Added pingouin.cronbach_alpha() function.

Added pingouin.plot_shift() function.

Several functions of pandas can now be directly used as pandas.DataFrame methods.

Added pingouin.pcorr() method to compute the partial Pearson correlation matrix of a pandas.DataFrame (similar to the pcor function in the ppcor package).

The pingouin.partial_corr() now supports semi-partial correlation.

Enhancements

The pingouin.rm_corr() function now returns a pandas.DataFrame with the r-value, degrees of freedom, p-value, confidence intervals and power.

pingouin.compute_esci() now works for paired and one-sample Cohen d.

pingouin.bayesfactor_ttest() and pingouin.bayesfactor_pearson() now return a formatted str and not a float.

pingouin.pairwise_ttests() now returns the degrees of freedom (dof).

Better rounding of float in pingouin.pairwise_ttests().

Support for wide-format data in pingouin.rm_anova()

pingouin.ttest() now returns the confidence intervals around the T-values.

Missing values

pingouin.remove_na() and pingouin.remove_rm_na() are now external function documented in the API.

pingouin.remove_rm_na() now works with multiple within-factors.

pingouin.remove_na() now works with 2D arrays.

Removed the remove_na argument in pingouin.rm_anova() and pingouin.mixed_anova(), an automatic listwise deletion of missing values is applied (same behavior as JASP). Note that this was also the default behavior of Pingouin, but the user could also specify not to remove the missing values, which most likely returned inaccurate results.

The pingouin.ancova() function now applies an automatic listwise deletion of missing values.

Added remove_na argument (default = False) in pingouin.linear_regression() and pingouin.logistic_regression() functions

Missing values are automatically removed in the pingouin.anova() function.

Contributors

Raphael Vallat

Nicolas Legrand

Source code(tar.gz)
Source code(zip)
v0.2.4(Apr 5, 2019)
Major release with several new functions as well as many internal improvements.

Correlation

Added pingouin.distance_corr() (distance correlation) function.

pingouin.rm_corr() now requires at least 3 unique subjects (same behavior as the original R package).

The pingouin.pairwise_corr() is faster and returns the number of outlier if a robust correlation is used.

Added support for 2D level in the pingouin.pairwise_corr(). See Jupyter notebooks for examples.

Added support for partial correlation in the pingouin.pairwise_corr() function.

Greatly improved execution speed of pingouin.correlation.skipped() function.

Added default random state to compute the Min Covariance Determinant in the pingouin.correlation.skipped() function.

The default number of bootstrap samples for the pingouin.correlation.shepherd() function is now set to 200 (previously 2000) to increase computation speed.

pingouin.partial_corr() now automatically drops rows with missing values.

Datasets

Renamed pingouin.read_dataset() and pingouin.list_dataset() (before one needed to call these functions by calling pingouin.datasets)

Pairwise T-tests and multi-comparisons

Added support for non-parametric pairwise tests in pingouin.pairwise_ttests() function.

Common language effect size (CLES) is now reported by default in pingouin.pairwise_ttests() function.

CLES is now implemented in the pingouin.compute_effsize() function.

Better code, doc and testing for the functions in multicomp.py.

P-values adjustment methods now do not take into account NaN values (same behavior as the R function p.adjust)

Plotting

Added pingouin.plot_paired() function.

Regression

NaN are now automatically removed in pingouin.mediation_analysis().

The pingouin.linear_regression() and pingouin.logistic_regression() now fail if NaN / Inf are present in the target or predictors variables. The user must remove then before running these functions.

Added support for multiple parallel mediator in pingouin.mediation_analysis().

Added support for covariates in pingouin.mediation_analysis().

Added seed argument to pingouin.mediation_analysis() for reproducible results.

pingouin.mediation_analysis() now returns two-sided p-values computed with a permutation test.

Added pingouin.utils._perm_pval() to compute p-value from a permutation test.

Bugs and tests

Travis and AppVeyor test for Python 3.5, 3.6 and 3.7.

Better doctest & improved examples for many functions.

Fixed bug with pingouin.mad() when axis was not 0.

Source code(tar.gz)
Source code(zip)
v0.2.3(Feb 12, 2019)

See full changelog at: https://pingouin-stats.org/changelog.html
Source code(tar.gz)
Source code(zip)
v0.2.2(Dec 15, 2018)

See full changelog: https://pingouin-stats.org/changelog.html#v0-2-2-december-2018
Source code(tar.gz)
Source code(zip)
v0.2.1(Nov 20, 2018)
MINOR:

JOSS paper citation

Better confidence intervals

See full changelog: https://raphaelvallat.github.io/pingouin/build/html/changelog.html#v0-2-1-november-2018
Source code(tar.gz)
Source code(zip)
v0.2.0(Nov 19, 2018)

MAJOR changes in code and documentation, see full changelog here: https://raphaelvallat.github.io/pingouin/build/html/changelog.html#v0-2-0-november-2018
Source code(tar.gz)
Source code(zip)
v0.1.10(Oct 10, 2018)
Minor release:

Fixed dataset names in MANIFEST.in (.csv files were not copy-pasted with pip)

Added circ_vtest function

Added multivariate_normality function (Henze-Zirkler’s Multivariate Normality Test)

Renamed functions test_normality, test_sphericity and test_homoscedasticity to normality, sphericity and homoscedasticity to avoid bugs with pytest.

Moved distribution tests from parametric.py to distribution.py

Source code(tar.gz)
Source code(zip)
v0.1.9(Oct 5, 2018)
Added partial correlations

Minor improvements in docs and binder notebooks

Source code(tar.gz)
Source code(zip)