Python factor analysis library (PCA, CA, MCA, MFA, FAMD)

Overview


Prince is a library for doing factor analysis. This includes a variety of methods including principal component analysis (PCA) and correspondence analysis (CA). The goal is to provide an efficient implementation for each algorithm along with a scikit-learn API.

☝️ I made this package when I was a student at university. I have very little time to work on this now that I have a full-time job. Feel free to contribute and even take ownership if that sort of thing floats your boat. Thank you in advance for your understanding.

Installation

⚠️ Prince is only compatible with Python 3.

🐍 Although it isn't a requirement, using Anaconda is highly recommended.

Via PyPI

$ pip install prince

Via GitHub for the latest development version

$ pip install git+https://github.com/MaxHalford/Prince

Prince doesn't have any extra dependencies apart from the usual suspects (sklearn, pandas, matplotlib) which are included with Anaconda.

Usage

import numpy as np; np.random.seed(42)  # this is for doctest reproducibility

Guidelines

Each estimator provided by Prince extends scikit-learn's TransformerMixin. This means that each estimator implements a fit and a transform method, which makes them usable in a transformation pipeline. The transform method is an alias for the row_coordinates method, which returns the row principal coordinates. You can also access the column counterparts, for instance through the column_coordinates method of the CA and MCA classes.
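
Because of this fit/transform contract, Prince estimators can be dropped straight into a scikit-learn pipeline. Here is a minimal sketch; KMeans is just an arbitrary downstream estimator chosen for illustration, not something Prince requires:

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
import prince

# Reduce to two components, then cluster the projected rows
pipeline = make_pipeline(
    prince.PCA(n_components=2),
    KMeans(n_clusters=3)
)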

Under the hood Prince uses a randomised version of SVD. This is much faster than the more common full SVD approach. However, the results may have a small inherent randomness. For most applications this doesn't matter and you shouldn't have to worry about it. However, if you want reproducible results then you should set the random_state parameter.

The randomised version of SVD is an iterative method. Because each of Prince's algorithms uses SVD, they all possess an n_iter parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher n_iter is, the more precise the results will be. On the other hand, increasing n_iter increases the computation time. In general the algorithm converges very quickly, so using a low n_iter (which is the default behaviour) is recommended.
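
As a quick illustration of what random_state buys you, two estimators fitted with the same seed produce identical coordinates. This is a sketch; the dataset below is made up purely for demonstration:

import numpy as np
import pandas as pd
import prince

X = pd.DataFrame(np.random.rand(100, 4))  # hypothetical data

a = prince.PCA(n_components=2, n_iter=3, random_state=42).fit(X).transform(X)
b = prince.PCA(n_components=2, n_iter=3, random_state=42).fit(X).transform(X)
assert np.allclose(a, b)  # same seed, same results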

Use the method that matches your situation (a small helper sketch follows the list):

  • All your variables are numeric: use principal component analysis (prince.PCA)
  • You have a contingency table: use correspondence analysis (prince.CA)
  • You have more than 2 variables and they are all categorical: use multiple correspondence analysis (prince.MCA)
  • You have groups of categorical or numerical variables: use multiple factor analysis (prince.MFA)
  • You have both categorical and numerical variables: use factor analysis of mixed data (prince.FAMD)
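
Here are those rules expressed as code. pick_method is a hypothetical helper written for this README, not part of Prince; note that the CA case can't be inferred from dtypes alone, because whether a table is a contingency table is a matter of interpretation:

import pandas as pd

def pick_method(df, groups=None):
    # Hypothetical helper mapping the guidelines above to an estimator name.
    # A contingency table (prince.CA) can't be detected from dtypes alone.
    if groups is not None:
        return 'MFA'
    n_numeric = df.select_dtypes('number').shape[1]
    n_categorical = df.shape[1] - n_numeric
    if n_categorical == 0:
        return 'PCA'
    if n_numeric == 0:
        return 'MCA'
    return 'FAMD'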

The next subsections give an overview of each method along with usage information.

Principal component analysis (PCA)

If you're using PCA, it is assumed you have a dataframe consisting of continuous numerical variables. In this example we're going to use the Iris flower dataset.

>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets

>>> X, y = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(data=X, columns=['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
>>> y = pd.Series(y).map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'})
>>> X.head()
   Sepal length  Sepal width  Petal length  Petal width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

The PCA class implements scikit-learn's fit/transform API. Its parameters have to be passed at initialisation, before calling the fit method.

>>> pca = prince.PCA(
...     n_components=2,
...     n_iter=3,
...     rescale_with_mean=True,
...     rescale_with_std=True,
...     copy=True,
...     check_input=True,
...     engine='auto',
...     random_state=42
... )
>>> pca = pca.fit(X)

The available parameters are:

  • n_components: the number of components that are computed. You only need two if your intention is to make a chart.
  • n_iter: the number of iterations used for computing the SVD.
  • rescale_with_mean: whether to subtract each column's mean.
  • rescale_with_std: whether to divide each column by its standard deviation.
  • copy: if False then the computations will be done in place, which can have side-effects on the input data.
  • check_input: whether to check the consistency of the input data or not.
  • engine: which SVD engine to use (should be one of ['auto', 'fbpca', 'sklearn']).
  • random_state: controls the randomness of the SVD results.

Once the PCA has been fitted, it can be used to extract the row principal coordinates like so:

>>> pca.transform(X).head()  # same as pca.row_coordinates(X).head()
          0         1
0 -2.264703  0.480027
1 -2.080961 -0.674134
2 -2.364229 -0.341908
3 -2.299384 -0.597395
4 -2.389842  0.646835

Each column stands for a principal component whilst each row stands for a row in the original dataset. You can display these projections with the plot_row_coordinates method:

>>> ax = pca.plot_row_coordinates(
...     X,
...     ax=None,
...     figsize=(6, 6),
...     x_component=0,
...     y_component=1,
...     labels=None,
...     color_labels=y,
...     ellipse_outline=False,
...     ellipse_fill=True,
...     show_points=True
... )
>>> ax.get_figure().savefig('images/pca_row_coordinates.svg')

Each principal component explains part of the underlying variance of the data. You can see by how much by accessing the explained_inertia_ property:

>>> pca.explained_inertia_
array([0.72962445, 0.22850762])

The explained inertia represents the percentage of the inertia each principal component contributes. It sums up to 1 if the n_components property is equal to the number of columns in the original dataset. The explained inertia is obtained by dividing the eigenvalues obtained with the SVD by the total inertia, both of which are also accessible.

>>> pca.eigenvalues_
array([2.91849782, 0.91403047])

>>> pca.total_inertia_
4.000000...

>>> pca.explained_inertia_
array([0.72962445, 0.22850762])
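
The relationship between these three attributes can be checked directly:

>>> np.allclose(pca.eigenvalues_ / pca.total_inertia_, pca.explained_inertia_)
True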

You can also obtain the correlations between the original variables and the principal components.

>>> pca.column_correlations(X)
                     0         1
Petal length  0.991555  0.023415
Petal width   0.964979  0.064000
Sepal length  0.890169  0.360830
Sepal width  -0.460143  0.882716

You may also want to know how much each observation contributes to each principal component. This can be done with the row_contributions method.

>>> pca.row_contributions(X).head()
          0         1
0  1.757369  0.252098
1  1.483777  0.497200
2  1.915225  0.127896
3  1.811606  0.390447
4  1.956947  0.457748

You can also transform row projections back into their original space by using the inverse_transform method.

>>> pca.inverse_transform(pca.transform(X)).head()
          0         1         2         3
0  5.018949  3.514854  1.466013  0.251922
1  4.738463  3.030433  1.603913  0.272074
2  4.720130  3.196830  1.328961  0.167414
3  4.668436  3.086770  1.384170  0.182247
4  5.017093  3.596402  1.345411  0.206706
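
Because only n_components components are kept, the reconstruction is lossy. A hedged way to quantify the loss (standard practice, not a Prince-specific API) is the mean squared reconstruction error:

>>> X_hat = pca.inverse_transform(pca.transform(X))
>>> mse = float(np.square(X.to_numpy() - X_hat.to_numpy()).mean())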

Correspondence analysis (CA)

You should use correspondence analysis when you want to analyse a contingency table; in other words, when you want to analyse the dependencies between two categorical variables. The following example comes from section 17.2.3 of this textbook. It shows the number of occurrences between different hair and eye colors.

>>> import pandas as pd

>>> pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
>>> X = pd.DataFrame(
...    data=[
...        [326, 38, 241, 110, 3],
...        [688, 116, 584, 188, 4],
...        [343, 84, 909, 412, 26],
...        [98, 48, 403, 681, 85]
...    ],
...    columns=pd.Series(['Fair', 'Red', 'Medium', 'Dark', 'Black']),
...    index=pd.Series(['Blue', 'Light', 'Medium', 'Dark'])
... )
>>> X
        Fair  Red  Medium  Dark  Black
Blue     326   38     241   110      3
Light    688  116     584   188      4
Medium   343   84     909   412     26
Dark      98   48     403   681     85
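
If you start from raw observations rather than an already aggregated table, pandas can build the contingency table for you. A minimal sketch, where the raw DataFrame is hypothetical:

>>> raw = pd.DataFrame({
...     'Eye color': ['Blue', 'Dark', 'Blue'],
...     'Hair color': ['Fair', 'Black', 'Fair']
... })
>>> contingency = pd.crosstab(raw['Eye color'], raw['Hair color'])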

Unlike the PCA class, the CA class only exposes scikit-learn's fit method.

>>> import prince
>>> ca = prince.CA(
...     n_components=2,
...     n_iter=3,
...     copy=True,
...     check_input=True,
...     engine='auto',
...     random_state=42
... )
>>> X.columns.rename('Hair color', inplace=True)
>>> X.index.rename('Eye color', inplace=True)
>>> ca = ca.fit(X)

The parameters and methods overlap with those proposed by the PCA class.

>>> ca.row_coordinates(X)
               0         1
Blue   -0.400300 -0.165411
Light  -0.440708 -0.088463
Medium  0.033614  0.245002
Dark    0.702739 -0.133914

>>> ca.column_coordinates(X)
               0         1
Fair   -0.543995 -0.173844
Red    -0.233261 -0.048279
Medium -0.042024  0.208304
Dark    0.588709 -0.103950
Black   1.094388 -0.286437

You can plot both sets of principal coordinates with the plot_coordinates method.

>>> ax = ca.plot_coordinates(
...     X=X,
...     ax=None,
...     figsize=(6, 6),
...     x_component=0,
...     y_component=1,
...     show_row_labels=True,
...     show_col_labels=True
... )
>>> ax.get_figure().savefig('images/ca_coordinates.svg')

Like for the PCA you can access the inertia contribution of each principal component as well as the eigenvalues and the total inertia.

>>> ca.eigenvalues_
[0.199244..., 0.030086...]

>>> ca.total_inertia_
0.230191...

>>> ca.explained_inertia_
[0.865562..., 0.130703...]

Multiple correspondence analysis (MCA)

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is simply to compute the one-hot encoded version of a dataset and apply CA on it. As an example we're going to use the balloons dataset taken from the UCI datasets website.

>>> import pandas as pd

>>> X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
>>> X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
>>> X.head()
    Color   Size   Action    Age Inflated
0  YELLOW  SMALL  STRETCH  ADULT        T
1  YELLOW  SMALL  STRETCH  CHILD        F
2  YELLOW  SMALL      DIP  ADULT        F
3  YELLOW  SMALL      DIP  CHILD        F
4  YELLOW  LARGE  STRETCH  ADULT        T
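
As described above, MCA amounts to running CA on the one-hot encoded (indicator) version of the dataset. You can materialise that matrix with pandas to get a feel for what is being analysed; this is a sketch of the idea, not necessarily what Prince does internally verbatim:

>>> one_hot = pd.get_dummies(X)  # the indicator matrix that CA is applied to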

The MCA class also implements the fit and transform methods.

>>> import prince
>>> mca = prince.MCA(
...     n_components=2,
...     n_iter=3,
...     copy=True,
...     check_input=True,
...     engine='auto',
...     random_state=42
... )
>>> mca = mca.fit(X)

Like the CA class, the MCA class also has a plot_coordinates method.

>>> ax = mca.plot_coordinates(
...     X=X,
...     ax=None,
...     figsize=(6, 6),
...     show_row_points=True,
...     row_points_size=10,
...     show_row_labels=False,
...     show_column_points=True,
...     column_points_size=30,
...     show_column_labels=False,
...     legend_n_cols=1
... )
>>> ax.get_figure().savefig('images/mca_coordinates.svg')

The eigenvalues and inertia values are also accessible.

>>> mca.eigenvalues_
[0.401656..., 0.211111...]

>>> mca.total_inertia_
1.0

>>> mca.explained_inertia_
[0.401656..., 0.211111...]

Multiple factor analysis (MFA)

Multiple factor analysis (MFA) is meant to be used when you have groups of variables. In practice it builds a PCA on each group -- or an MCA, depending on the types of the group's variables. It then constructs a global PCA on the results of the so-called partial PCAs -- or MCAs. The dataset used in the following examples comes from this paper. In the dataset, three experts give their opinion on six different wines. Each opinion for each wine is recorded as a variable. We thus want to consider the separate opinions of each expert whilst also having a global overview of each wine. MFA is the perfect fit for this kind of situation.

First of all let's copy the data used in the paper.

>>> import pandas as pd

>>> X = pd.DataFrame(
...     data=[
...         [1, 6, 7, 2, 5, 7, 6, 3, 6, 7],
...         [5, 3, 2, 4, 4, 4, 2, 4, 4, 3],
...         [6, 1, 1, 5, 2, 1, 1, 7, 1, 1],
...         [7, 1, 2, 7, 2, 1, 2, 2, 2, 2],
...         [2, 5, 4, 3, 5, 6, 5, 2, 6, 6],
...         [3, 4, 4, 3, 5, 4, 5, 1, 7, 5]
...     ],
...     columns=['E1 fruity', 'E1 woody', 'E1 coffee',
...              'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
...              'E3 fruity', 'E3 butter', 'E3 woody'],
...     index=['Wine {}'.format(i+1) for i in range(6)]
... )
>>> X['Oak type'] = [1, 2, 2, 2, 1, 1]

The groups are passed as a dictionary to the MFA class.

>>> groups = {
...    'Expert #{}'.format(no+1): [c for c in X.columns if c.startswith('E{}'.format(no+1))]
...    for no in range(3)
... }
>>> import pprint
>>> pprint.pprint(groups)
{'Expert #1': ['E1 fruity', 'E1 woody', 'E1 coffee'],
 'Expert #2': ['E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody'],
 'Expert #3': ['E3 fruity', 'E3 butter', 'E3 woody']}

Now we can fit an MFA.

>>> import prince
>>> mfa = prince.MFA(
...     groups=groups,
...     n_components=2,
...     n_iter=3,
...     copy=True,
...     check_input=True,
...     engine='auto',
...     random_state=42
... )
>>> mfa = mfa.fit(X)

The MFA class inherits from the PCA class, which entails that you have access to all its methods and properties. The row_coordinates method will return the global coordinates of each wine.

>>> mfa.row_coordinates(X)
               0         1
Wine 1 -2.172155 -0.508596
Wine 2  0.557017 -0.197408
Wine 3  2.317663 -0.830259
Wine 4  1.832557  0.905046
Wine 5 -1.403787  0.054977
Wine 6 -1.131296  0.576241

Just like for the PCA you can plot the row coordinates with the plot_row_coordinates method.

>>> ax = mfa.plot_row_coordinates(
...     X,
...     ax=None,
...     figsize=(6, 6),
...     x_component=0,
...     y_component=1,
...     labels=X.index,
...     color_labels=['Oak type {}'.format(t) for t in X['Oak type']],
...     ellipse_outline=False,
...     ellipse_fill=True,
...     show_points=True
... )
>>> ax.get_figure().savefig('images/mfa_row_coordinates.svg')

You can also obtain the row coordinates inside each group. The partial_row_coordinates method returns a pandas.DataFrame where the set of columns is a pandas.MultiIndex. The first level of indexing corresponds to each specified group whilst the nested level indicates the coordinates inside each group.

>>> mfa.partial_row_coordinates(X)  # doctest: +NORMALIZE_WHITESPACE
  Expert #1           Expert #2           Expert #3
               0         1         0         1         0         1
Wine 1 -2.764432 -1.104812 -2.213928 -0.863519 -1.538106  0.442545
Wine 2  0.773034  0.298919  0.284247 -0.132135  0.613771 -0.759009
Wine 3  1.991398  0.805893  2.111508  0.499718  2.850084 -3.796390
Wine 4  1.981456  0.927187  2.393009  1.227146  1.123206  0.560803
Wine 5 -1.292834 -0.620661 -1.492114 -0.488088 -1.426414  1.273679
Wine 6 -0.688623 -0.306527 -1.082723 -0.243122 -1.622541  2.278372

Likewise, you can visualize the partial row coordinates with the plot_partial_row_coordinates method.

>>> ax = mfa.plot_partial_row_coordinates(
...     X,
...     ax=None,
...     figsize=(6, 6),
...     x_component=0,
...     y_component=1,
...     color_labels=['Oak type {}'.format(t) for t in X['Oak type']]
... )
>>> ax.get_figure().savefig('images/mfa_partial_row_coordinates.svg')

As usual you have access to inertia information.

>>> mfa.eigenvalues_
array([0.47246678, 0.05947651])

>>> mfa.total_inertia_
0.558834...

>>> mfa.explained_inertia_
array([0.84545097, 0.10642965])

You can also access information concerning each partial factor analysis via the partial_factor_analysis_ attribute.

>>> for name, fa in sorted(mfa.partial_factor_analysis_.items()):
...     print('{} eigenvalues: {}'.format(name, fa.eigenvalues_))
Expert #1 eigenvalues: [0.47709918 0.01997272]
Expert #2 eigenvalues: [0.60851399 0.03235984]
Expert #3 eigenvalues: [0.41341481 0.07353257]

The row_contributions method will provide you with the inertia contribution of each row with respect to each component.

>>> mfa.row_contributions(X)
               0         1
Wine 1  9.986433  4.349104
Wine 2  0.656699  0.655218
Wine 3 11.369187 11.589968
Wine 4  7.107942 13.771950
Wine 5  4.170915  0.050817
Wine 6  2.708824  5.582943

The column_correlations method will return the correlation between the original variables and the components.

>>> mfa.column_correlations(X)
                     0         1
E1 coffee    -0.918449 -0.043444
E1 fruity     0.968449  0.192294
E1 woody     -0.984442 -0.120198
E2 red fruit  0.887263  0.357632
E2 roasted   -0.955795  0.026039
E2 vanillin  -0.950629 -0.177883
E2 woody     -0.974649  0.127239
E3 butter    -0.945767  0.221441
E3 fruity     0.594649 -0.820777
E3 woody     -0.992337  0.029747

Factor analysis of mixed data (FAMD)

A description is on its way. This section is empty because I have to refactor the documentation a bit.

>>> import pandas as pd

>>> X = pd.DataFrame(
...     data=[
...         ['A', 'A', 'A', 2, 5, 7, 6, 3, 6, 7],
...         ['A', 'A', 'A', 4, 4, 4, 2, 4, 4, 3],
...         ['B', 'A', 'B', 5, 2, 1, 1, 7, 1, 1],
...         ['B', 'A', 'B', 7, 2, 1, 2, 2, 2, 2],
...         ['B', 'B', 'B', 3, 5, 6, 5, 2, 6, 6],
...         ['B', 'B', 'A', 3, 5, 4, 5, 1, 7, 5]
...     ],
...     columns=['E1 fruity', 'E1 woody', 'E1 coffee',
...              'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
...              'E3 fruity', 'E3 butter', 'E3 woody'],
...     index=['Wine {}'.format(i+1) for i in range(6)]
... )
>>> X['Oak type'] = [1, 2, 2, 2, 1, 1]
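
FAMD treats categorical and numerical columns differently, so it can be useful to see how the split falls out. Assuming standard pandas dtypes, select_dtypes shows which columns are treated as categorical:

>>> X.select_dtypes('object').columns.tolist()
['E1 fruity', 'E1 woody', 'E1 coffee']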

Now we can fit an FAMD.

>>> import prince
>>> famd = prince.FAMD(
...     n_components=2,
...     n_iter=3,
...     copy=True,
...     check_input=True,
...     engine='auto',
...     random_state=42
... )
>>> famd = famd.fit(X.drop('Oak type', axis='columns'))

The FAMD class inherits from the MFA class, which entails that you have access to all its methods and properties. The row_coordinates method will return the global coordinates of each wine.

>>> famd.row_coordinates(X)
               0         1
Wine 1 -1.488689 -1.002711
Wine 2 -0.449783 -1.354847
Wine 3  1.774255 -0.258528
Wine 4  1.565402  0.016484
Wine 5 -0.349655  1.516425
Wine 6 -1.051531  1.083178

Just like for the MFA you can plot the row coordinates with the plot_row_coordinates method.

>>> ax = famd.plot_row_coordinates(
...     X,
...     ax=None,
...     figsize=(6, 6),
...     x_component=0,
...     y_component=1,
...     labels=X.index,
...     color_labels=['Oak type {}'.format(t) for t in X['Oak type']],
...     ellipse_outline=False,
...     ellipse_fill=True,
...     show_points=True
... )
>>> ax.get_figure().savefig('images/famd_row_coordinates.svg')

Generalized procrustes analysis (GPA)

Generalized procrustes analysis (GPA) is a shape analysis tool that aligns and scales a set of shapes to a common reference. Here, the term "shape" means an ordered sequence of points. GPA iteratively (1) aligns each shape with a reference shape (usually the mean shape), (2) updates the reference shape, and (3) repeats until convergence.

Note that the final rotation of the aligned shapes may vary between runs, based on the initialization.
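
To make the alignment step concrete, here is a sketch of a single rotation-only alignment using scipy's orthogonal_procrustes. This illustrates the idea; it is not Prince's actual implementation, which also handles scaling and the iterative reference updates:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(shape, reference):
    # Centre both shapes, then rotate `shape` onto `reference`.
    shape = shape - shape.mean(axis=0)
    reference = reference - reference.mean(axis=0)
    R, _ = orthogonal_procrustes(shape, reference)
    return shape @ R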

Here is an example aligning a few right triangles:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(
    data=[
        [0, 0, 0, 0],
        [0, 2, 0, 1],
        [1, 0, 0, 2],
        [3, 2, 1, 0],
        [1, 2, 1, 1],
        [3, 3, 1, 2],
        [0, 0, 2, 0],
        [0, 4, 2, 1],
        [2, 0, 2, 2],
    ],
    columns=['x', 'y', 'shape', 'point']
).astype({'x': float, 'y': float})
fig, ax = plt.subplots()
sns.lineplot(
    data=df,
    x='x',
    y='y',
    hue='shape',
    style='shape',
    palette='Set2',
    markers=True,
    estimator=None,
    sort=False,
    ax=ax
    )
ax.axis('scaled')
fig.savefig('images/gpa_input_triangles.svg')

We need to convert the dataframe to a 3-D numpy array of size (shapes, points, dims). There are many ways to do this. Here, we use xarray as a helper package.

ds = df.set_index(['shape', 'point']).to_xarray()
da = ds.to_stacked_array('xy', ['shape', 'point'])
shapes = da.values

Now, we can align the shapes.

import prince
gpa = prince.GPA()
aligned_shapes = gpa.fit_transform(shapes)

We then convert the 3-D numpy array to a DataFrame (using xarray) for plotting.

da.values = aligned_shapes
df = da.to_unstacked_dataset('xy').to_dataframe().reset_index()
fig, ax = plt.subplots()
sns.lineplot(
    data=df,
    x='x',
    y='y',
    hue='shape',
    style='shape',
    palette='Set2',
    markers=True,
    estimator=None,
    sort=False,
    ax=ax
    )
ax.axis('scaled')
fig.savefig('images/gpa_aligned_triangles.svg')

The triangles were all the same shape, so they are now perfectly aligned.

Going faster

By default Prince uses sklearn's randomised SVD implementation (the one used under the hood by TruncatedSVD). One of the goals of Prince is to make it possible to use different SVD backends. For the time being, the only other supported backend is Facebook's randomised SVD implementation, called fbpca. You can use it by setting the engine parameter to 'fbpca':

>>> import prince
>>> pca = prince.PCA(engine='fbpca')

If you are using Anaconda then you should be able to install fbpca without any pain by running pip install fbpca.

License

The MIT License (MIT). Please see the license file for more information.
