Pandas integration with sklearn

Overview

Sklearn-pandas

This module provides a bridge between scikit-learn's machine learning methods and pandas-style DataFrames. In particular, it provides a way to map DataFrame columns to transformations, which are later recombined into features.

Installation

You can install sklearn-pandas with pip:

# pip install sklearn-pandas

or conda-forge:

# conda install -c conda-forge sklearn-pandas

Tests

The examples in this file double as basic sanity tests. To run them, use doctest, which is included with Python:

# python -m doctest README.rst

Usage

Import

Import what you need from the sklearn_pandas package. The choices are:

  • DataFrameMapper, a class for mapping pandas DataFrame columns to different sklearn transformations

For this demonstration, we will import it:

>>> from sklearn_pandas import DataFrameMapper

For these examples, we'll also use pandas, numpy, and sklearn:

>>> import pandas as pd
>>> import numpy as np
>>> import sklearn.preprocessing, sklearn.decomposition, \
...     sklearn.linear_model, sklearn.pipeline, sklearn.metrics
>>> from sklearn.feature_extraction.text import CountVectorizer

Load some Data

Normally you'll read the data from a file, but for demonstration purposes we'll create a data frame from a Python dict:

>>> data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
...                      'children': [4., 6, 3, 3, 2, 3, 5, 4],
...                      'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

Transformation Mapping

Map the Columns to Transformations

The mapper takes a list of tuples. The first element of each tuple is a column name from the pandas DataFrame, or a list containing one or multiple columns (we will see an example with multiple columns later). The second element is an object that performs the transformation to be applied to that column. The third element is optional and, if present, is a dictionary of transformation options (see "custom column names for transformed features" below).

Let's see an example:

>>> mapper = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer()),
...     (['children'], sklearn.preprocessing.StandardScaler())
... ])

The difference between specifying the column selector as 'column' (as a simple string) and ['column'] (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one-dimensional array is passed, while in the second case it is a two-dimensional array with one column, i.e. a column vector.

This behaviour mirrors the __getitem__ indexing of pandas DataFrames:

>>> data['children'].shape
(8,)
>>> data[['children']].shape
(8, 1)

Be aware that some transformers expect a 1-dimensional input (the label-oriented ones) while others, like OneHotEncoder or SimpleImputer, expect 2-dimensional input, with the shape [n_samples, n_features].
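
For instance, OneHotEncoder needs the list-style selector so that it receives a 2-dimensional input. A minimal sketch (assuming a scikit-learn version where the sparse constructor argument is still accepted):

>>> mapper_ohe = DataFrameMapper([
...     (['pet'], sklearn.preprocessing.OneHotEncoder(sparse=False))
... ])
>>> mapper_ohe.fit_transform(data.copy()).shape
(8, 3)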

Test the Transformation

We can use the fit_transform shortcut to both fit the model and see what transformed data looks like. In this and the other examples, output is rounded to two digits with np.round to account for rounding errors on different hardware:

>>> np.round(mapper.fit_transform(data.copy()), 2)
array([[ 1.  ,  0.  ,  0.  ,  0.21],
       [ 0.  ,  1.  ,  0.  ,  1.88],
       [ 0.  ,  1.  ,  0.  , -0.63],
       [ 0.  ,  0.  ,  1.  , -0.63],
       [ 1.  ,  0.  ,  0.  , -1.46],
       [ 0.  ,  1.  ,  0.  , -0.63],
       [ 1.  ,  0.  ,  0.  ,  1.04],
       [ 0.  ,  0.  ,  1.  ,  0.21]])

Note that the first three columns are the output of the LabelBinarizer (corresponding to cat, dog, and fish respectively) and the fourth column is the standardized value for the number of children. In general, the columns are ordered according to the order given when the DataFrameMapper is constructed.

Now that the transformation is trained, we confirm that it works on new data:

>>> sample = pd.DataFrame({'pet': ['cat'], 'children': [5.]})
>>> np.round(mapper.transform(sample), 2)
array([[1.  , 0.  , 0.  , 1.04]])

Output feature names

In certain cases, like when studying feature importances for some model, we want to be able to associate the original features with the ones generated by the DataFrame mapper. We can do so by inspecting the automatically generated transformed_names_ attribute of the mapper after transformation:

>>> mapper.transformed_names_
['pet_cat', 'pet_dog', 'pet_fish', 'children']

Custom column names for transformed features

We can provide a custom name for the transformed features, to be used instead of the automatically generated one, by specifying it as the third argument of the feature definition:

>>> mapper_alias = DataFrameMapper([
...     (['children'], sklearn.preprocessing.StandardScaler(),
...      {'alias': 'children_scaled'})
... ])
>>> _ = mapper_alias.fit_transform(data.copy())
>>> mapper_alias.transformed_names_
['children_scaled']

Alternatively, you can specify a prefix and/or a suffix to add to the column name. For example:

>>> mapper_alias = DataFrameMapper([
...     (['children'], sklearn.preprocessing.StandardScaler(), {'prefix': 'standard_scaled_'}),
...     (['children'], sklearn.preprocessing.StandardScaler(), {'suffix': '_raw'})
... ])
>>> _ = mapper_alias.fit_transform(data.copy())
>>> mapper_alias.transformed_names_
['standard_scaled_children', 'children_raw']

Passing Series/DataFrames to the transformers

By default the transformers are passed a numpy array of the selected columns as input. This is because sklearn transformers are historically designed to work with numpy arrays, not with pandas dataframes, even though their basic indexing interfaces are similar.

However, we can pass a DataFrame or Series to the transformers to handle custom cases by initializing the mapper with input_df=True:

>>> from sklearn.base import TransformerMixin
>>> class DateEncoder(TransformerMixin):
...     def fit(self, X, y=None):
...         return self
...
...     def transform(self, X):
...         # X is a pandas Series here thanks to input_df=True, so the
...         # .dt datetime accessor is available
...         dt = X.dt
...         return pd.concat([dt.year, dt.month, dt.day], axis=1)
>>> dates_df = pd.DataFrame(
...     {'dates': pd.date_range('2015-10-30', '2015-11-02')})
>>> mapper_dates = DataFrameMapper([
...     ('dates', DateEncoder())
... ], input_df=True)
>>> mapper_dates.fit_transform(dates_df)
array([[2015,   10,   30],
       [2015,   10,   31],
       [2015,   11,    1],
       [2015,   11,    2]])

We can also specify this option per group of columns instead of for the whole mapper:

>>> mapper_dates = DataFrameMapper([
...     ('dates', DateEncoder(), {'input_df': True})
... ])
>>> mapper_dates.fit_transform(dates_df)
array([[2015,   10,   30],
       [2015,   10,   31],
       [2015,   11,    1],
       [2015,   11,    2]])

Outputting a dataframe

By default the output of the DataFrame mapper is a numpy array, since most sklearn estimators expect a numpy array as input. If instead we want the output of the mapper to be a DataFrame, we can do so using the df_out parameter when creating the mapper:

>>> mapper_df = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer()),
...     (['children'], sklearn.preprocessing.StandardScaler())
... ], df_out=True)
>>> np.round(mapper_df.fit_transform(data.copy()), 2)
   pet_cat  pet_dog  pet_fish  children
0        1        0         0      0.21
1        0        1         0      1.88
2        0        1         0     -0.63
3        0        0         1     -0.63
4        1        0         0     -1.46
5        0        1         0     -0.63
6        1        0         0      1.04
7        0        0         1      0.21

The names for the columns are the same ones present in the transformed_names_ attribute.
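
As a quick sanity check (a minimal sketch, re-using the mapper defined just above):

>>> df_result = mapper_df.fit_transform(data.copy())
>>> list(df_result.columns) == mapper_df.transformed_names_
True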

Note this does not work together with the default=True or sparse=True arguments to the mapper.

Dropping columns explicitly

Sometimes it is necessary to drop a specific column or list of columns. For this purpose, use the drop_cols argument of DataFrameMapper. Its default value is None:

>>> mapper_df = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer()),
...     (['children'], sklearn.preprocessing.StandardScaler())
... ], drop_cols=['salary'])

Now running fit_transform will run transformations on 'pet' and 'children' and drop the 'salary' column:

>>> np.round(mapper_df.fit_transform(data.copy()), 1)
array([[ 1. ,  0. ,  0. ,  0.2],
       [ 0. ,  1. ,  0. ,  1.9],
       [ 0. ,  1. ,  0. , -0.6],
       [ 0. ,  0. ,  1. , -0.6],
       [ 1. ,  0. ,  0. , -1.5],
       [ 0. ,  1. ,  0. , -0.6],
       [ 1. ,  0. ,  0. ,  1. ],
       [ 0. ,  0. ,  1. ,  0.2]])

Transform Multiple Columns

Transformations may require multiple input columns. In these cases, the column names can be specified in a list:

>>> mapper2 = DataFrameMapper([
...     (['children', 'salary'], sklearn.decomposition.PCA(1))
... ])

Now running fit_transform will run PCA on the children and salary columns and return the first principal component:

>>> np.round(mapper2.fit_transform(data.copy()), 1)
array([[ 47.6],
       [-18.4],
       [  1.6],
       [-15.4],
       [-10.4],
       [ 16.6],
       [ -6.4],
       [-15.4]])

Multiple transformers for the same column

Multiple transformers can be applied to the same column by specifying them in a list:

>>> from sklearn.impute import SimpleImputer
>>> mapper3 = DataFrameMapper([
...     (['age'], [SimpleImputer(),
...                sklearn.preprocessing.StandardScaler()])])
>>> data_3 = pd.DataFrame({'age': [1, np.nan, 3]})
>>> mapper3.fit_transform(data_3)
array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487]])

Columns that don't need any transformation

Only columns that are listed in the DataFrameMapper are kept. To keep a column without applying any transformation to it, use None as the transformer:

>>> mapper3 = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer()),
...     ('children', None)
... ])
>>> np.round(mapper3.fit_transform(data.copy()))
array([[1., 0., 0., 4.],
       [0., 1., 0., 6.],
       [0., 1., 0., 3.],
       [0., 0., 1., 3.],
       [1., 0., 0., 2.],
       [0., 1., 0., 3.],
       [1., 0., 0., 5.],
       [0., 0., 1., 4.]])

Applying a default transformer

A default transformer can be applied to columns not explicitly selected by passing it as the default argument to the mapper:

>>> mapper4 = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer()),
...     ('children', None)
... ], default=sklearn.preprocessing.StandardScaler())
>>> np.round(mapper4.fit_transform(data.copy()), 1)
array([[ 1. ,  0. ,  0. ,  4. ,  2.3],
       [ 0. ,  1. ,  0. ,  6. , -0.9],
       [ 0. ,  1. ,  0. ,  3. ,  0.1],
       [ 0. ,  0. ,  1. ,  3. , -0.7],
       [ 1. ,  0. ,  0. ,  2. , -0.5],
       [ 0. ,  1. ,  0. ,  3. ,  0.8],
       [ 1. ,  0. ,  0. ,  5. , -0.3],
       [ 0. ,  0. ,  1. ,  4. , -0.7]])

Using default=False (the default) drops unselected columns. Using default=None passes the unselected columns through unchanged.
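
A minimal sketch of the pass-through behaviour: with default=None, the unselected children and salary columns are appended unchanged after the three binarized pet columns, giving five output columns in total:

>>> mapper_passthrough = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer())
... ], default=None)
>>> mapper_passthrough.fit_transform(data.copy()).shape
(8, 5)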

Same transformer for multiple columns

Sometimes it is necessary to apply the same transformation to several DataFrame columns. To simplify this process, the package provides the gen_features function, which accepts a list of columns and a feature transformer class (or list of classes) and generates a feature definition acceptable by DataFrameMapper.

For example, consider a dataset with three categorical columns, 'col1', 'col2', and 'col3'. To label-encode each of them, one could pass the column names and the LabelEncoder transformer class into the generator, and then use the returned definition as the features argument for DataFrameMapper:

>>> from sklearn_pandas import gen_features
>>> feature_def = gen_features(
...     columns=['col1', 'col2', 'col3'],
...     classes=[sklearn.preprocessing.LabelEncoder]
... )
>>> feature_def
[('col1', [LabelEncoder()], {}), ('col2', [LabelEncoder()], {}), ('col3', [LabelEncoder()], {})]
>>> mapper5 = DataFrameMapper(feature_def)
>>> data5 = pd.DataFrame({
...     'col1': ['yes', 'no', 'yes'],
...     'col2': [True, False, False],
...     'col3': ['one', 'two', 'three']
... })
>>> mapper5.fit_transform(data5)
array([[1, 1, 0],
       [0, 0, 2],
       [1, 0, 1]])
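
When a list of classes is given, each column receives the whole sequence of transformers. A minimal sketch (assuming the default constructor arguments are acceptable for both transformers):

>>> from sklearn.impute import SimpleImputer
>>> feature_def = gen_features(
...     columns=[['col1'], ['col2']],
...     classes=[SimpleImputer, sklearn.preprocessing.StandardScaler]
... )
>>> feature_def
[(['col1'], [SimpleImputer(), StandardScaler()], {}), (['col2'], [SimpleImputer(), StandardScaler()], {})]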

If it is necessary to override some of the transformer's parameters, then a dict with a 'class' key and the transformer parameters should be provided. For example, consider a dataset with missing values. The following code could then be used to override the default imputing strategy:

>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> feature_def = gen_features(
...     columns=[['col1'], ['col2'], ['col3']],
...     classes=[{'class': SimpleImputer, 'strategy':'most_frequent'}]
... )
>>> mapper6 = DataFrameMapper(feature_def)
>>> data6 = pd.DataFrame({
...     'col1': [np.nan, 1, 1, 2, 3],
...     'col2': [True, False, np.nan, np.nan, True],
...     'col3': [0, 0, 0, np.nan, np.nan]
... })
>>> mapper6.fit_transform(data6)
array([[1.0, True, 0.0],
       [1.0, False, 0.0],
       [1.0, True, 0.0],
       [2.0, True, 0.0],
       [3.0, True, 0.0]], dtype=object)

You can also specify a global prefix or suffix for the generated transformed column names using the prefix and suffix parameters:

>>> feature_def = gen_features(
...     columns=['col1', 'col2', 'col3'],
...     classes=[sklearn.preprocessing.LabelEncoder],
...     prefix="lblencoder_"
... )
>>> mapper5 = DataFrameMapper(feature_def)
>>> data5 = pd.DataFrame({
...     'col1': ['yes', 'no', 'yes'],
...     'col2': [True, False, False],
...     'col3': ['one', 'two', 'three']
... })
>>> _ = mapper5.fit_transform(data5)
>>> mapper5.transformed_names_
['lblencoder_col1', 'lblencoder_col2', 'lblencoder_col3']

Feature selection and other supervised transformations

DataFrameMapper supports transformers that require both X and y arguments. An example of this is feature selection. Treating the 'pet' column as the target, we will select the column that best predicts it.

>>> from sklearn.feature_selection import SelectKBest, chi2
>>> mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=1))])
>>> mapper_fs.fit_transform(data[['children','salary']], data['pet'])
array([[90.],
       [24.],
       [44.],
       [27.],
       [32.],
       [59.],
       [36.],
       [27.]])

Working with sparse features

A DataFrameMapper will return a dense feature array by default. Setting sparse=True in the mapper will return a sparse array whenever any of the extracted features is sparse. Example:

>>> mapper5 = DataFrameMapper([
...     ('pet', CountVectorizer()),
... ], sparse=True)
>>> type(mapper5.fit_transform(data))
<class 'scipy.sparse.csr.csr_matrix'>

The stacking of the sparse features is done without ever densifying them.
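
Conversely, with the default sparse=False the mapper densifies any sparse transformer output into a regular numpy array. A quick sketch re-using the same feature definition:

>>> mapper_dense = DataFrameMapper([
...     ('pet', CountVectorizer()),
... ])
>>> type(mapper_dense.fit_transform(data))
<class 'numpy.ndarray'>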

Using NumericalTransformer

While you can use FunctionTransformer to generate arbitrary transformers, it can present serialization issues when pickling. Use NumericalTransformer instead, which takes the function name as a string parameter and hence can be easily serialized. Note that NumericalTransformer is deprecated as of version 2.1.0 and will be removed in a future release (see the changelog below).

>>> from sklearn_pandas import NumericalTransformer
>>> mapper5 = DataFrameMapper([
...     ('children', NumericalTransformer('log')),
... ])
>>> mapper5.fit_transform(data)
array([[1.38629436],
       [1.79175947],
       [1.09861229],
       [1.09861229],
       [0.69314718],
       [1.09861229],
       [1.60943791],
       [1.38629436]])

Changing Logging level

You can set the log level to INFO to print the time taken to fit/transform each feature. Setting it to a higher level will stop printing the elapsed time. The example below shows how to change the logging level:

>>> import logging
>>> logging.getLogger('sklearn_pandas').setLevel(logging.INFO)
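
To silence the timing messages again, raise the level back, for example:

>>> logging.getLogger('sklearn_pandas').setLevel(logging.WARNING)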

Changelog
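
2.2.0 (2021-05-08)

  • The first element of the feature tuple can now be a string, a list, or a callable that selects columns dynamically. Check out the "dynamic column names" section of the project documentation for more information (https://github.com/scikit-learn-contrib/sklearn-pandas#dynamic-columns)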

2.1.0 (2021-02-26)

  • Removed test for Python 3.6 and added Python 3.9
  • Added deprecation warning for NumericalTransformer
  • Fixed pickling issue causing integration issues with Baikal.
  • Started publishing package to conda repo

2.0.4 (2020-11-06)

  • Explicitly handling serialization (#224)
  • Documentation fixes
  • Making transform function thread safe (#194)
  • Switched to nox for unit testing (#226)

2.0.3 (2020-11-06)

  • Added elapsed time information for each feature.

2.0.2 (2020-10-01)

  • Fix DataFrameMapper drop_cols attribute naming consistency with scikit-learn and initialization.

2.0.1 (2020-09-07)

  • Added an option to explicitly drop columns.

2.0.0 (2020-08-01)

  • Deprecated support for Python < 3.6.
  • Deprecated support for old versions of scikit-learn, pandas and numpy. Please check setup.py for minimum requirements.
  • Removed CategoricalImputer, cross_val_score and GridSearchCV. All of this functionality now exists in scikit-learn itself. Please use SimpleImputer instead of CategoricalImputer, and note that scikit-learn's cross-validation now supports DataFrames, so the wrappers previously provided here are no longer needed.
  • Added NumericalTransformer for common numerical transformations. Currently it implements log and log1p transformation.
  • Added prefix and suffix options. See examples above. These are usually helpful when using gen_features.
  • Added drop_cols argument to DataFrameMapper. This can be used to explicitly drop columns.

1.8.0 (2018-12-01)

  • Add FunctionTransformer class (#117).
  • Fix column names derivation for dataframes with multi-index or non-string columns (#166).
  • Change behaviour of DataFrameMapper's fit_transform method to invoke each underlying transformers' native fit_transform if implemented (#150).

1.7.0 (2018-08-15)

  • Fix issues with unicode names in get_names (#160).
  • Update to build using numpy==1.14 and python==3.6 (#154).
  • Add strategy and fill_value parameters to CategoricalImputer to allow imputing with values other than the mode (#144),(#161).
  • Preserve input data types when no transform is supplied (#138).

1.6.0 (2017-10-28)

  • Add column name to exception during fit/transform (#110).
  • Add gen_features helper function to help generate the same transformation for multiple columns (#126).

1.5.0 (2017-06-24)

  • Allow inputting a dataframe/series per group of columns.
  • Get feature names also from estimator.get_feature_names() if present.
  • Attempt to derive feature names from individual transformers when applying a list of transformers.
  • Do not mutate features in __init__ to be compatible with sklearn>=0.20 (#76).

1.4.0 (2017-05-13)

  • Allow specifying a custom name (alias) for transformed columns (#83).
  • Capture output columns generated names in transformed_names_ attribute (#78).
  • Add CategoricalImputer that replaces null-like values with the mode for string-like columns.
  • Add input_df init argument to allow inputting a dataframe/series to the transformers instead of a numpy array (#60).

1.3.0 (2017-01-21)

  • Make the mapper return dataframes when df_out=True (#70, #74).
  • Update imports to avoid deprecation warnings in sklearn 0.18 (#68).

1.2.0 (2016-10-02)

  • Deprecate custom cross-validation shim classes.
  • Require scikit-learn>=0.15.0. Resolves #49.
  • Allow applying a default transformer to columns not selected explicitly in the mapper. Resolves #55.
  • Allow specifying an optional y argument during transform for supervised transformations. Resolves #58.

1.1.0 (2015-12-06)

  • Delete obsolete PassThroughTransformer. If no transformation is desired for a given column, use None as transformer.
  • Factor out code in several modules, to avoid having everything in __init__.py.
  • Use custom TransformerPipeline class to allow transformation steps accepting only a X argument. Fixes #46.
  • Add compatibility shim for unpickling mappers with list of transformers created before 1.0.0. Fixes #45.

1.0.0 (2015-11-28)

  • Change version numbering scheme to SemVer.
  • Use sklearn.pipeline.Pipeline instead of copying its code. Resolves #43.
  • Raise KeyError when selecting nonexistent columns in the dataframe. Fixes #30.
  • Return sparse feature array if any of the features is sparse and sparse argument is True. Defaults to False to avoid potential breaking of existing code. Resolves #34.
  • Return model and prediction in custom CV classes. Fixes #27.

0.0.12 (2015-11-07)

  • Allow specifying a list of transformers to use sequentially on the same column.

Credits

The code for DataFrameMapper is based on code originally written by Ben Hamner.

Other contributors:

  • Ariel Rossanigo (@arielrossanigo)
  • Arnau Gil Amat (@arnau126)
  • Assaf Ben-David (@AssafBenDavid)
  • Brendan Herger (@bjherger)
  • Cal Paterson (@calpaterson)
  • @defvorfu
  • Floris Hoogenboom (@FlorisHoogenboom)
  • Gustavo Sena Mafra (@gsmafra)
  • Israel Saeta Pérez (@dukebody)
  • Jeremy Howard (@jph00)
  • Jimmy Wan (@jimmywan)
  • Kristof Van Engeland (@kristofve91)
  • Olivier Grisel (@ogrisel)
  • Paul Butler (@paulgb)
  • Richard Miller (@rwjmiller)
  • Ritesh Agrawal (@ragrawal)
  • @SandroCasagrande
  • Timothy Sweetser (@hacktuarial)
  • Vitaley Zaretskey (@vzaretsk)
  • Zac Stewart (@zacstewart)
  • Parul Singh (@paro1234)
  • Vincent Heusinkveld (@VHeusinkveld)