scikit-learn addon to operate on set/"group"-based features

Danica J. Sutherland

Last update: Apr 6, 2022

Overview

skl-groups

skl-groups is a package to perform machine learning on sets (or "groups") of features in Python. It extends the scikit-learn library with support for either transforming sets into feature vectors that can be operated on with standard scikit-learn constructs or obtaining pairwise similarity/etc matrices that can be turned into kernels for use in scikit-learn.
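As a rough illustration of the first approach (turning sets into fixed-length feature vectors), the sketch below uses only NumPy rather than skl-groups itself. The names `bags` and `bag_means` are made up for this example, and the per-set mean is just one simple summary chosen for illustration; it is not meant to represent the package's own API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six "bags": each is an (n_points, 3) array with its own number of points.
sizes = [5, 8, 12, 7, 9, 6]
bags = [rng.normal(loc=i % 2, size=(n, 3)) for i, n in enumerate(sizes)]

def bag_means(bags):
    """Summarize each bag by the mean of its points (one row per bag)."""
    return np.vstack([bag.mean(axis=0) for bag in bags])

# One fixed-length row per bag, usable by any standard estimator.
X = bag_means(bags)
```

The second approach replaces this summary step with a matrix of pairwise divergences or similarities between bags, which is then converted into a kernel.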

For an introduction to the package, why you might want to use it, and how to do so, check out the documentation.

skl-groups is still in fairly early development. The precursor package, py-sdm, is still somewhat easier to use for some tasks (though it has less functionality and less documentation); skl-groups will hopefully match it in the next few weeks. Feel free to get in touch ([email protected]) if you're interested.

Installation

Full instructions are in the documentation, but the short version is to do:

$ conda install -c dougal -c r skl-groups

if you use conda, or:

$ pip install skl-groups

if not. If you pip install and want to use the kNN divergence estimator, you'll need to install either cyflann or the regular pyflann bindings to FLANN, and you'll want a version of FLANN with OpenMP support.
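A quick way to see which FLANN bindings are importable in the current environment is sketched below. It assumes the bindings are importable under the module names `cyflann` and `pyflann`; the helper name `available_flann_bindings` is made up for this example, and the check says nothing about whether FLANN was built with OpenMP.

```python
import importlib.util

def available_flann_bindings(candidates=("cyflann", "pyflann")):
    """Return the candidate module names that can be found on this system."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]

found = available_flann_bindings()
print("FLANN bindings found:", found or "none")
```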

A much faster version of the kNN estimator is enabled by the skl-groups-accel package, which you can get via:

$ pip install skl-groups-accel

It requires cyflann and a working C compiler with OpenMP support (i.e. gcc, not clang).

Comments

inf and nan values when linear and l2 divergences are used

Hi @dougalsutherland ,

Thanks for sharing your code; it is well documented and well written.

I am working on a problem and comparing different divergences. KL and Hellinger already produce good results, but I am interested in computing the linear affinity for a different purpose. Unfortunately, more than 99% of the computed affinity values are infinite and the rest are very large numbers. Do you know why that is and whether it can be resolved? The l2 distance also produces a lot of inf and nan values.

Thanks, Kayhan
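When reporting or debugging a problem like this, it can help to quantify how many entries of the estimated matrix are non-finite. The helper below is a generic NumPy sketch; the name `nonfinite_summary` is hypothetical and nothing here is part of skl-groups.

```python
import numpy as np

def nonfinite_summary(divs):
    """Report how many entries of a divergence matrix are inf or nan."""
    divs = np.asarray(divs, dtype=float)
    finite = np.isfinite(divs)
    return {
        "n_entries": int(divs.size),
        "frac_inf": float(np.isinf(divs).mean()),
        "frac_nan": float(np.isnan(divs).mean()),
        # Largest finite value, or None if nothing is finite.
        "max_finite": float(divs[finite].max()) if finite.any() else None,
    }

summary = nonfinite_summary([[0.0, np.inf], [np.nan, 2.0]])
```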

opened by kayhan-batmanghelich 7

TypeError in knn.pyc

Hi Dougal, I wanted to use skl-groups on some accelerometer data I have and was going through your example. I installed everything through conda and everything works until I try to fit the KNNDivergenceEstimator object. Any ideas?

from sklearn.pipeline import Pipeline
from skl_groups.divergences import KNNDivergenceEstimator
from skl_groups.kernels import PairwisePicker, Symmetrize, RBFize, ProjectPSD
model = Pipeline([
    ('divs', KNNDivergenceEstimator(div_funcs=['kl'], Ks=[2])),
    ('pick', PairwisePicker((0, 0))),
    ('symmetrize', Symmetrize()),
    ('rbf', RBFize(gamma=1, scale_by_median=True)),
    ('project', ProjectPSD()),
    ('svm', SVC(C=1, kernel='precomputed')),
])
model.fit(feats[:6], labels[:6]).predict(feats[6:])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-454843114715> in <module>()
     10     ('svm', SVC(C=1, kernel='precomputed')),
     11 ])
---> 12 model.fit(feats[:6], labels[:6]).predict(feats[6:])

/Users/wgmueller/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    127         data, then fit the transformed data using the final estimator.
    128         """
--> 129         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    130         self.steps[-1][-1].fit(Xt, y, **fit_params)
    131         return self

/Users/wgmueller/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
    117         for name, transform in self.steps[:-1]:
    118             if hasattr(transform, "fit_transform"):
--> 119                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    120             else:
    121                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/Users/wgmueller/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)
--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/Users/wgmueller/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.pyc in fit(self, X, y, get_rhos)
    308             memory = Memory(cachedir=memory, verbose=0)
    309 
--> 310         self.indices_ = id = memory.cache(_build_indices)(X, self._flann_args())
    311         if get_rhos:
    312             self.rhos_ = _get_rhos(X, id, Ks, max_K, save_all_Ks, self.min_dist)

/Users/wgmueller/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.pyc in _flann_args(self, X)
    264         # check that arguments are correct
    265         try:
--> 266             FLANNParameters().update(args)
    267         except AttributeError as e:
    268             msg = "flann_args contains an invalid argument:\n  {}"

TypeError: update() takes exactly 0 positional arguments (1 given)
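One reading of this message is that `update` on the installed parameters object accepts keyword arguments only, while the calling code passes a dict positionally. The thread does not confirm this, so the sketch below uses a hypothetical stand-in class (`KeywordOnlyParams`) instead of the real FLANN bindings, and the argument names in `args` are illustrative only.

```python
class KeywordOnlyParams:
    """Hypothetical stand-in for a parameters object with a keyword-only update."""

    def __init__(self):
        self.values = {}

    def update(self, **kwargs):
        self.values.update(kwargs)
        return self

args = {"algorithm": "kdtree", "trees": 4}

# Passing the dict positionally fails with a TypeError ...
try:
    KeywordOnlyParams().update(args)
    positional_ok = True
except TypeError:
    positional_ok = False

# ... while unpacking it into keyword arguments works.
params = KeywordOnlyParams().update(**args)
```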

opened by wgmueller1 3

negative kNN based divergences

I've been using KNNDivergenceEstimator to estimate the divergence between two samples. This might be a dumb question, but how can the Rényi-alpha divergence estimates be negative?

I understand that if I set clamp=True then it imposes a limit, but I'm not sure how the divergence estimator from B. Poczos, L. Xiong, D. J. Sutherland, & J. Schneider (2012) could return negative values.

Thanks!
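For intuition, kNN-based estimators are computed from finite samples and are not constrained to be non-negative, so estimates near zero can come out slightly negative. The sketch below shows this for a standard kNN estimator of the KL divergence (not the Rényi-alpha estimator from the paper cited above, and not the skl-groups implementation); clamping is mimicked with a plain `max(..., 0)`. The function name `knn_kl_estimate` is made up for this example.

```python
import numpy as np

def knn_kl_estimate(x, y, k=1):
    """kNN plug-in estimate of KL(p || q) from samples x ~ p and y ~ q."""
    n, d = x.shape
    m = y.shape[0]
    # Distance from each x_i to its k-th nearest neighbour among the other x's.
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)
    rho = np.sort(dxx, axis=1)[:, k - 1]
    # Distance from each x_i to its k-th nearest neighbour among the y's.
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    nu = np.sort(dxy, axis=1)[:, k - 1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(0)

# Both samples come from the same distribution, so the true divergence is 0
# and individual estimates scatter around it; some may fall below zero.
estimates = [
    knn_kl_estimate(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)), k=3)
    for _ in range(20)
]
clamped = [max(e, 0.0) for e in estimates]

# For clearly different distributions the estimate is well above zero.
far_apart = knn_kl_estimate(
    rng.normal(size=(200, 2)), rng.normal(size=(200, 2)) + 5.0, k=3
)
```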

opened by changhoonhahn 1

add L2 density estimate projection series transformer

Like in Junier's paper: http://jmlr.org/proceedings/papers/v33/oliva14a.pdf (but the RBF/ridge stuff can be handled by pipelines with sklearn.kernel_approximation.RBFSampler and sklearn.linear_model.Ridge)

Probably only support the cosine basis, but maybe design the API to allow other options.
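A minimal sketch of what such a transformer might compute, assuming one-dimensional data already scaled to [0, 1] and the orthonormal cosine basis phi_0(t) = 1, phi_j(t) = sqrt(2) cos(j pi t). The function name and the restriction to one dimension are choices made for this example, not a proposed API.

```python
import numpy as np

def cosine_basis_coefficients(x, n_terms):
    """Empirical projection coefficients of a 1-D sample in [0, 1].

    Each coefficient is the sample mean of one basis function, which
    estimates the inner product of the density with that function.
    """
    x = np.asarray(x, dtype=float)
    j = np.arange(n_terms)
    basis = np.sqrt(2.0) * np.cos(np.pi * j[None, :] * x[:, None])
    basis[:, 0] = 1.0  # phi_0 is the constant function
    return basis.mean(axis=0)

# Each bag becomes one fixed-length row of coefficients.
rng = np.random.default_rng(0)
bags = [rng.uniform(size=n) for n in (30, 50, 40)]
features = np.vstack([cosine_basis_coefficients(bag, 5) for bag in bags])
```

The resulting rows could then be passed to the RBF and ridge steps mentioned above through an ordinary pipeline.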
enhancement

opened by djsutherland 0

might be a bug in transform

Hi Dougal,

I think, there might be a bug here in this line about the median trick:

https://github.com/dougalsutherland/skl-groups/blob/2584c10a413626c6d5f9078cdbf3dcc84e4e9a5b/skl_groups/kernels/transform.py#L197

I think it should be scale = 1/median_, as you also mention in the help of the function. Your gamma is the inverse of the bandwidth (squared).

Best
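To make the point about the median trick concrete, here is a small NumPy sketch of an RBF transform in which gamma multiplies the squared distances and the scale is the reciprocal of the median squared distance. It follows the reading given in this comment and is not the library's actual code; `rbf_with_median_scaling` is a made-up name.

```python
import numpy as np

def rbf_with_median_scaling(dists, gamma=1.0, scale_by_median=True):
    """Turn a distance matrix into exp(-gamma * scale * dists**2)."""
    dists = np.asarray(dists, dtype=float)
    sq = dists ** 2
    scale = 1.0
    if scale_by_median:
        # Median of the off-diagonal squared distances.
        off_diag = sq[~np.eye(sq.shape[0], dtype=bool)]
        scale = 1.0 / np.median(off_diag)
    return np.exp(-gamma * scale * sq)

D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])
K = rbf_with_median_scaling(D)
```

With the scale defined as 1/median, multiplying every distance by a constant leaves the kernel unchanged, which is the usual motivation for the heuristic.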

opened by kayhan-batmanghelich 0

SVD issue

Hi Dougal,

I recently updated the package and conda, and I ran into this issue:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-75fc8d6dda07> in <module>()
----> 1 divKernel_PSD, K_Chol = computeKernel(div)

<ipython-input-15-adc14e99cc3e> in computeKernel(divMatrix, gammaVal, projectName)
     18     tic()
     19     print( "Projecting on the PSD cone ...." )
---> 20     K_PSD = fcn().fit_transform( div_symmet_norm_RBF )
     21     print( "Done !" )
     22     toc()

/home/batmanghelich/anaconda2/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    492         if y is None:
    493             # fit method of arity 1 (unsupervised transformation)
--> 494             return self.fit(X, **fit_params).transform(X)
    495         else:
    496             # fit method of arity 2 (supervised transformation)

/home/batmanghelich/anaconda2/lib/python2.7/site-packages/skl_groups/kernels/transform.pyc in fit(self, X, y)
    542 
    543         memory = get_memory(self.memory)
--> 544         lo, = memory.cache(scipy.linalg.eigvalsh)(X, eigvals=(0, 0))
    545         self.shift_ = max(self.min_eig - lo, 0)
    546 

/home/batmanghelich/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
    281 
    282     def __call__(self, *args, **kwargs):
--> 283         return self.func(*args, **kwargs)
    284 
    285     def call_and_shelve(self, *args, **kwargs):

/home/batmanghelich/anaconda2/lib/python2.7/site-packages/scipy/linalg/decomp.pyc in eigvalsh(a, b, lower, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite)
    682                 overwrite_a=overwrite_a, overwrite_b=overwrite_b,
    683                 turbo=turbo, eigvals=eigvals, type=type,
--> 684                 check_finite=check_finite)
    685 
    686 

/home/batmanghelich/anaconda2/lib/python2.7/site-packages/scipy/linalg/decomp.pyc in eigh(a, b, lower, eigvals_only, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite)
    347             (lo, hi) = eigvals
    348             w_tot, v, info = evr(a1, uplo=uplo, jobz=_job, range="I",
--> 349                                  il=lo, iu=hi, overwrite_a=overwrite_a)
    350             w = w_tot[0:hi-lo+1]
    351 

ValueError: On entry to SSBRDB parameter number 12 had an illegal value

As you can see, it is an SVD issue from scipy. I couldn't find a good solution for it. Do you have any idea how to resolve it?

Thanks
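The traceback shows the smallest eigenvalue being used to compute a shift, `max(min_eig - lo, 0)`. The sketch below reproduces that idea with NumPy's full symmetric eigenvalue routine, after casting to float64 and checking for non-finite entries. It is only a way to test whether the input matrix itself is the problem, under the assumption that the matrix is small enough for a full decomposition; it is not a confirmed fix for the LAPACK error above, and `shift_to_psd` is a made-up name.

```python
import numpy as np

def shift_to_psd(K, min_eig=0.0):
    """Add a multiple of the identity so the smallest eigenvalue is >= min_eig."""
    K = np.asarray(K, dtype=np.float64)
    if not np.all(np.isfinite(K)):
        raise ValueError("matrix contains inf or nan entries")
    K = (K + K.T) / 2.0               # enforce exact symmetry
    lo = np.linalg.eigvalsh(K)[0]     # eigenvalues come back in ascending order
    shift = max(min_eig - lo, 0.0)
    return K + shift * np.eye(K.shape[0]), shift

# Eigenvalues of this matrix are -1 and 3, so the shift is 1.
shifted, shift = shift_to_psd([[1.0, 2.0], [2.0, 1.0]])
```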

opened by kayhan-batmanghelich 2

log-space computations

Started in https://github.com/dougalsutherland/skl-groups/commit/2584c10a413626c6d5f9078cdbf3dcc84e4e9a5b, but:

should check it doesn't slow things down too much

should think a little more carefully about the exact things we're doing

should add a numerical stability check

should do it for the Cython alpha-divergence estimator too
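As a sketch of why log-space can matter, consider the textbook kNN density estimate k / (n * c_d * rho^d), where rho is the distance to the k-th neighbour and c_d is the volume of the unit ball in d dimensions. In high dimensions rho^d can underflow to zero, while the logarithm stays finite. The functions below are illustrative, use only the standard library, and are not taken from the commit linked above.

```python
import math

def log_unit_ball_volume(d):
    """log of the volume of the unit ball in d dimensions."""
    return (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)

def log_knn_density(rho, k, n, d):
    """log of the kNN density estimate, computed entirely in log space."""
    return math.log(k) - math.log(n) - log_unit_ball_volume(d) - d * math.log(rho)

def knn_density_direct(rho, k, n, d):
    """Same estimate computed directly; the volume may underflow to zero."""
    volume = math.exp(log_unit_ball_volume(d)) * rho ** d
    return k / (n * volume) if volume > 0 else math.inf

# In 300 dimensions the direct computation underflows and returns inf,
# while the log-space value remains finite.
direct_high_dim = knn_density_direct(rho=0.01, k=3, n=100, d=300)
log_high_dim = log_knn_density(rho=0.01, k=3, n=100, d=300)
```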
opened by djsutherland 0

add MMD two sample tests
[ ] Helper function to compute the two-sample test statistic

[ ] Kernel learning

Null statistic via:

[ ] bootstrapping

[ ] concentration inequalities

[ ] asymptotic approximations

... what other ways are there?
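One further option, sketched here with NumPy, is a permutation test: pool the two samples, reshuffle the group labels, and recompute the statistic each time. The sketch uses the biased MMD^2 estimate with an RBF kernel; the function names, the fixed `gamma`, and the number of permutations are all choices made for this example, not part of the package.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())

def permutation_test(x, y, gamma=1.0, n_perm=200, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        stat = mmd2_biased(pooled[perm[:n]], pooled[perm[n:]], gamma)
        exceed += stat >= observed
    return observed, (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
y = rng.normal(size=(50, 2)) + 3.0   # clearly shifted second sample
stat, p_value = permutation_test(x, y)
```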

enhancement
opened by djsutherland 0

Owner

Danica J. Sutherland

Machine learning professor.

GitHub

A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.

scikit-rebate This package includes a scikit-learn-compatible Python implementation of ReBATE,

374 Dec 15, 2022

Automatic extraction of relevant features from time series:

tsfresh This repository contains the TSFRESH python package. The abbreviation stands for "Time Series Feature extraction based on scalable hypothesis

7k Jan 3, 2023

Auto updating website that tracks closed & open issues/PRs on scikit-learn/scikit-learn.

Repository Status for Scikit-learn Live webpage Auto updating website that tracks closed & open issues/PRs on scikit-learn/scikit-learn. Running local

6 Dec 27, 2022

A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

Feature Forge This library provides a set of tools that can be useful in many machine learning applications (classification, clustering, regression, e

380 Nov 5, 2022

Let's learn how to build, release and operate your containerized applications to Amazon ECS and AWS Fargate using AWS Copilot.

Welcome to AWS Copilot Workshop In this workshop, you'll learn how to build, release and operate your containerised applications to Amazon ECS and

15 Jul 14, 2022

MMdnn is a set of tools to help users inter-operate among different deep learning frameworks. E.g. model conversion and visualization. Convert models between Caffe, Keras, MXNet, Tensorflow, CNTK, PyTorch Onnx and CoreML.

MMdnn MMdnn is a comprehensive and cross-framework tool to convert, visualize and diagnose deep learning (DL) models. The "MM" stands for model manage

5.7k Jan 9, 2023

PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn PySpark + Scikit-learn = Sparkit-learn GitHub: https://github.com/lensacom/sparkit-learn About Sparkit-learn aims to provide scikit-lear

1.1k Jan 4, 2023

Python based script to operate FFMPEG.

FMPConvert Python based script to operate FFMPEG. Ver 1.0 -- 2022.02.08 Feature ✅ Maximum compatibility: Third-party dependency libraries unused ✅ Che

1 Feb 28, 2022

Automatically download the cwru data set, and then divide it into training data set and test data set

Automatically download the cwru data set, and then divide it into training data set and test data set.

6 Jun 27, 2022

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR 2022)

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR2022)[paper] Authors: Chenhang He, Ruihuang Li, Shuai Li, L

141 Dec 30, 2022

PyTime is an easy-use Python module which aims to operate date/time/datetime by string.

PyTime PyTime is an easy-use Python module which aims to operate date/time/datetime by string. PyTime allows you using nonregular datetime string to g

148 Dec 9, 2022

If you only have hash, you can still operate exchange

PTH Exchange If you only have hash, you can still operate exchange This project module is the same as my other project Exchange_SSRF, This project use

37 Dec 26, 2022

An example Music Bot written in Disnake and uses slash commands to operate.

Music Bot An example music bot that is written in Disnake [Maintained discord.py Fork] Disnake Disnake is a maintained and updated fork of discord.py.

6 Jan 8, 2022

Pyoccur - Python package to operate on occurrences (duplicates) of elements in lists

pyoccur Python Occurrence Operations on Lists About Package A simple python package with 3 functions has_dup() get_dup() remove_dup() Currently the du

6 Jan 7, 2023

A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

802 Jan 1, 2023
