scikit-learn addon to operate on set/"group"-based features

Overview

skl-groups

skl-groups is a package for machine learning on sets (or "groups") of features in Python. It extends the scikit-learn library in two ways: by transforming sets into feature vectors that standard scikit-learn constructs can operate on, or by computing pairwise similarity (or divergence) matrices that can be turned into kernels for use in scikit-learn.
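
As a rough illustration of those two routes, the sketch below uses only the Python standard library and does not use the skl-groups API. The per-dimension mean is a deliberately simple stand-in for a set-to-vector transform, and the helper names `mean_embedding` and `rbf_kernel_matrix` are made up for this example.

```python
import math

def mean_embedding(bag):
    """Summarize a set of feature vectors by its per-dimension mean."""
    n = len(bag)
    dim = len(bag[0])
    return [sum(point[j] for point in bag) / n for j in range(dim)]

def rbf_kernel_matrix(vectors, gamma=1.0):
    """Pairwise RBF kernel between fixed-length vectors."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[math.exp(-gamma * sq_dist(u, v)) for v in vectors] for u in vectors]

# Three "groups", each holding a different number of 2-D points.
bags = [
    [(0.0, 0.0), (0.2, 0.0), (0.1, 0.3)],
    [(0.1, 0.1), (0.1, 0.1)],
    [(5.0, 5.0), (5.5, 4.5), (4.5, 5.5), (5.0, 5.0)],
]

# Route 1: one fixed-length feature vector per set.
features = [mean_embedding(bag) for bag in bags]

# Route 2: a set-by-set similarity matrix that can serve as a kernel.
K = rbf_kernel_matrix(features, gamma=0.5)
```

A matrix like `K` is the kind of input that an estimator configured with `kernel='precomputed'` expects, while `features` can be fed to any estimator that takes ordinary feature vectors.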

For an introduction to the package, why you might want to use it, and how to do so, check out the documentation.

skl-groups is still in fairly early development. The precursor package, py-sdm, is still somewhat easier to use for some tasks (though it has less functionality and less documentation); skl-groups will hopefully match it in the next few weeks. Feel free to get in touch ([email protected]) if you're interested.

Installation

Full instructions are in the documentation, but the short version is to do:

$ conda install -c dougal -c r skl-groups

if you use conda, or:

$ pip install skl-groups

if not. If you pip install and want to use the kNN divergence estimator, you'll need to install either cyflann or the regular pyflann bindings to FLANN, and you'll want a version of FLANN with OpenMP support.

A much faster version of the kNN estimator is enabled by the skl-groups-accel package, which you can get via:

$ pip install skl-groups-accel

It requires cyflann and a working C compiler with OpenMP support (i.e. gcc, not clang).
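
To see which FLANN bindings are visible to the current Python environment before trying the kNN estimator, a small check like the following may help. It assumes the importable module names match the package names `cyflann` and `pyflann`; the helper `available_flann_bindings` is made up for this example.

```python
import importlib.util

def available_flann_bindings(candidates=("cyflann", "pyflann")):
    """Return the candidate module names that can be imported here."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]

found = available_flann_bindings()
if found:
    print("FLANN bindings found:", ", ".join(found))
else:
    print("No FLANN bindings found; install cyflann or pyflann "
          "to use the kNN divergence estimator.")
```

This only checks that the modules can be located; it does not verify that the underlying FLANN library was built with OpenMP support.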

Comments
  • inf and nan values when linear and l2 divergences are used

    Hi @dougalsutherland ,

    Thanks for sharing your code; it is well documented and well written.

    I am working on a problem and comparing different divergences. KL and Hellinger already produce good results, but I am interested in computing the linear affinity for a different purpose. Unfortunately, more than 99% of the computed affinity values are infinity and the rest are very large numbers. Do you know why that is and whether it can be resolved? The l2 distance also produces a lot of inf and nan values.

    Thanks, Kayhan

    opened by kayhan-batmanghelich 7
  • TypeError in knn.pyc

    Hi Dougal, I wanted to use skl-groups on some accelerometer data I have and was going through your example. I installed everything through conda and everything works until I try to fit the KNNDivergenceEstimator object. Any ideas?

    from sklearn.pipeline import Pipeline
    from skl_groups.divergences import KNNDivergenceEstimator
    from skl_groups.kernels import PairwisePicker, Symmetrize, RBFize, ProjectPSD
    model = Pipeline([
        ('divs', KNNDivergenceEstimator(div_funcs=['kl'], Ks=[2])),
        ('pick', PairwisePicker((0, 0))),
        ('symmetrize', Symmetrize()),
        ('rbf', RBFize(gamma=1, scale_by_median=True)),
        ('project', ProjectPSD()),
        ('svm', SVC(C=1, kernel='precomputed')),
    ])
    model.fit(feats[:6], labels[:6]).predict(feats[6:])
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-15-454843114715> in <module>()
         10     ('svm', SVC(C=1, kernel='precomputed')),
         11 ])
    ---> 12 model.fit(feats[:6], labels[:6]).predict(feats[6:])
    
    /Users/wgmueller/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
        127         data, then fit the transformed data using the final estimator.
        128         """
    --> 129         Xt, fit_params = self._pre_transform(X, y, **fit_params)
        130         self.steps[-1][-1].fit(Xt, y, **fit_params)
        131         return self
    
    /Users/wgmueller/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
        117         for name, transform in self.steps[:-1]:
        118             if hasattr(transform, "fit_transform"):
    --> 119                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
        120             else:
        121                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
    
    /Users/wgmueller/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
        427         else:
        428             # fit method of arity 2 (supervised transformation)
    --> 429             return self.fit(X, y, **fit_params).transform(X)
        430 
        431 
    
    /Users/wgmueller/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.pyc in fit(self, X, y, get_rhos)
        308             memory = Memory(cachedir=memory, verbose=0)
        309 
    --> 310         self.indices_ = id = memory.cache(_build_indices)(X, self._flann_args())
        311         if get_rhos:
        312             self.rhos_ = _get_rhos(X, id, Ks, max_K, save_all_Ks, self.min_dist)
    
    /Users/wgmueller/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.pyc in _flann_args(self, X)
        264         # check that arguments are correct
        265         try:
    --> 266             FLANNParameters().update(args)
        267         except AttributeError as e:
        268             msg = "flann_args contains an invalid argument:\n  {}"
    
    TypeError: update() takes exactly 0 positional arguments (1 given)
    
    opened by wgmueller1 3
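
One plausible reading of the final error message is that the `update()` method of the installed bindings accepts keyword arguments only, so passing the argument dict positionally fails. This is a guess from the message alone. The class below is a hypothetical stand-in, not the real `FLANNParameters`, and the argument values are illustrative.

```python
class KeywordOnlyParams:
    """Hypothetical stand-in for a parameter object whose update()
    accepts keyword arguments only (not the real FLANN bindings)."""

    def __init__(self):
        self.values = {}

    def update(self, **kwargs):
        self.values.update(kwargs)
        return self

args = {"algorithm": "kdtree", "trees": 4}  # illustrative values only

try:
    KeywordOnlyParams().update(args)         # dict passed positionally
    positional_ok = True
except TypeError:
    positional_ok = False                    # same kind of failure as in the traceback

params = KeywordOnlyParams().update(**args)  # dict unpacked into keywords
```

If this reading is right, the mismatch is between the calling convention skl-groups uses and the one the installed bindings expect, which would point to a version incompatibility rather than a problem with the input data.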
  • negative kNN based divergences

    I've been using KNNDivergenceEstimator to estimate the divergence between two samples. This might be a dumb question, but how can the Rényi-alpha divergence estimates be negative?

    I understand that if I set clamp=True then it imposes a limit, but I'm not sure how the divergence estimator from B. Poczos, L. Xiong, D. J. Sutherland, & J. Schneider (2012) could return negative values.

    Thanks!

    opened by changhoonhahn 1
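
To illustrate how a nearest-neighbour divergence estimate can come out negative on finite samples, here is a minimal sketch. It uses the simpler, widely known k-NN estimator of the KL divergence for 1-D data, not the Rényi-alpha estimator the question refers to and not the skl-groups implementation. Treating clamping as a clip at zero is an assumption.

```python
import math

def knn_kl_estimate(xs, ys, k=1):
    """Textbook k-NN estimate of KL(P || Q) from 1-D samples xs ~ P, ys ~ Q.

    Assumes no duplicated points, so all neighbour distances are nonzero.
    """
    n, m = len(xs), len(ys)
    total = 0.0
    for i, x in enumerate(xs):
        # rho: distance to the k-th nearest neighbour among the other xs
        rho = sorted(abs(x - other) for j, other in enumerate(xs) if j != i)[k - 1]
        # nu: distance to the k-th nearest neighbour among the ys
        nu = sorted(abs(x - y) for y in ys)[k - 1]
        total += math.log(nu / rho)
    return total / n + math.log(m / (n - 1))

xs = [0.0, 1.0, 2.0, 3.0]
ys_close = [0.1, 1.1, 2.1, 3.1]      # every x has a y much closer than any other x
ys_far = [10.1, 11.1, 12.1, 13.1]    # every y is far from all xs

est_close = knn_kl_estimate(xs, ys_close)  # negative: nu < rho for every point
est_far = knn_kl_estimate(xs, ys_far)      # positive: nu > rho for every point
clamped = max(est_close, 0.0)              # assumed meaning of clamping
```

The true divergence is never negative, but the estimate is a noisy function of neighbour distances: whenever points from the second sample happen to sit closer than points from the first, the log ratios turn negative.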
  • add L2 density estimate projection series transformer

    Like in Junier's paper: http://jmlr.org/proceedings/papers/v33/oliva14a.pdf (but the RBF/ridge stuff can be handled by pipelines with sklearn.kernel_approximation.RBFSampler and sklearn.linear_model.Ridge)

    Probably only support the cosine basis, but maybe design the API to allow other options.

    enhancement 
    opened by djsutherland 0
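
A minimal sketch of what such a transformer could compute with the cosine basis, assuming 1-D data already rescaled to [0, 1]. The function names are made up for this example, and a real implementation would need to handle multivariate data and the choice of how many terms to keep.

```python
import math

def cosine_basis(j, x):
    """j-th orthonormal cosine basis function on [0, 1]."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)

def projection_coefficients(sample, n_terms):
    """Estimate the first n_terms series coefficients of the sample's density."""
    n = len(sample)
    return [sum(cosine_basis(j, x) for x in sample) / n for j in range(n_terms)]

def density_estimate(coeffs, x):
    """Evaluate the truncated series density estimate at x."""
    return sum(a * cosine_basis(j, x) for j, a in enumerate(coeffs))

# Each bag of points becomes one fixed-length coefficient vector.
bag_a = [0.10, 0.15, 0.20, 0.30]
bag_b = [0.70, 0.80, 0.85, 0.90, 0.95]
feats = [projection_coefficients(bag, n_terms=5) for bag in (bag_a, bag_b)]
```

Because the basis is orthonormal, the Euclidean distance between two coefficient vectors approximates the L2 distance between the corresponding density estimates, which is why the vectors can be passed on to standard scikit-learn components.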
  • might be a bug in transform

    Hi Dougal,

    I think there might be a bug in this line, which implements the median trick:

    https://github.com/dougalsutherland/skl-groups/blob/2584c10a413626c6d5f9078cdbf3dcc84e4e9a5b/skl_groups/kernels/transform.py#L197

    I think it should be scale = 1/median_, as you also mention in the function's documentation. Your gamma is the inverse of the (squared) bandwidth.

    Best

    opened by kayhan-batmanghelich 0
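
The point about the scale can be sketched as follows. This is not the skl-groups code: `rbf_from_distances` is a made-up helper, and taking the median over the off-diagonal squared distances is an assumption.

```python
import math
from statistics import median

def rbf_from_distances(divs, gamma=1.0, scale_by_median=False):
    """Turn a square matrix of distances into an RBF kernel matrix."""
    n = len(divs)
    sq = [[d * d for d in row] for row in divs]
    scale = 1.0
    if scale_by_median:
        # Assumption: the median is taken over off-diagonal squared distances.
        off_diag = [sq[i][j] for i in range(n) for j in range(n) if i != j]
        scale = 1.0 / median(off_diag)
    # gamma multiplies the squared distance, so it acts as an inverse squared bandwidth.
    return [[math.exp(-gamma * scale * v) for v in row] for row in sq]

divs = [[0, 1, 2],
        [1, 0, 3],
        [2, 3, 0]]
K = rbf_from_distances(divs, gamma=1.0, scale_by_median=True)

# Rescaling every distance by a constant leaves the kernel unchanged.
K10 = rbf_from_distances([[10 * d for d in row] for row in divs],
                         gamma=1.0, scale_by_median=True)
```

With scale = 1/median, an entry whose squared distance equals the median maps to exp(-gamma), and the kernel does not depend on the units of the distances. Multiplying by the median instead would make the kernel more sensitive to the units, not less.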
  • SVD issue

    Hi Dougal,

    I recently updated the package and conda, and I ran into this issue:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-16-75fc8d6dda07> in <module>()
    ----> 1 divKernel_PSD, K_Chol = computeKernel(div)
    
    <ipython-input-15-adc14e99cc3e> in computeKernel(divMatrix, gammaVal, projectName)
         18     tic()
         19     print( "Projecting on the PSD cone ...." )
    ---> 20     K_PSD = fcn().fit_transform( div_symmet_norm_RBF )
         21     print( "Done !" )
         22     toc()
    
    /home/batmanghelich/anaconda2/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
        492         if y is None:
        493             # fit method of arity 1 (unsupervised transformation)
    --> 494             return self.fit(X, **fit_params).transform(X)
        495         else:
        496             # fit method of arity 2 (supervised transformation)
    
    /home/batmanghelich/anaconda2/lib/python2.7/site-packages/skl_groups/kernels/transform.pyc in fit(self, X, y)
        542 
        543         memory = get_memory(self.memory)
    --> 544         lo, = memory.cache(scipy.linalg.eigvalsh)(X, eigvals=(0, 0))
        545         self.shift_ = max(self.min_eig - lo, 0)
        546 
    
    /home/batmanghelich/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
        281 
        282     def __call__(self, *args, **kwargs):
    --> 283         return self.func(*args, **kwargs)
        284 
        285     def call_and_shelve(self, *args, **kwargs):
    
    /home/batmanghelich/anaconda2/lib/python2.7/site-packages/scipy/linalg/decomp.pyc in eigvalsh(a, b, lower, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite)
        682                 overwrite_a=overwrite_a, overwrite_b=overwrite_b,
        683                 turbo=turbo, eigvals=eigvals, type=type,
    --> 684                 check_finite=check_finite)
        685 
        686 
    
    /home/batmanghelich/anaconda2/lib/python2.7/site-packages/scipy/linalg/decomp.pyc in eigh(a, b, lower, eigvals_only, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite)
        347             (lo, hi) = eigvals
        348             w_tot, v, info = evr(a1, uplo=uplo, jobz=_job, range="I",
    --> 349                                  il=lo, iu=hi, overwrite_a=overwrite_a)
        350             w = w_tot[0:hi-lo+1]
        351 
    
    ValueError: On entry to SSBRDB parameter number 12 had an illegal value
    

    As you can see, it is an SVD issue coming from scipy. I couldn't find a good solution for it. Do you have any idea how to resolve it?

    Thanks

    opened by kayhan-batmanghelich 2
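
The failing frame computes the smallest eigenvalue of the matrix and derives a shift from it. The sketch below does the same with NumPy's `eigvalsh` (swapped in for the `scipy.linalg.eigvalsh` call in the traceback, and computing all eigenvalues rather than only the lowest). Adding the shift to the diagonal is an assumption about how the shift is used, and the float64 cast and finiteness check are generic defensive steps, not a confirmed fix for this error.

```python
import numpy as np

def shift_to_psd(K, min_eig=0.0):
    """Shift a symmetric matrix so its smallest eigenvalue is at least min_eig."""
    K = np.asarray(K, dtype=np.float64)
    if not np.all(np.isfinite(K)):
        raise ValueError("kernel matrix contains inf or nan entries")
    K = (K + K.T) / 2.0                # enforce exact symmetry
    lo = np.linalg.eigvalsh(K)[0]      # eigenvalues are returned in ascending order
    shift = max(min_eig - lo, 0.0)
    # Assumption: the shift is applied by adding it to the diagonal.
    return K + shift * np.eye(K.shape[0]), shift

K = [[1.0, 2.0],
     [2.0, 1.0]]                       # eigenvalues -1 and 3
K_psd, shift = shift_to_psd(K)
```

Running the input through a function like this before the library call can at least rule out non-finite entries or an unexpected dtype as contributing factors.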
  • log-space computations

    Started in https://github.com/dougalsutherland/skl-groups/commit/2584c10a413626c6d5f9078cdbf3dcc84e4e9a5b, but:

    • should check it doesn't slow things down too much
    • should think a little more carefully about the exact things we're doing
    • should add a numerical stability check
    • should do it for the Cython alpha-divergence estimator too
    opened by djsutherland 0
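
As a generic illustration of what a log-space computation buys, and not the code from the linked commit, the helpers below average quantities that are only available as logarithms without ever exponentiating them directly. The function names are made up for this example.

```python
import math

def logsumexp(log_values):
    """Stable log(sum(exp(v) for v in log_values))."""
    hi = max(log_values)
    if math.isinf(hi):
        return hi  # all terms are -inf (sum is zero), or some term is +inf
    return hi + math.log(sum(math.exp(v - hi) for v in log_values))

def log_mean_exp(log_values):
    """Stable log of the mean of exp(v) over log_values."""
    return logsumexp(log_values) - math.log(len(log_values))

log_terms = [-1000.0, -1000.0]
# Naively, math.exp(-1000.0) underflows to 0.0, so math.log(0.0 + 0.0) would fail.
stable = logsumexp(log_terms)
stable_mean = log_mean_exp(log_terms)
```

Subtracting the maximum before exponentiating keeps every intermediate value in a representable range, at the cost of one extra pass over the data, which is relevant to the concern above about slowing things down.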
  • add MMD two sample tests

    • [ ] Helper function to compute the two-sample test statistic
      • [ ] Kernel learning
    • Null statistic via:
      • [ ] bootstrapping
      • [ ] concentration inequalities
      • [ ] asymptotic approximations
      • ... what other ways are there?
    enhancement 
    opened by djsutherland 0
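
A minimal sketch of the statistic helper and one way to obtain a null distribution, for 1-D samples and a fixed RBF kernel. It uses a permutation test, a resampling approach related to but distinct from the bootstrapping item above, and it does no kernel learning. All names are made up for this example.

```python
import math
import random

def rbf(x, y, gamma):
    """RBF kernel between two scalars."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2_unbiased(xs, ys, gamma=1.0):
    """Unbiased estimate of squared MMD between 1-D samples with an RBF kernel."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for i, a in enumerate(xs)
              for j, b in enumerate(xs) if i != j) / (n * (n - 1))
    kyy = sum(rbf(a, b, gamma) for i, a in enumerate(ys)
              for j, b in enumerate(ys) if i != j) / (m * (m - 1))
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

def permutation_p_value(xs, ys, gamma=1.0, n_perm=200, seed=0):
    """Return the observed statistic and a permutation-based p-value."""
    rng = random.Random(seed)
    observed = mmd2_unbiased(xs, ys, gamma)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign points to the two groups at random
        if mmd2_unbiased(pooled[:n], pooled[n:], gamma) >= observed:
            exceed += 1
    # The +1 terms count the observed split itself, so the p-value is never zero.
    return observed, (exceed + 1) / (n_perm + 1)

xs = [0.1 * i for i in range(10)]
ys = [5.0 + 0.1 * i for i in range(10)]
stat, p_value = permutation_p_value(xs, ys)
```

The permutation null needs no distributional assumptions but costs one full statistic evaluation per permutation; the concentration-inequality and asymptotic options listed above trade that cost for looser or approximate thresholds.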
Owner
Danica J. Sutherland
Machine learning professor.