scikit-learn: machine learning in Python

Overview

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.7)
  • NumPy (>= 1.14.6)
  • SciPy (>= 1.1.0)
  • joblib (>= 0.11)
  • threadpoolctl (>= 2.0.0)

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 0.23 and later require Python 3.6 or newer. scikit-learn 1.0 and later require Python 3.7 or newer.

Scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with "Display") require Matplotlib (>= 2.2.2). Running the examples also requires Matplotlib >= 2.2.2. A few examples require scikit-image >= 0.14.5, a few require pandas >= 0.25.0, and some require seaborn >= 0.9.0.

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is with pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Important links

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 5.0.1 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/advanced_installation.html#testing for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
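
For example, to run the test suite with a fixed seed (42 here is an arbitrary choice):

SKLEARN_SEED=42 pytest sklearn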

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

Comments
  • Comparing the performance of different clustering algorithms on toy datasets when adding high-dimensional Gaussian noise

    Aim: analyze how the performance of different clustering algorithms on different datasets changes as noise of increasing dimensionality is added.

    This demo is a Jupyter notebook documenting the effect of adding noise of different dimensionalities to a dataset. Several types of synthetic datasets are generated, Gaussian noise of varying dimensionality is appended to each, and the performance of each clustering algorithm is measured after the noise is added. This is repeated for noise with different variances.

    Output: plots comparing the effect of varying noise dimensionality on each clustering algorithm for each dataset. In this grid of subplots, the variance of the added noise varies across columns and the dataset varies across rows.

    Link to the demo: https://nbviewer.jupyter.org/github/sree0917/scikit-learn/blob/master/clustering_comparison_pr.ipynb
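
    A minimal sketch of this setup (illustrative only, not the notebook's code; KMeans stands in for one of the compared algorithms):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.RandomState(0)
    X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

    for n_noise_dims in (0, 10, 100, 1000):
        # Append Gaussian noise columns; scale sets the noise variance.
        noise = rng.normal(scale=1.0, size=(X.shape[0], n_noise_dims))
        X_noisy = np.hstack([X, noise])
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_noisy)
        print(n_noise_dims, adjusted_rand_score(y, labels))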

    opened by sree0917 7
  • Analyzing the effect of dimensionality reduction on the accuracy of different classifiers on different types of datasets

    This will be a document showing the effect of dimensionality reduction on the accuracy of different classifiers. The document will contain simulations on high-dimensional datasets of different shapes: each dataset is synthesized from established sklearn datasets and has 1000 dimensions, of which only 2 carry data; the rest are noise dimensions.

    Questions to answer:

    • How does dimensionality reduction help classification for different classifiers?
    • How do classifiers perform as the number of dimensions retained from the original high-dimensional dataset varies?

    Pipeline to be followed (a minimal sketch appears after the experiments link below):

    • Define a dataset using established sklearn synthetic datasets with high dimensionality.
    • Perform classification on the raw data and measure accuracy to quantify a baseline.
    • Apply the dimensionality reduction technique, keeping a varying number of reduced dimensions.
    • Re-check classification performance after each reduction.

    The output of the PR would be a figure showing the different datasets, comparing the accuracies of different classifiers with and without dimensionality reduction, and a plot showing how accuracy varies with the number of retained dimensions.

    Experiments to follow: https://github.com/parimal173/scikit-learn/blob/parimal173-patch-1/Final_50trials_9datasets.ipynb
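
    A minimal sketch of this pipeline (illustrative parameters; PCA stands in for whichever reduction technique the notebook uses):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # 1000 total dimensions, only 2 informative; the rest behave as noise.
    X, y = make_classification(n_samples=1000, n_features=1000, n_informative=2,
                               n_redundant=0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for n_components in (2, 10, 50, 200):
        pca = PCA(n_components=n_components).fit(X_train)
        clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)
        acc = accuracy_score(y_test, clf.predict(pca.transform(X_test)))
        print(n_components, acc)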

    opened by parimal173 6
  • FIX Adjust builder to support a different split record

    This is an example of how to get the builder to work with the different types of split records.

    This is functional, but I do not like the malloc + free in the builder.

    All the setup.py language="c++" changes were needed to get the PR to compile on my machine.

    CC @adam2392

    opened by thomasjpfan 4
  • Merge main to obliquepr

    I did not have the rights to directly push to the PR branch.

    There is no reason to skip a common test for a new PR, so I removed the new classes from the ignorelist. If common tests fail, they should be fixed.

    opened by ogrisel 0
  • [DRAFT] Implement histogram binning

    Co-Authored-By: p-teng [email protected]

    Reference Issues/PRs

    Fixes: #23

    What does this implement/fix? Explain your changes.

    Any other comments?

    opened by adam2392 0
  • Create RF vs OF benchmarking notebook

    Reference Issues/PRs

    Compare the TC of the oblique forest (OF) against the random forest (RF) using the bench_tree.py criteria

    What does this implement/fix? Explain your changes.

    [x] Add a notebook that compares TC of RF and OF varying max_features

    Any other comments?

    @adam2392 - I've made a separate dev_notebook folder where I placed the notebook

    opened by jshinm 0
  • Minor Fix on ObliqueRandomForestClassifier

    Reference Issues/PRs

    N/A

    What does this implement/fix? Explain your changes.

    • [x] ObliqueRandomForestClassifier() initializes correctly
    • [x] ObliqueRandomForestClassifier() can be found in sklearn.ensemble

    Any other comments?

    @adam2392 -- please review the correction

    opened by jshinm 0
  • ENH transfer streaming tree function to fork

    Reference Issues/PRs

    scikit-learn#18888 scikit-learn#18889

    What does this implement/fix? Explain your changes.

    Add partial_fit function to DecisionTreeClassifier
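
    A hypothetical usage sketch, assuming the proposed partial_fit follows scikit-learn's incremental-learning convention of passing classes on the first call (this method does not exist in upstream scikit-learn):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X_batches = np.array_split(rng.rand(300, 4), 3)
    y_batches = np.array_split(rng.randint(0, 2, size=300), 3)

    tree = DecisionTreeClassifier(random_state=0)
    for i, (Xb, yb) in enumerate(zip(X_batches, y_batches)):
        if i == 0:
            tree.partial_fit(Xb, yb, classes=np.array([0, 1]))  # proposed API
        else:
            tree.partial_fit(Xb, yb)  # proposed API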

    Any other comments?

    enhancement 
    opened by PSSF23 0
  • Add files via upload

    (Same description as the earlier PR "Analyzing the effect of dimensionality reduction on the accuracy of different classifiers on different types of datasets" above.)

    Experiments to follow: https://github.com/NeuroDataDesign/team-forbidden-forest/blob/master/Parimal%20Joshi/Final_pr_2.ipynb

    opened by parimal173 0
  • [DRAFT] Allowing trees to bin data

    Reference Issues/PRs

    Fixes: #23

    What does this implement/fix? Explain your changes.

    This aims to implement binning capabilities to massively improve the speed of training decision trees. Currently, this tries to add binning in a way that plays well with the existing codebase.

    Unfortunately, the code from #24 is far from complete and does not do the job.

    Right now, what is missing is:

    • how to implement binning that is consistent across the fit and predict (i.e. apply) APIs (see the sketch after this list)
    • can we simplify the API?
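
    A minimal sketch of one way to keep binning consistent between fit and apply (illustrative only, not this PR's implementation): learn bin thresholds once at fit time, store them, and reuse them to bin prediction inputs.

    import numpy as np

    def fit_bin_thresholds(X, max_bins=255):
        # One array of quantile thresholds per feature, learned at fit time.
        quantiles = np.linspace(0, 100, max_bins + 1)[1:-1]
        return [np.percentile(X[:, j], quantiles) for j in range(X.shape[1])]

    def bin_data(X, thresholds):
        # Reusing the stored thresholds keeps fit and apply in the same bins.
        binned = np.empty(X.shape, dtype=np.uint8)
        for j, t in enumerate(thresholds):
            binned[:, j] = np.searchsorted(t, X[:, j])
        return binned

    rng = np.random.RandomState(0)
    X_train, X_test = rng.rand(100, 3), rng.rand(20, 3)
    thresholds = fit_bin_thresholds(X_train)
    Xb_train = bin_data(X_train, thresholds)
    Xb_test = bin_data(X_test, thresholds)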

    Any other comments?

    opened by adam2392 0
  • [ENH] Adding binning capabilities to decision trees

    Describe the workflow you want to enable

    I am part of the @neurodata team. Binning features has yielded large efficiency gains with little loss in performance in gradient-boosted trees. This feature should not be limited to gradient-boosted trees; it should be available in all decision trees [1].

    By including binning as a feature for decision trees, we would enable massive speedups for decision trees that operate on high-dimensional data (in both feature count and sample size). This would be an additional trade-off that users can opt into. The intuition behind binning for decision trees is exactly that of gradient-boosted trees.

    Describe your proposed solution

    We propose introducing binning to the decision tree classifier and regressor.

    An initial PR is proposed here: https://github.com/neurodata/scikit-learn/pull/24#pullrequestreview-1009913319. However, it seems that many of the files were copied, and it is not 100% clear whether they are all needed. Perhaps we can explore consolidating the _binning.py/pyx files with the current versions under ensemble/_hist_gradient_boosting/*.

    Changes to the Cython codebase

    TBD

    Changes to the Python API

    The following two parameters would be added to the DecisionTreeClassifier and Regressor:

    hist_binning=False,
    max_bins=255
    

    where the default number of bins follows that of histogram-based gradient boosting.
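
    The proposed parameters are hypothetical. As a rough approximation of the intended behavior with today's API, one can pre-bin features with KBinsDiscretizer before fitting an ordinary tree (a sketch under that assumption):

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Quantile binning into at most 255 bins, mirroring the hist gradient
    # boosting default, then an unmodified decision tree on the binned values.
    clf = make_pipeline(
        KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile"),
        DecisionTreeClassifier(random_state=0),
    )
    clf.fit(X, y)

    Unlike the proposal, this bins once globally rather than inside the tree's splitter, but the pipeline keeps fit and predict consistent.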

    Additional context

    These changes can also trivially be applied to Oblique Trees.

    References: [1] https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree

    Needs Triage 
    opened by adam2392 5
  • Add interpretability example notebooks

    Reference Issues/PRs

    What does this implement/fix? Explain your changes.

    Add 3 interpretability example notebooks

    • [x] Iris notebook
    • [x] Simulation notebook
    • [x] MNIST notebook

    Any other comments?

    opened by jshinm 2
  • [TEST PR] Adding oblique trees (i.e. Forest-RC) to cythonized tree module

    Reference Issues/PRs

    Fixes:

    What does this implement/fix? Explain your changes.

    Adds cythonized oblique trees to the tree module. This is known as Forest-RC in the Breiman 2001 paper.

    • _oblique_tree.pxd/pyx: This file implements i) the ObliqueTree(Tree), which defines a few additional class members for storing the projection weights and indices and a new function for adding an oblique node, and ii) the ObliqueTreeBuilder(TreeBuilder), which defines how to build the oblique tree.
    • _oblique_splitter.pxd/pyx: This is the main change, which i) defines an ObliqueSplitRecord for keeping track of oblique splits, and ii) defines an ObliqueSplitter(Splitter) which gets oblique node splits and samples projection matrices, while also storing additional hyperparameters.
    • _classes.py: Defines new Python interfaces for the Oblique trees and forests

    Any other comments?

    I'm not an expert in Cython and C++ interplay, but I suspect that if we can "generalize" the Node struct to carry projection vector and weight information (not used in Forest-RI, i.e. axis-aligned random forests), then much of the tree and tree-building code is not even necessary. The only thing that differs at a fundamental level is the idea of a sample_proj_mat at each node of the tree, which samples sparse combinations of features; a NumPy sketch follows.
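
    A minimal NumPy sketch of the sample_proj_mat idea (illustrative, not the Cython implementation): each column is a sparse random combination of features, and the projected data supplies the candidate split dimensions at a node.

    import numpy as np

    def sample_proj_mat(n_features, n_projections, density=0.1, rng=None):
        # Sparse matrix of -1/+1 weights: each column mixes a few features.
        rng = rng or np.random.RandomState(0)
        mask = rng.rand(n_features, n_projections) < density
        signs = rng.choice([-1.0, 1.0], size=(n_features, n_projections))
        return mask * signs

    rng = np.random.RandomState(0)
    X = rng.rand(200, 50)         # samples reaching this node
    P = sample_proj_mat(50, 10, rng=rng)
    X_proj = X @ P                # candidate oblique split features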

    Another missing component currently is support for sparse data, but I presume this could easily be added.

    Code That Can Be Shortened

    New data structures and classes ObliqueSplitRecord, ObliqueSplitter, and ObliqueTree are defined. However, if the existing SplitRecord, Splitter, and Tree can be generalized, then the existing functions could be used to build oblique trees as well.

    That said, I'm not sure whether the scikit-learn devs would want that rather than just replicating some code across these two cases.

    opened by adam2392 1
  • Spectral Embedding with Asymmetric Matrices/ Directed Graphs

    Describe the workflow you want to enable

    Currently, sklearn.manifold.SpectralEmbedding is restricted to symmetric affinity matrices; if an asymmetric matrix is passed, it is converted into a symmetric one by sklearn.utils.validation.check_symmetric. In doing so, however, one loses the underlying asymmetries and the potential directional clusters present in the adjacency matrix of the directed-graph input.

    Describe your proposed solution

    The algorithms I propose adding use singular value decomposition, as opposed to eigendecomposition, and a modified Laplacian to perform spectral embedding on directed graphs/asymmetric matrices. Specifically, I would like to propose adding adjacency and Laplacian spectral embedding (ASE and LSE, respectively). My thought would be to add a new class, sklearn.manifold.DirectedSpectralEmbedding, which users may call directly, or which is dispatched to when an asymmetric matrix is passed to sklearn.manifold.SpectralEmbedding. Similarly to SpectralEmbedding, users would choose between ASE and LSE through an affinity parameter.

    Additional context

    These algorithms have been implemented in GraSPy (available as ASE and LSE), taking as input a graph represented as a dense or sparse matrix and returning the appropriate embedding.
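
    A minimal sketch of the SVD-based adjacency spectral embedding (ASE) idea on a directed graph (illustrative only; the proposed DirectedSpectralEmbedding class does not exist in scikit-learn):

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.RandomState(0)
    A = (rng.rand(100, 100) < 0.1).astype(float)  # asymmetric adjacency matrix

    # SVD handles asymmetry directly; eigendecomposition would require symmetry.
    U, s, Vt = svds(A, k=2)
    X_out = U * np.sqrt(s)    # embedding based on outgoing edges
    X_in = Vt.T * np.sqrt(s)  # embedding based on incoming edges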

    opened by asaadeldin11 0