scikit-learn: machine learning in Python

Overview

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.6)
  • NumPy (>= 1.13.3)
  • SciPy (>= 0.19.1)
  • joblib (>= 0.11)
  • threadpoolctl (>= 2.0.0)
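The minimum versions above can be checked programmatically. The following is a minimal sketch and not part of scikit-learn: `MINIMUM_VERSIONS`, `parse_version`, `meets_minimum`, and `check_dependencies` are hypothetical names. The parser only understands dotted numeric versions (it keeps the leading digits of each part), and the sketch assumes Python 3.8 or newer for `importlib.metadata`.

```python
import re
from importlib import metadata  # standard library, Python >= 3.8

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "numpy": "1.13.3",
    "scipy": "0.19.1",
    "joblib": "0.11",
    "threadpoolctl": "2.0.0",
}


def parse_version(text):
    """Turn '1.13.3' into (1, 13, 3), keeping the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first part that does not start with a digit.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that "0.11" and "0.11.0" compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_dependencies():
    """Map each dependency to True, False, or None (not installed)."""
    report = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(name, ok)
```

A value of `None` in the report means the package was not found in the current environment.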

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 0.23 and later require Python 3.6 or newer.

Scikit-learn plotting capabilities (i.e., functions whose names start with plot_ and classes whose names end with "Display") require Matplotlib (>= 2.1.1). Running the examples also requires Matplotlib >= 2.1.1. A few examples require scikit-image >= 0.13, a few require pandas >= 0.25.0, and some require seaborn >= 0.9.0.

User installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Important links

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 5.0.1 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/advanced_installation.html#testing for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
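How the test suite consumes this variable is not described here. As a general illustration of the pattern, a test helper might read the seed from the environment and fall back to a random one. `seed_from_env` is a hypothetical helper, not scikit-learn's implementation.

```python
import os
import random


def seed_from_env(name="SKLEARN_SEED", environ=None):
    """Return the integer seed stored in the environment, or draw a fresh one."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # No seed requested: pick one at random so that it can be reported.
        return random.randrange(2**31)
    return int(raw)


# A plain dict stands in for the process environment in this sketch.
seed = seed_from_env(environ={"SKLEARN_SEED": "42"})
rng_a = random.Random(seed)
rng_b = random.Random(seed)
# Two generators built from the same seed produce the same sequence.
assert [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]
```

In a POSIX shell, the variable can be set for a single run by prefixing the command, for example `SKLEARN_SEED=42 pytest sklearn`.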

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

Comments
  • MRG: AdaBoost for regression and multi-class classification

    This PR adds:

    • a new ensemble.weight_boosting module with AdaBoostRegressor (using AdaBoost.R2 [1]) and AdaBoostClassifier (using the multi-class SAMME algorithm [2])
    • a new "Gaussian quantiles" dataset in datasets.samples_generator as used in [2]
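For orientation, what distinguishes SAMME [2] from two-class AdaBoost is the weight given to each weak learner, which gains a log(K - 1) term for K classes. The sketch below is not code from this PR: `samme_learner_weight` is a hypothetical name, and the formula is the one given in [2].

```python
import math


def samme_learner_weight(error, n_classes):
    """Weight of a weak learner with weighted error `error` on K classes.

    alpha = log((1 - err) / err) + log(K - 1)
    """
    if not 0.0 < error < 1.0:
        raise ValueError("error must lie strictly between 0 and 1")
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    return math.log((1.0 - error) / error) + math.log(n_classes - 1)


# With two classes the extra term vanishes (log(1) == 0).
two_class = samme_learner_weight(0.25, n_classes=2)
# With three classes, an error of 0.5 still beats random guessing (2/3),
# so the learner keeps a positive weight.
three_class = samme_learner_weight(0.5, n_classes=3)
```

The weight only drops to zero when the error reaches (K - 1) / K, the error rate of random guessing among K classes.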

    Examples are provided: hastie, twoclass, multiclass, regression.

    [1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.314
    [2] http://www.stanford.edu/~hastie/Papers/samme.pdf

    New Feature 
    opened by ndawe 279
  • [MRG] GSoC 2014: Standard Extreme Learning Machines

    Finished implementing the standard extreme learning machines (ELMs). I am getting the following results with 550 hidden neurons on the digits dataset:

    Training accuracy using the logistic activation function: 0.999444
    Training accuracy using the tanh activation function: 1.000000

    Fortunately, this algorithm is much easier to implement and debug than a multi-layer perceptron :). I will push a test file soon.

    @ogrisel , @larsmans

    Waiting for Reviewer 
    opened by IssamLaradji 267
  • [MRG] Add experimental.ColumnTransformer

    Continuation of @amueller's PR https://github.com/scikit-learn/scikit-learn/pull/3886 (for now just rebased and updated for changes in sklearn)

    Fixes #2034.

    Closes #2034, closes #3886, closes #8540, closes #8539

    opened by jorisvandenbossche 245
  • Enh/tree (performance optimised)

    As @satra commented in https://github.com/scikit-learn/scikit-learn/pull/288#issuecomment-1691949, here is a separate PR for a decision tree. This version is significantly faster than the alternatives (Orange and Milk). Furthermore, Orange and scikits.learn support multiclass classification and regression, whereas Milk only supports classification.

    We would now welcome any comments with the aim of gaining acceptance for a merge into master.

    Performance and scores

    $ python bench_tree.py madelon
    Tree benchmarks
    Loading data ...
    Done, 2000 samples with 500 features loaded into memory
    scikits.learn (initial): mean 84.23, std 0.62
    Score: 0.76
    scikits.learn (now): mean 0.65, std 0.00
    Score: 0.76
    milk: mean 115.31, std 1.57
    Score: 0.75
    Orange: mean 25.82, std 0.02
    Score: 0.50

    $ python bench_tree.py arcene
    Tree benchmarks
    Loading data ...
    Done, 100 samples with 10000 features loaded into memory
    scikits.learn (initial): mean 40.95, std 0.44
    Score: 0.60
    scikits.learn (now): mean 0.20, std 0.00
    Score: 0.60
    milk: mean 71.00, std 0.60
    Score: 0.60
    Orange: mean 10.78, std 0.20
    Score: 0.51
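To put the benchmark figures in proportion, the ratios between the reported means can be computed directly. This is a small sketch that assumes the means are running times (the units are not stated in the output), so a larger ratio means a slower run relative to "scikits.learn (now)". `BENCHMARKS` and `speedups` are names chosen for this illustration.

```python
# Mean values copied from the two benchmark runs above.
BENCHMARKS = {
    "madelon": {"initial": 84.23, "now": 0.65, "milk": 115.31, "Orange": 25.82},
    "arcene": {"initial": 40.95, "now": 0.20, "milk": 71.00, "Orange": 10.78},
}


def speedups(means, baseline="now"):
    """Return how many times larger each mean is than the baseline mean."""
    base = means[baseline]
    return {name: value / base for name, value in means.items() if name != baseline}


for dataset, means in BENCHMARKS.items():
    ratios = speedups(means)
    summary = ", ".join(f"{name}: {ratio:.1f}x" for name, ratio in ratios.items())
    print(f"{dataset}: {summary}")
```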

    TODO before merge

    • increase test coverage to over 95%
    • finish the documentation (fix broken example and plot links, add practical usage tips)
    • demonstrate how to use a graphviz output in an example
    • include a static graphviz output for the iris and boston datasets in the documentation
    • add feature_names to GraphvizExporter
    • extract the graphviz exporter code out of the tree classes (use visitor pattern), assign node numbers (not mem addresses)
    • s/dimension/feature/g
    • add a test for the pickling of a fitted tree
    • cythonise prediction
    • explain in the documentation and in the docstrings how these classes relate to ID3, C4.5 and CART

    Future enhancements

    • ability to provide instance weights (for boosting DTs)
    • support a loss matrix (à la R's rpart)
    • support multivariate regression (à la R's mvpart)
    • support Randomized Trees
    opened by bdholt1 245
  • [MRG+1] Clustering algorithm - BIRCH

    Fixes https://github.com/scikit-learn/scikit-learn/issues/2690

    The design is similar to the Java code written at https://code.google.com/p/jbirch/, and I am pretty sure it works (if the Java implementation is correct, of course), since I get the same clusters in both cases. I opened this as a proof of concept.

    This example has been modified: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html

    When threshold is set to 3.0: figure_1

    When threshold is set to 1.0: figure_2

    TODO: A LOT

    • [x] Make it as fast as possible.
    • [x] Support for sparse matrices.
    • [x] Extensive testing.
    • [x] Make the common tests pass, for now I have added it to dont_test ;)
    • [x] Narrative documentation

    Awaiting some initial feedback!

    opened by MechCoder 232
  • Ensure that functions' docstrings pass numpydoc validation

    Background / Objective

    Docstrings in Python are string literals that occur as the first statement in a module, function, class, or method definition.

    These are some of the characteristics of a docstring:

    • Triple quotes are used to encompass the docstring text.
    • There is no blank line before or after the docstring.
    • The docstring is a phrase ending in a period.
    • more details

    numpydoc is one set of criteria to check for consistent documentation structure.
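As a concrete illustration of the characteristics listed above, here is a small function with a docstring laid out in the numpydoc style. `scale_values` is a made-up function; the section names follow the numpydoc convention, but whether it passes every numpydoc validation check has not been verified here.

```python
import inspect


def scale_values(values, factor=1.0):
    """Multiply every value by a constant factor.

    Parameters
    ----------
    values : list of float
        The numbers to scale.
    factor : float, default=1.0
        The multiplier applied to each value.

    Returns
    -------
    scaled : list of float
        The scaled numbers, in the original order.
    """
    return [value * factor for value in values]


# The docstring is stored on the function object and can be inspected.
doc = inspect.getdoc(scale_values)
summary_line = doc.splitlines()[0]
```

The first line is a one-sentence summary ending in a period, and each following section is introduced by a title underlined with dashes.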

    Validating docstrings in scikit-learn

    To ensure consistent documentation structure in scikit-learn, we are using numpydoc validation. Currently, documentation tests are failing for various functions. As a temporary fix, we have suppressed error messages in test_docstrings.py. Many of the functions in scikit-learn need to be updated to comply with numpydoc validation. Below, we provide step-by-step instructions on how contributors can test and update functions.

    Note

    For those who are running into "YD01: No Yields section found", the cause could be the cv parameter. Update "An iterable yielding (train, test) splits as arrays of indices." to:

            - An iterable that generates (train, test) splits as arrays of indices.
    

    Steps

    1. Make sure you have the development dependencies and documentation dependencies installed.
    2. Pick a function from the list below and leave a comment saying you are going to work on it. This way we can keep track of what everyone is working on.
       2.1 Make sure you've created a separate branch from main before editing files for your new contribution. Refer to our contributing guidelines for more information.
    3. Remove the function from the list at: https://github.com/scikit-learn/scikit-learn/blob/670133dbc42ebd9f79552984316bc2fcfd208e2e/sklearn/tests/test_docstrings.py#L14
    4. Let's say you picked sklearn._config.config_context; run numpydoc validation as follows:
    pytest sklearn/tests/test_docstrings.py -k sklearn._config.config_context

    5. If you see failing tests, please fix them by following the recommendations provided by the failing tests.
    6. If you see all the tests pass, you do not need to make any additional changes.
    7. Commit your changes.
    8. Open a Pull Request with an opening message "Addresses #21350". Note that each item should be submitted in a separate Pull Request.
    9. Include the function name in the title of the pull request. For example: "DOC Ensures that config_context passes numpydoc validation".
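The validation command and the pull request title in the steps above both follow from the function's qualified name. The helpers below are a convenience sketch only: `validation_command` and `pull_request_title` are hypothetical names, not part of scikit-learn's tooling, and `shlex.join` requires Python 3.8 or newer.

```python
import shlex

# Path of the test file used in the validation step.
TEST_FILE = "sklearn/tests/test_docstrings.py"


def validation_command(qualified_name):
    """Build the pytest invocation from the steps as an argument list."""
    return ["pytest", TEST_FILE, "-k", qualified_name]


def pull_request_title(qualified_name):
    """Build a title in the style of the example given in the steps."""
    short_name = qualified_name.rsplit(".", 1)[-1]
    return f"DOC Ensures that {short_name} passes numpydoc validation"


name = "sklearn._config.config_context"
command_line = shlex.join(validation_command(name))  # Python >= 3.8
title = pull_request_title(name)
```

The argument list returned by `validation_command` could also be passed to `subprocess.run` from the root of a scikit-learn checkout.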

    Note: once you have issued 3 such PRs, feel free to move on to contributing more complex pull requests that involve more thinking, and leave these issue fixes to first-time contributors so they can learn the GitHub contribution workflow :)

    Functions to Update

    • [x] sklearn._config.config_context #21426
    • [x] sklearn._config.get_config #21656
    • [x] sklearn.base.clone #21557
    • [x] sklearn.cluster._affinity_propagation.affinity_propagation #21778
    • [x] sklearn.cluster._agglomerative.linkage_tree #21424
    • [x] sklearn.cluster._kmeans.k_means #21423
    • [x] sklearn.cluster._kmeans.kmeans_plusplus #22200
    • [x] sklearn.cluster._mean_shift.estimate_bandwidth #21940
    • [x] sklearn.cluster._mean_shift.get_bin_seeds #22018
    • [x] sklearn.cluster._mean_shift.mean_shift #22019
    • [ ] sklearn.cluster._optics.cluster_optics_dbscan
    • [x] sklearn.cluster._optics.cluster_optics_xi #22202
    • [x] sklearn.cluster._optics.compute_optics_graph #22024 #22205
    • [x] sklearn.cluster._spectral.spectral_clustering #22025
    • [x] sklearn.compose._column_transformer.make_column_transformer #22183
    • [x] sklearn.covariance._empirical_covariance.empirical_covariance #21439
    • [x] sklearn.covariance._empirical_covariance.log_likelihood #21438
    • [x] sklearn.covariance._graph_lasso.graphical_lasso #22326
    • [x] sklearn.covariance._robust_covariance.fast_mcd #22331
    • [x] sklearn.covariance._shrunk_covariance.ledoit_wolf #22496 #22798 #22748
    • [x] sklearn.covariance._shrunk_covariance.ledoit_wolf_shrinkage #22798 #22748
    • [x] sklearn.covariance._shrunk_covariance.shrunk_covariance #22798 #22260
    • [x] sklearn.datasets._base.get_data_home #22259
    • [x] sklearn.datasets._base.load_boston #22247
    • [x] sklearn.datasets._base.load_breast_cancer #22346
    • [x] sklearn.datasets._base.load_diabetes #21526
    • [x] sklearn.datasets._base.load_digits #22392
    • [x] sklearn.datasets._base.load_files #21727
    • [x] sklearn.datasets._base.load_iris #21760
    • [x] sklearn.datasets._base.load_linnerud #22484
    • [x] sklearn.datasets._base.load_sample_image #22805
    • [x] sklearn.datasets._base.load_wine #22469
    • [x] sklearn.datasets._california_housing.fetch_california_housing #22882
    • [x] sklearn.datasets._covtype.fetch_covtype #22918
    • [x] sklearn.datasets._kddcup99.fetch_kddcup99 #23929
    • [x] sklearn.datasets._lfw.fetch_lfw_pairs #23655
    • [x] sklearn.datasets._lfw.fetch_lfw_people #24161
    • [x] sklearn.datasets._olivetti_faces.fetch_olivetti_faces #22480
    • [x] sklearn.datasets._openml.fetch_openml #22483
    • [x] sklearn.datasets._rcv1.fetch_rcv1 #22225
    • [x] sklearn.datasets._samples_generator.make_biclusters #22790
    • [x] sklearn.datasets._samples_generator.make_blobs #22342
    • [x] sklearn.datasets._samples_generator.make_checkerboard #22390
    • [x] sklearn.datasets._samples_generator.make_classification #22797
    • [x] sklearn.datasets._samples_generator.make_gaussian_quantiles #23996
    • [x] sklearn.datasets._samples_generator.make_hastie_10_2 #22333
    • [x] sklearn.datasets._samples_generator.make_multilabel_classification #22784 #22782
    • [x] sklearn.datasets._samples_generator.make_regression #22784
    • [x] sklearn.datasets._samples_generator.make_sparse_coded_signal #22817
    • [x] sklearn.datasets._samples_generator.make_sparse_spd_matrix #22332
    • [x] sklearn.datasets._samples_generator.make_spd_matrix #23974
    • [x] sklearn.datasets._species_distributions.fetch_species_distributions #24162
    • [x] sklearn.datasets._svmlight_format_io.dump_svmlight_file #23166
    • [x] sklearn.datasets._svmlight_format_io.load_svmlight_file #24163 #24164
    • [x] sklearn.datasets._svmlight_format_io.load_svmlight_files #24164
    • [x] sklearn.datasets._twenty_newsgroups.fetch_20newsgroups #22329
    • [x] sklearn.decomposition._dict_learning.dict_learning #24316 #24289 #22793
    • [x] sklearn.decomposition._dict_learning.dict_learning_online #24289
    • [x] sklearn.decomposition._dict_learning.sparse_encode #22793
    • [x] sklearn.decomposition._fastica.fastica #23094
    • [x] sklearn.decomposition._nmf.non_negative_factorization #24235
    • [x] sklearn.externals._packaging.version.parse #24447 #24567 #24461 #24320 #22817 #22793 #22332
    • [x] sklearn.feature_extraction.image.extract_patches_2d #23926
    • [x] sklearn.feature_extraction.image.grid_to_graph #23052
    • [x] sklearn.feature_extraction.image.img_to_graph #23398
    • [x] sklearn.feature_extraction.text.strip_accents_ascii #23250
    • [x] sklearn.feature_extraction.text.strip_accents_unicode #24232
    • [x] sklearn.feature_extraction.text.strip_tags #23248
    • [x] sklearn.feature_selection._univariate_selection.chi2 #23945 #23943 #23467
    • [ ] sklearn.feature_selection._univariate_selection.f_oneway
    • [x] sklearn.feature_selection._univariate_selection.r_regression #22785
    • [x] sklearn.inspection._partial_dependence.partial_dependence #24325 #24174
    • [x] sklearn.inspection._plot.partial_dependence.plot_partial_dependence #24325
    • [x] sklearn.isotonic.isotonic_regression #22475
    • [x] sklearn.linear_model._least_angle.lars_path #24319 #22500
    • [x] sklearn.linear_model._least_angle.lars_path_gram #24319
    • [x] sklearn.linear_model._omp.orthogonal_mp #24329 #22501
    • [x] sklearn.linear_model._omp.orthogonal_mp_gram #24329
    • [x] sklearn.linear_model._ridge.ridge_regression #22788
    • [x] sklearn.manifold._locally_linear.locally_linear_embedding #24330
    • [x] sklearn.manifold._t_sne.trustworthiness #24333
    • [x] sklearn.metrics._classification.accuracy_score #24259 #21478 #21441
    • [x] sklearn.metrics._classification.balanced_accuracy_score #21478
    • [x] sklearn.metrics._classification.brier_score_loss #23914
    • [x] sklearn.metrics._classification.classification_report #22803
    • [x] sklearn.metrics._classification.cohen_kappa_score #23915
    • [x] sklearn.metrics._classification.confusion_matrix #22842 #21496
    • [x] sklearn.metrics._classification.f1_score #22358
    • [x] sklearn.metrics._classification.fbeta_score #23486
    • [x] sklearn.metrics._classification.hamming_loss #21449
    • [x] sklearn.metrics._classification.hinge_loss #23387
    • [x] sklearn.metrics._classification.jaccard_score #23910
    • [x] sklearn.metrics._classification.log_loss #23657
    • [x] sklearn.metrics._classification.precision_recall_fscore_support #22472
    • [x] sklearn.metrics._classification.precision_score #23504 #22712 #21479
    • [x] sklearn.metrics._classification.recall_score #21495
    • [x] sklearn.metrics._classification.zero_one_loss #21450
    • [x] sklearn.metrics._plot.confusion_matrix.plot_confusion_matrix #22842
    • [x] sklearn.metrics._plot.det_curve.plot_det_curve #24334
    • [x] sklearn.metrics._plot.precision_recall_curve.plot_precision_recall_curve #24403
    • [x] sklearn.metrics._plot.roc_curve.plot_roc_curve #21547
    • [x] sklearn.metrics._ranking.auc #23505 #23433
    • [x] sklearn.metrics._ranking.average_precision_score #23504 #22712
    • [x] sklearn.metrics._ranking.coverage_error #24322
    • [x] sklearn.metrics._ranking.dcg_score #24351 #22400
    • [x] sklearn.metrics._ranking.label_ranking_average_precision_score #23504
    • [x] sklearn.metrics._ranking.label_ranking_loss #22781
    • [x] sklearn.metrics._ranking.ndcg_score #22400
    • [x] sklearn.metrics._ranking.precision_recall_curve #24403 #22514
    • [x] sklearn.metrics._ranking.roc_auc_score #23505
    • [x] sklearn.metrics._ranking.roc_curve #24351 #21547
    • [x] sklearn.metrics._ranking.top_k_accuracy_score #24259
    • [x] sklearn.metrics._regression.max_error #21420
    • [x] sklearn.metrics._regression.mean_absolute_error #21714
    • [x] sklearn.metrics._regression.mean_pinball_loss #24336
    • [x] sklearn.metrics._scorer.make_scorer #22367
    • [x] sklearn.metrics.cluster._bicluster.consensus_score #24343
    • [x] sklearn.metrics.cluster._supervised.adjusted_mutual_info_score #24344
    • [x] sklearn.metrics.cluster._supervised.adjusted_rand_score #24345
    • [x] sklearn.metrics.cluster._supervised.completeness_score #23016
    • [x] sklearn.metrics.cluster._supervised.entropy #24352
    • [x] sklearn.metrics.cluster._supervised.fowlkes_mallows_score #24352
    • [x] sklearn.metrics.cluster._supervised.homogeneity_completeness_v_measure #23942
    • [x] sklearn.metrics.cluster._supervised.homogeneity_score #23006
    • [x] sklearn.metrics.cluster._supervised.mutual_info_score #24344 #24093 #24091
    • [x] sklearn.metrics.cluster._supervised.normalized_mutual_info_score #24093
    • [x] sklearn.metrics.cluster._supervised.pair_confusion_matrix #24094
    • [x] sklearn.metrics.cluster._supervised.rand_score #24345 #24096
    • [x] sklearn.metrics.cluster._supervised.v_measure_score #24097
    • [x] sklearn.metrics.cluster._unsupervised.davies_bouldin_score #21850
    • [x] sklearn.metrics.cluster._unsupervised.silhouette_samples #21851
    • [x] sklearn.metrics.cluster._unsupervised.silhouette_score #21852
    • [x] sklearn.metrics.pairwise.additive_chi2_kernel #23943
    • [x] sklearn.metrics.pairwise.check_paired_arrays #23944
    • [x] sklearn.metrics.pairwise.check_pairwise_arrays #23519
    • [x] sklearn.metrics.pairwise.chi2_kernel #23945 #23943
    • [x] sklearn.metrics.pairwise.cosine_distances #23946 #22141
    • [x] sklearn.metrics.pairwise.cosine_similarity #23947
    • [x] sklearn.metrics.pairwise.distance_metrics #23949
    • [x] sklearn.metrics.pairwise.euclidean_distances #22783 #22140 #21429
    • [x] sklearn.metrics.pairwise.haversine_distances #23044
    • [x] sklearn.metrics.pairwise.kernel_metrics #23950
    • [x] sklearn.metrics.pairwise.laplacian_kernel #23005
    • [x] sklearn.metrics.pairwise.linear_kernel #21470
    • [x] sklearn.metrics.pairwise.manhattan_distances #23900 #22139
    • [x] sklearn.metrics.pairwise.nan_euclidean_distances #22140
    • [x] sklearn.metrics.pairwise.paired_cosine_distances #22141
    • [x] sklearn.metrics.pairwise.paired_distances #22380
    • [x] sklearn.metrics.pairwise.paired_euclidean_distances #22783
    • [x] sklearn.metrics.pairwise.paired_manhattan_distances #23900
    • [x] sklearn.metrics.pairwise.pairwise_distances_argmin #23951 #23952
    • [x] sklearn.metrics.pairwise.pairwise_distances_argmin_min #23952
    • [x] sklearn.metrics.pairwise.pairwise_distances_chunked #24527
    • [ ] sklearn.metrics.pairwise.pairwise_kernels
    • [x] sklearn.metrics.pairwise.polynomial_kernel #23953
    • [x] sklearn.metrics.pairwise.rbf_kernel #23954
    • [x] sklearn.metrics.pairwise.sigmoid_kernel #23955
    • [x] sklearn.model_selection._split.check_cv #22778
    • [x] sklearn.model_selection._split.train_test_split #21435
    • [x] sklearn.model_selection._validation.cross_val_predict #21433
    • [x] sklearn.model_selection._validation.cross_val_score #21464
    • [x] sklearn.model_selection._validation.cross_validate #23145
    • [x] sklearn.model_selection._validation.learning_curve #23911
    • [x] sklearn.model_selection._validation.permutation_test_score #23912
    • [x] sklearn.model_selection._validation.validation_curve #23913
    • [x] sklearn.neighbors._graph.kneighbors_graph #22459
    • [x] sklearn.neighbors._graph.radius_neighbors_graph #22462
    • [x] sklearn.pipeline.make_union #23909
    • [x] sklearn.preprocessing._data.binarize #24002 #22801
    • [x] sklearn.preprocessing._data.maxabs_scale #24359
    • [x] sklearn.preprocessing._data.normalize #24093 #23188 #22795
    • [x] sklearn.preprocessing._data.power_transform #22802
    • [x] sklearn.preprocessing._data.quantile_transform #22780
    • [x] sklearn.preprocessing._data.robust_scale #23908
    • [x] sklearn.preprocessing._data.scale #24362 #24359 #23908
    • [x] sklearn.preprocessing._label.label_binarize #24002
    • [x] sklearn.random_projection.johnson_lindenstrauss_min_dim #24003
    • [x] sklearn.svm._bounds.l1_min_c #24134
    • [ ] sklearn.tree._export.plot_tree
    • [x] sklearn.utils.axis0_safe_slice #24561
    • [x] sklearn.utils.check_pandas_support #21566
    • [x] sklearn.utils.extmath.cartesian #21513
    • [x] sklearn.utils.extmath.density #24516
    • [x] sklearn.utils.extmath.fast_logdet #24605
    • [x] sklearn.utils.extmath.randomized_range_finder #22069
    • [x] sklearn.utils.extmath.randomized_svd #24607
    • [x] sklearn.utils.extmath.safe_sparse_dot #24567
    • [x] sklearn.utils.extmath.squared_norm #24360
    • [x] sklearn.utils.extmath.stable_cumsum #24348
    • [x] sklearn.utils.extmath.svd_flip #24581
    • [x] sklearn.utils.extmath.weighted_mode #24571
    • [ ] sklearn.utils.fixes.delayed
    • [x] sklearn.utils.fixes.linspace #24582
    • [ ] sklearn.utils.fixes.threadpool_info
    • [ ] sklearn.utils.fixes.threadpool_limits
    • [x] sklearn.utils.gen_batches #24609
    • [x] sklearn.utils.gen_even_slices #24608
    • [x] sklearn.utils.get_chunk_n_rows #22539
    • [ ] sklearn.utils.graph.graph_shortest_path
    • [x] sklearn.utils.graph.single_source_shortest_path_length #24474
    • [x] sklearn.utils.is_scalar_nan #24562
    • [x] sklearn.utils.metaestimators.available_if #24586
    • [x] sklearn.utils.metaestimators.if_delegate_has_method #24633
    • [x] sklearn.utils.multiclass.check_classification_targets #22793
    • [x] sklearn.utils.multiclass.class_distribution #24452
    • [x] sklearn.utils.multiclass.type_of_target #24463
    • [x] sklearn.utils.multiclass.unique_labels #24476
    • [x] sklearn.utils.resample #23916
    • [x] sklearn.utils.safe_mask #24425
    • [x] sklearn.utils.safe_sqr #24437
    • [x] sklearn.utils.shuffle #24367
    • [x] sklearn.utils.sparsefuncs.count_nonzero #24447
    • [x] sklearn.utils.sparsefuncs.csc_median_axis_0 #24461
    • [x] sklearn.utils.sparsefuncs.incr_mean_variance_axis #24477
    • [x] sklearn.utils.sparsefuncs.inplace_swap_column #23476
    • [x] sklearn.utils.sparsefuncs.inplace_swap_row #24518 #24513 #24178
    • [x] sklearn.utils.sparsefuncs.inplace_swap_row_csc #24513
    • [x] sklearn.utils.sparsefuncs.inplace_swap_row_csr #24518
    • [x] sklearn.utils.sparsefuncs.mean_variance_axis #24477 #24177
    • [x] sklearn.utils.sparsefuncs.min_max_axis #22839
    • [x] sklearn.utils.tosequence #22494
    • [x] sklearn.utils.validation.as_float_array #21502
    • [x] sklearn.utils.validation.assert_all_finite #22470
    • [x] sklearn.utils.validation.check_is_fitted #24454
    • [x] sklearn.utils.validation.check_memory #23039
    • [x] sklearn.utils.validation.check_random_state #23320 #22787
    • [x] sklearn.utils.validation.column_or_1d #21591
    • [x] sklearn.utils.validation.has_fit_parameter #21590
    • [x] sklearn.utils.validation.indexable #21431
    Documentation Sprint good first issue Meta-issue 
    opened by thomasjpfan 215
  • Ensure that docstrings pass numpydoc validation

    1. Make sure you have the development dependencies and documentation dependencies installed.
    2. Pick an estimator from the list below and leave a comment saying you are going to work on it. This way we can keep track of what everyone is working on.
    3. Remove the estimator from the list at: https://github.com/scikit-learn/scikit-learn/blob/bb6117b228e2940cada2627dce86b49d0662220c/maint_tools/test_docstrings.py#L11
    4. Let's say you picked StandardScaler; run numpydoc validation as follows (adding the - at the end helps with the regex):
    pytest maint_tools/test_docstrings.py -k StandardScaler-

    5. If you see failing tests, please fix them by following the recommendations provided by the failing tests.
    6. If you see all the tests pass, you do not need to make any additional changes.
    7. Commit your changes.
    8. Open a Pull Request with an opening message "Addresses #20308". Note that each item should be submitted in a separate Pull Request.
    9. Include the estimator name in the title of the pull request. For example: "DOC Ensures that StandardScaler passes numpydoc validation".
    • [x] #20381 ARDRegression
    • [x] #20374 AdaBoostClassifier
    • [x] #20400 AdaBoostRegressor
    • [x] #20536 AdditiveChi2Sampler
    • [x] #20532 AffinityPropagation
    • [x] #20544 AgglomerativeClustering
    • [x] #20407 BaggingClassifier
    • [x] #20498 BaggingRegressor
    • [x] #20384 BayesianGaussianMixture
    • [x] #20389 BayesianRidge
    • [x] BernoulliNB
    • [x] #20533 BernoulliRBM
    • [x] #20422 Binarizer
    • [x] Birch
    • [x] #20504 CCA
    • [x] CalibratedClassifierCV
    • [x] #20445 CategoricalNB
    • [x] ClassifierChain
    • [x] ColumnTransformer
    • [x] #20440 ComplementNB
    • [x] #20403 CountVectorizer
    • [x] #20375 DBSCAN
    • [x] #20399 DecisionTreeClassifier
    • [x] DecisionTreeRegressor
    • [x] DictVectorizer
    • [x] DictionaryLearning
    • [x] DummyClassifier
    • [x] #20394 DummyRegressor
    • [x] #20454 ElasticNet
    • [x] ElasticNetCV
    • [x] #20548 EllipticEnvelope
    • [x] #20551 EmpiricalCovariance
    • [x] ExtraTreeClassifier
    • [x] ExtraTreeRegressor
    • [x] ExtraTreesClassifier
    • [x] ExtraTreesRegressor
    • [x] FactorAnalysis
    • [x] #20405 FastICA
    • [x] FeatureAgglomeration
    • [x] FeatureHasher
    • [x] FeatureUnion
    • [x] FunctionTransformer
    • [x] GammaRegressor
    • [x] GaussianMixture
    • [x] #20440 GaussianNB
    • [x] GaussianProcessClassifier
    • [x] GaussianProcessRegressor
    • [x] GaussianRandomProjection
    • [x] #20495 GenericUnivariateSelect
    • [x] GradientBoostingClassifier
    • [x] GradientBoostingRegressor
    • [x] #20527 GraphicalLasso
    • [x] #20546 GraphicalLassoCV
    • [x] GridSearchCV
    • [x] HalvingGridSearchCV
    • [x] HalvingRandomSearchCV
    • [x] HashingVectorizer
    • [x] HistGradientBoostingClassifier
    • [x] HistGradientBoostingRegressor
    • [x] HuberRegressor
    • [x] IncrementalPCA
    • [x] #20437 IsolationForest
    • [x] Isomap
    • [x] #20514 IsotonicRegression
    • [x] IterativeImputer
    • [x] KBinsDiscretizer
    • [x] #20377 KMeans
    • [x] KNNImputer
    • [x] #20373 KNeighborsClassifier
    • [x] #20378 KNeighborsRegressor
    • [x] KNeighborsTransformer
    • [x] KernelCenterer
    • [x] KernelDensity
    • [x] KernelPCA
    • [x] KernelRidge
    • [x] LabelBinarizer
    • [x] #20456 LabelEncoder
    • [x] LabelPropagation
    • [x] LabelSpreading
    • [x] #20472 Lars
    • [x] #20501 LarsCV
    • [x] #20409 Lasso
    • [x] #20453 LassoCV
    • [x] #20459 LassoLars
    • [x] #20462 LassoLarsCV
    • [x] #20465 LassoLarsIC
    • [x] #20402 LatentDirichletAllocation
    • [x] #20578 LedoitWolf
    • [x] LinearDiscriminantAnalysis
    • [x] #20369 LinearRegression
    • [x] #20458 LinearSVC
    • [x] LinearSVR
    • [x] LocalOutlierFactor
    • [x] LocallyLinearEmbedding
    • [x] #20370 LogisticRegression
    • [x] #20376 LogisticRegressionCV
    • [x] MDS
    • [x] #20444 MLPClassifier
    • [x] MLPRegressor
    • [x] #20455 MaxAbsScaler
    • [x] MeanShift
    • [x] #20580 MinCovDet
    • [x] MinMaxScaler
    • [x] MiniBatchDictionaryLearning
    • [x] MiniBatchKMeans
    • [x] MiniBatchSparsePCA
    • [x] MissingIndicator
    • [x] MultiLabelBinarizer
    • [x] MultiOutputClassifier
    • [x] MultiOutputRegressor
    • [x] MultiTaskElasticNet
    • [x] MultiTaskElasticNetCV
    • [x] MultiTaskLasso
    • [x] MultiTaskLassoCV
    • [x] #20440 MultinomialNB
    • [x] NMF
    • [x] NearestCentroid
    • [x] #20446 NearestNeighbors
    • [x] NeighborhoodComponentsAnalysis
    • [x] Normalizer
    • [x] #20461 NuSVC
    • [x] NuSVR
    • [x] Nystroem
    • [x] #20579 OAS
    • [x] OPTICS
    • [x] #20463 OneClassSVM
    • [x] #20406 OneHotEncoder
    • [x] OneVsOneClassifier
    • [x] OneVsRestClassifier
    • [x] OrdinalEncoder
    • [x] OrthogonalMatchingPursuit
    • [x] OrthogonalMatchingPursuitCV
    • [x] OutputCodeClassifier
    • [x] PCA
    • [x] PLSCanonical
    • [x] PLSRegression
    • [x] PLSSVD
    • [x] PassiveAggressiveClassifier
    • [x] PassiveAggressiveRegressor
    • [x] PatchExtractor
    • [x] #20404 Perceptron
    • [x] Pipeline
    • [x] #20386 PoissonRegressor
    • [x] PolynomialCountSketch
    • [x] PolynomialFeatures
    • [x] PowerTransformer
    • [x] QuadraticDiscriminantAnalysis
    • [x] QuantileRegressor
    • [x] QuantileTransformer
    • [x] RANSACRegressor
    • [x] RBFSampler
    • [x] #20419 RFE
    • [x] #20452 RFECV
    • [x] RadiusNeighborsClassifier
    • [x] RadiusNeighborsRegressor
    • [x] RadiusNeighborsTransformer
    • [x] #20383 RandomForestClassifier
    • [x] RandomForestRegressor
    • [x] RandomTreesEmbedding
    • [x] RandomizedSearchCV
    • [x] RegressorChain
    • [x] #20499 Ridge
    • [x] #20503 RidgeCV
    • [x] RidgeClassifier
    • [x] RidgeClassifierCV
    • [x] RobustScaler
    • [x] SGDOneClassSVM
    • [x] SGDRegressor
    • [x] #20457 SVC
    • [x] SVR
    • [x] SelectFdr
    • [x] SelectFpr
    • [x] SelectFromModel
    • [x] SelectFwe
    • [x] SelectKBest
    • [x] SelectPercentile
    • [x] #21277 SelfTrainingClassifier
    • [x] SequentialFeatureSelector
    • [x] #20571 ShrunkCovariance
    • [x] SimpleImputer
    • [x] SkewedChi2Sampler
    • [x] SparseCoder
    • [x] #20395 SparsePCA
    • [x] SparseRandomProjection
    • [x] SpectralBiclustering
    • [x] SpectralClustering
    • [x] #21463 SpectralCoclustering
    • [x] SpectralEmbedding
    • [x] SplineTransformer
    • [x] StackingClassifier
    • [x] StackingRegressor
    • [x] #20368 StandardScaler
    • [x] TSNE
    • [x] #20379 TfidfVectorizer
    • [x] TheilSenRegressor
    • [x] TransformedTargetRegressor
    • [x] TruncatedSVD
    • [x] TweedieRegressor
    • [x] VarianceThreshold
    • [x] VotingClassifier
    • [x] #20450 VotingRegressor
    Documentation Sprint good first issue 
    opened by thomasjpfan 212
  • [MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection

    fixes #1848, #2904

    Things to be done after this PR - Issue at #5053 ( PR at #5569 )


    TODO

    • [x] Make all cross-validators data-independent (Don't initialize cross-validators with data/data-dependent parameters)
    • [x] Reorganize the classes / functions into the model_selection module.
    • [x] Remove all deprecation from new style classes.
    • [x] Use split in all the files that use cv ? (or fix check_cv so both old style and new style classes can be used)
    • [x] Refactor the tests.
    • [x] Fix all the imports
    • [x] Make all the general tests pass
    • [x] Make all the model_selection tests pass
    • [x] Conclude all discussions into workable solutions/fixes.
    • [x] Clean up the old style classes to use as much from the new members as possible. (duplicate as little code as possible) - https://github.com/rvraghav93/scikit-learn/pull/2 - We don't want this in until this PR gets to master! - (See #5568)
    • [x] Fix the examples - ~~https://github.com/rvraghav93/scikit-learn/pull/3~~ merged into https://github.com/rvraghav93/scikit-learn/pull/4!
    • [x] Clean up the documentation - https://github.com/rvraghav93/scikit-learn/pull/4
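
    To illustrate the first TODO item, here is a minimal sketch of what a data-independent cross-validator looks like from the user's side. It assumes the sklearn.model_selection API that this PR introduces; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# The splitter is configured without any reference to the data ...
cv = KFold(n_splits=3)

# ... so the same object can be reused on datasets of different sizes.
X_small = np.arange(12).reshape(6, 2)   # 6 samples (illustrative data)
X_large = np.arange(40).reshape(20, 2)  # 20 samples (illustrative data)

# The data only enters when split() is called.
sizes_small = [len(test) for _, test in cv.split(X_small)]
sizes_large = [len(test) for _, test in cv.split(X_large)]
print(sizes_small, sizes_large)
```

    The old-style classes needed the number of samples or the labels at construction time, so one splitter object could not be shared across datasets like this.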

    MINOR

    • [x] ~~Rename p to a better name.~~ Moved to #5053
    • [x] Rename _check_is_partition-->_check_is_permutation?
    • [x] Remove _empty_mask
    • [x] As Joel said here, use binomial coefficient instead of factorial.
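
    The last item can be sketched as follows, assuming it refers to counting the splits of a leave-p-out scheme (that context is an assumption, and both helper names are hypothetical). The binomial coefficient gives the count directly, without building the large intermediate factorials.

```python
import math


def n_splits_binomial(n_samples: int, p: int) -> int:
    """Number of distinct test sets of size p among n_samples (hypothetical helper)."""
    return math.comb(n_samples, p)


def n_splits_factorial(n_samples: int, p: int) -> int:
    """Same count written with explicit factorials, shown only for comparison."""
    return math.factorial(n_samples) // (
        math.factorial(p) * math.factorial(n_samples - p)
    )


print(n_splits_binomial(5, 2))
```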

    Open Discussions

    • [x] Order of labels arg - Refer discussion here https://github.com/scikit-learn/scikit-learn/pull/4294#discussion_r34417412 and here https://github.com/scikit-learn/scikit-learn/pull/4294#discussion_r34417532 - labels added to the last.
    • [x] Whether we would want to reshuffle the data at each split call - Refer https://github.com/scikit-learn/scikit-learn/pull/4294#issuecomment-127081590 - Attempted to be fixed by commit "ENH+TST make rng to be generated at every split call for reproducibility" - Previously (at the time of @amueller's review) successive split calls generated different splits since rng was generated only once at the __init__... - Now successive split calls return similar results (when random_state is set)
    • [x] Making the submodules (validation et al) private? - Refer https://github.com/scikit-learn/scikit-learn/pull/4294#issuecomment-127116408 - Made private with 2 votes from @vene, @amueller
    • [x] Can we safely pass labels to the inner cv in permutation_test_score - Refer https://github.com/scikit-learn/scikit-learn/pull/4294#issuecomment-117219334 - ping @agramfort - For now yes. - (https://github.com/scikit-learn/scikit-learn/pull/4294#issuecomment-127370265)
    • [x] Make CVIteratorWrapper private? - Refer https://github.com/scikit-learn/scikit-learn/pull/4294#discussion_r35801766 - Made private
    • [x] Deprecation window - Should we keep the old code till 1.0 or 0.18 or ?? - https://github.com/scikit-learn/scikit-learn/pull/4294#issuecomment-126086788 - https://github.com/scikit-learn/scikit-learn/pull/4294#issuecomment-127374819 - Have updated to 0.19
    • [x] https://github.com/scikit-learn/scikit-learn/pull/4294/files#r35808353 - Waiting for reply
    • [x] Better way to test working of shuffle in (Stratified)KFold - Refer https://github.com/scikit-learn/scikit-learn/pull/4294#discussion_r33640279
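
    The reshuffling discussion above can be checked with a short sketch, assuming the sklearn.model_selection API and an integer random_state (the data is illustrative). Because the random number generator is derived from random_state inside each split call, two successive calls are expected to yield the same folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # illustrative data
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Collect the test indices produced by two successive split() calls.
first = [test.tolist() for _, test in cv.split(X)]
second = [test.tolist() for _, test in cv.split(X)]

print(first == second)
```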

    NOTE: The current implementation will still not support nesting EstimatorCV inside GridSearchCV... This will become possible only after allowing sample_properties to pass from GridSearchCV to EstimatorCV...


    PRs whose changes to grid_search / cross_validation / learning_curve have been manually incorporated into this:

    • #4714 - Svm decision function shape - 1 commit
    • #4829 - merge _check_cv into check_cv ... - 1 commit
    • #4857 - Document missing attribute random_state in RandomizedSearchCV - 1 commit
    • #4840 - FIX avoid memory cost when sampling from large parameter grids - 1 commit
    • #5194 (Refer #5238) - Consistent CV docstring
    • #5161 - check for sparse pred in cross_val_predict
    • #5201 - clarify random state in KFold
    • #5190 - LabelKFold
    • #4583 - LabelShuffleSplit
    • #5283 - Remove some warnings in grid search tests
    • #5300 - shuffle labels not idxs and tests to ensure it.


    This PR is slightly based upon @pignacio's work in #3340.


    @amueller's hack: if you want to align diffs you can do this (in ipython notebook)

    import inspect
    import difflib
    from IPython.display import HTML
    
    def show_func_diff(func_a, func_b):
        return HTML(difflib.HtmlDiff().make_file(inspect.getsourcelines(func_a)[0], inspect.getsourcelines(func_b)[0]))
    
    from sklearn.cross_validation import cross_val_score as cross_val_score_old
    from sklearn.model_selection import cross_val_score
    
    show_func_diff(cross_val_score, cross_val_score_old)
    
    opened by raghavrv 207
  • [MRG] Multi-layer perceptron (MLP)

    Multi-layer perceptron (MLP)

    PR closed in favor of #3204

    [Image: mlp]

    This is an extension of larsmans' code.

    A multilayer perceptron (MLP) is a feedforward artificial neural network model that tries to learn a function f(X)=y, where X is the input and y is the output. An MLP consists of an input layer, one or more hidden layers (usually one), and an output layer, with each layer fully connected to the next. It is a classic algorithm that has been used extensively in neural network research.
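
    As a rough illustration of the layer structure described above, here is a forward pass through one hidden layer written with NumPy. This is a sketch and not the code of this PR: the function name, the layer sizes, the tanh activation and the random weights are all assumptions chosen for the example.

```python
import numpy as np


def mlp_forward(X, W_hidden, b_hidden, W_out, b_out):
    """Forward pass: input -> fully connected tanh hidden layer -> linear output."""
    hidden = np.tanh(X @ W_hidden + b_hidden)
    return hidden @ W_out + b_out


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 samples, 3 input features
W_hidden = rng.normal(size=(3, 5))   # input layer -> 5 hidden units
b_hidden = np.zeros(5)
W_out = rng.normal(size=(5, 2))      # hidden layer -> 2 outputs
b_out = np.zeros(2)

y_hat = mlp_forward(X, W_hidden, b_hidden, W_out, b_out)
print(y_hat.shape)
```

    Training then amounts to adjusting the weight matrices so that y_hat approaches y, for example with the sgd or l-bfgs solvers mentioned in the benchmark.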

    Code Check out :

    1. git clone https://github.com/scikit-learn/scikit-learn
    2. cd scikit-learn/
    3. git fetch origin refs/pull/2120/head:mlp
    4. git checkout mlp

    Tutorial link:

    - http://easymachinelearning.blogspot.com/p/multi-layer-perceptron-tutorial.html

    Sample Benchmark:

    - `MLP` on the scikit's `Digits` dataset gives:
      - Score for `tanh-based sgd`: 0.981
      - Score for `logistic-based sgd`: 0.987
      - Score for `tanh-based l-bfgs`: 0.994
      - Score for `logistic-based l-bfgs`: 1.000

    TODO:

    - Review
    opened by IssamLaradji 207
  • [MRG+2] Basic version of MICE Imputation

    Reference Issue

    This is in reference to #7840, and builds on #7838.

    Fixes #7840.

    This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf
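
    For readers unfamiliar with the idea, the chained-equations loop at the core of MICE can be sketched as below. This is a simplified illustration and not the implementation in this PR: it swaps Bayesian linear regression for ordinary least squares, produces a single deterministic imputation instead of multiple draws, and the function name is hypothetical.

```python
import numpy as np


def chained_regression_impute(X, n_rounds=5):
    """Fill NaNs by regressing each incomplete column on all the others, round-robin."""
    X = np.array(X, dtype=float)  # work on a copy
    missing = np.isnan(X)
    # Start from column means so that every regression has complete inputs.
    col_means = np.nanmean(X, axis=0)
    X[missing] = col_means[np.where(missing)[1]]
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Design matrix: intercept plus every column except j.
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            # Fit on rows where column j was observed, predict where it was missing.
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ coef
    return X


# Second column is twice the first; the last entry is missing.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
X_imputed = chained_regression_impute(X)
print(X_imputed[3, 1])
```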

    opened by sergeyf 203
  • [MRG] add lobpcg svd_solver to PCA and TruncatedSVD

    Reference Issues/PRs

    fixes #12079, fixes #12080

    What does this implement/fix? Explain your changes.

    #12079 adds LOBPCG as an SVD solver in PCA; #12080 adds the LOBPCG solver to TruncatedSVD.

    lobpcg_svd should also be useful in KernelPCA for faster partial decompositions, see #12068

    This PR also includes multiple LOBPCG related bug fixes, including vendoring sklearn/externals/_lobpcg.py from scipy 1.3.0

    Any other comments?

    @ogrisel Transferred from permanently closed PR #12291

    Keep in mind when testing that lobpcg_svd falls back to a dense eigensolver unless n_components < 3*matrix_size, where matrix_size = min(n_samples, n_features).

    Still to do, better in new focused PRs after this one is merged

    1. example plot_faces_decomposition may include lobpcg_svd, just change

      ('Eigenfaces - PCA using randomized SVD', decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True), True),

    to

    ('Eigenfaces - PCA using randomized SVD',
     decomposition.PCA(n_components=n_components, svd_solver='lobpcg',
                       whiten=True),
     True),
    

    but lobpcg currently fails here for unclear numerical reasons. More testing may be needed for float32 data, like in this example.

    2. All four existing TruncatedSVD examples in scikit-learn's examples/ folder do run with lobpcg, just by adding the option algorithm='lobpcg' to the TruncatedSVD call. But none generates a matrix large enough to demonstrate the practical benefits of lobpcg_svd.
    Needs Decision module:decomposition 
    opened by lobpcg 174
  • ENH Adds Target Regression Encoder

    Reference Issues/PRs

    Closes https://github.com/scikit-learn/scikit-learn/pull/5853
    Closes https://github.com/scikit-learn/scikit-learn/pull/9614
    Supersedes https://github.com/scikit-learn/scikit-learn/pull/17323

    What does this implement/fix? Explain your changes.

    This PR implements a target encoder which uses CV during fit_transform to prevent the target from leaking. transform uses the target encoding learned from all the training data. This means that fit_transform() != fit().transform().

    The implementation uses Cython to learn the encoding, which provides a 10x speed-up compared to a pure Python+NumPy approach. Cython is required because many encodings are learned during cross-validation in fit_transform.
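
    The difference between the two code paths can be illustrated with a small NumPy sketch. It is not the Cython implementation from this PR: it uses plain per-category means without any smoothing, a fixed fold assignment, and hypothetical helper names, purely to show why the cross-fitted encoding differs from the one learned on all the data.

```python
import numpy as np


def target_means(categories, y):
    """Map each category to the mean target of the rows that carry it."""
    return {c: y[categories == c].mean() for c in np.unique(categories)}


def encode(categories, means, default):
    """Replace each category with its learned mean (default for unseen ones)."""
    return np.array([means.get(c, default) for c in categories])


def cross_fitted_encode(categories, y, n_folds=2):
    """Encode each row using means learned on the other folds only."""
    encoded = np.empty(len(y), dtype=float)
    folds = np.arange(len(y)) % n_folds  # simple deterministic fold assignment
    for k in range(n_folds):
        train, test = folds != k, folds == k
        means = target_means(categories[train], y[train])
        encoded[test] = encode(categories[test], means, default=y[train].mean())
    return encoded


categories = np.array(["a", "a", "b", "b", "a", "b"])
y = np.array([1.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# Analogue of fit().transform(): every row sees its own target.
full = encode(categories, target_means(categories, y), default=y.mean())
# Analogue of fit_transform(): each row is encoded without its own target.
cross = cross_fitted_encode(categories, y)
print(full, cross)
```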

    Any other comments?

    The implementation uses the same scheme as cuML's TargetEncoder, which they used to win Recsys2020.

    module:preprocessing cython 
    opened by thomasjpfan 0
  • Read only buffer in cross_val_score with sparse matrix.

    Describe the bug

    When calling cross_val_score with a sparse data matrix X and a RandomForestClassifier with n_jobs=-1, an interaction between joblib and memmapping makes the buffer of X read-only, which breaks the Cython code for tree construction. Oddly, this only happens through the cross-validation function and not when fitting the classifier alone, even though the cross-validation function itself runs with n_jobs=1, so joblib should not come into play there.

    Steps/Code to Reproduce

    from scipy.sparse import csr_matrix
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    
    X, y = make_classification(10000, n_features=200)
    X = csr_matrix(X, copy=True)
    
    clf = RandomForestClassifier(n_jobs=-1)
    
    cross_val_score(clf, X, y)
    

    Expected Results

    Working code

    Actual Results

    ValueError: 
    All the 5 fits failed.
    It is very likely that your model is misconfigured.
    You can try to debug the error by setting error_score='raise'.
    
    Below are more details about the failures:
    --------------------------------------------------------------------------------
    5 fits failed with the following error:
    joblib.externals.loky.process_executor._RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
        r = call_item()
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
        return self.fn(*self.args, **self.kwargs)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 620, in __call__
        return self.func(*args, **kwargs)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/parallel.py", line 288, in __call__
        return [func(*args, **kwargs)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/parallel.py", line 288, in <listcomp>
        return [func(*args, **kwargs)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/sklearn/utils/fixes.py", line 117, in __call__
        return self.function(*args, **kwargs)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 185, in _parallel_build_trees
        tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/sklearn/tree/_classes.py", line 889, in fit
        super().fit(
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/sklearn/tree/_classes.py", line 379, in fit
        builder.build(self.tree_, X, y, sample_weight)
      File "sklearn/tree/_tree.pyx", line 147, in sklearn.tree._tree.DepthFirstTreeBuilder.build
      File "sklearn/tree/_tree.pyx", line 173, in sklearn.tree._tree.DepthFirstTreeBuilder.build
      File "sklearn/tree/_splitter.pyx", line 789, in sklearn.tree._splitter.BaseSparseSplitter.init
      File "stringsource", line 660, in View.MemoryView.memoryview_cwrapper
      File "stringsource", line 350, in View.MemoryView.memoryview.__cinit__
    ValueError: buffer source array is read-only
    """
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 474, in fit
        trees = Parallel(
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/parallel.py", line 1098, in __call__
        self.retrieve()
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/parallel.py", line 975, in retrieve
        self._output.extend(job.get(timeout=self.timeout))
      File "/home/temp/.local/miniconda/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
        return future.result(timeout=timeout)
      File "/home/temp/.local/miniconda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
        return self.__get_result()
      File "/home/temp/.local/miniconda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
        raise self._exception
    ValueError: buffer source array is read-only
    

    Versions

    System:
        python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
    executable: /home/temp/.local/miniconda/bin/python3.10
       machine: Linux-5.14.0-1054-oem-x86_64-with-glibc2.31
    
    Python dependencies:
          sklearn: 1.2.0
              pip: 22.3
       setuptools: 65.5.0
            numpy: 1.23.5
            scipy: 1.9.3
           Cython: 0.29.32
           pandas: 1.5.2
       matplotlib: 3.6.2
           joblib: 1.2.0
    threadpoolctl: 3.1.0
    
    Built with OpenMP: True
    
    threadpoolctl info:
           user_api: blas
       internal_api: mkl
             prefix: libmkl_rt
           filepath: /home/temp/.local/miniconda/lib/libmkl_rt.so.2
            version: 2022.1-Product
    threading_layer: intel
        num_threads: 4
    
           user_api: openmp
       internal_api: openmp
             prefix: libomp
           filepath: /home/temp/.local/miniconda/lib/libomp.so
            version: None
        num_threads: 8
    
    Bug Needs Triage 
    opened by tomMoral 0
  • scikit-learn.com domain

    I recently came into possession of http://scikit-learn.com and since I am very generous, I want to donate this domain to the scikit-learn team free of charge.

    I think it would be easiest if an important scikit-learn person made an account at https://www.namecheap.com because domain transfer within the namecheap platform is much easier than between different platforms. From there, you can then transfer the domain somewhere else if you wish.

    Once you have decided on who is to receive the domain, you can comment here with the corresponding namecheap username and I will then transfer ownership of the domain to this account.

    It is possible that a random person will try to impersonate a scikit-learn author to steal the domain. To prevent that, a few important people (not quite sure who is important, perhaps people listed at https://scikit-learn.org/stable/about.html ?) should comment here to confirm that the chosen recipient is genuine and that there is some agreement within the scikit-learn community that this person should indeed receive the domain.

    For private communication, you can mail me at

    import base64
    print(base64.b64decode("dGhvbWFzLmdlcm1lckBoaHUuZGU=").decode("ascii"))
    

    but for the reasons mentioned above, I will not transfer the domain to random people just based on mails.

    Needs Triage 
    opened by 99991 0
  • DOC fix typo in euclidean_distances in `metrics/pairwise.py`

    https://github.com/scikit-learn/scikit-learn/blob/9e08ed2279c80407f1d4c92a27279f73a2d08bb2/sklearn/metrics/pairwise.py#L280

    Remove the trailing "s" from "betweens".

    Change "Distances betweens pairs of elements of X and Y." to "Distances between pairs of elements of X and Y."

    Reference Issues/PRs

    NA

    What does this implement/fix? Explain your changes.

    Fixes a typo.

    Any other comments?

    Thank you for your time to review this PR!

    Documentation module:metrics 
    opened by 99991 0
  • [WIP] FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values

    This PR fixes #25319.

    As suggested by @glemaitre, I changed the X, y validation in ._fit, and then in .kneighbors and .radius_neighbors, of RadiusNeighborsMixin, KNeighborsMixin and NeighborsBase when metric="nan_euclidean". This consequently changes the behavior of the classes that inherit from them (KNeighborsTransformer, RadiusNeighborsTransformer, KNeighborsClassifier, RadiusNeighborsClassifier, LocalOutlierFactor, KNeighborsRegressor, RadiusNeighborsRegressor, NearestNeighbors).

    I also updated the NearestCentroid class to follow this new validation. To make it work I had to change the validation of sklearn.metrics.pairwise_distances_argmin and sklearn.metrics.pairwise_distances_argmin_min as well (updating the docs now that they support metric='nan_euclidean').

    KernelDensity uses kd_tree or ball_tree to build its index (https://github.com/scikit-learn/scikit-learn/blob/98cf537f5c538fdbc9d27b851cf03ce7611b8a48/sklearn/neighbors/_kde.py#L48), so it does not support metric='nan_euclidean', and I made no changes to it.

    from sklearn.neighbors import VALID_METRICS

    # Check which neighbor-search algorithms accept the "nan_euclidean" metric.
    for key in VALID_METRICS.keys():
        print(f"'nan_euclidean' in {key}:", 'nan_euclidean' in VALID_METRICS[key])
    # Output:
    # 'nan_euclidean' in ball_tree: False
    # 'nan_euclidean' in kd_tree: False
    # 'nan_euclidean' in brute: True
    

    Also, I added a test with the code used to report the issue by checking the behavior of the above classes.
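
    For context, the distance that the "nan_euclidean" metric is documented to compute can be sketched for a single pair of vectors as follows. This is a NumPy illustration of the definition used by sklearn.metrics.pairwise.nan_euclidean_distances, not the code changed in this PR, and the function name is hypothetical.

```python
import numpy as np


def nan_euclidean(x, y):
    """Euclidean distance over coordinates present in both vectors, rescaled."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~(np.isnan(x) | np.isnan(y))
    if not present.any():
        return np.nan  # no shared coordinate to compare
    squared = np.sum((x[present] - y[present]) ** 2)
    # Scale up by (total coordinates / present coordinates).
    return np.sqrt(x.size / present.sum() * squared)


d = nan_euclidean([3.0, np.nan, np.nan, 6.0], [1.0, np.nan, 4.0, 5.0])
print(d)
```

    Only two of the four coordinates are present in both vectors here, so the squared distance over those two is doubled before taking the square root.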


    This is a WIP PR as I was not able to run black/tests on my machine and will use CI/CD for it.

    module:metrics module:neighbors 
    opened by vitaliset 0
  • DOC: fix color maps of contour and scatter plots in the plot_kernel_approximation.py example.

    Reference Issues/PRs

    No

    What does this implement/fix? Explain your changes.

    The current visualization for decision surfaces and data points seems to have some inconsistencies.

    [Image: current_version]

    see also, https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_approximation.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-approximation-py

    Following are some examples of inconsistencies.

    1. The lower area of the "SVC with rbf kernel" plot (Left) is painted light-red, but the same area of other plots (Center, Right) is colored green. Note that these surfaces belong to the same label(=4).
    2. The purple area (lower left in each plot) covers yellow and blue points. It is difficult to tell if the purple area correctly classifies these yellow or blue points or none of them because of inconsistency between the colors of the surfaces and the points.

    The new version tried to fix these problems. See the figure below.

    [Image: new_version]

    Any other comments?

    Documentation 
    opened by i-aki-y 0
Releases(1.2.0)
  • 1.2.0(Dec 8, 2022)

    We're happy to announce the 1.2.0 release.

    You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_2_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v1.2.html

    This version supports Python versions 3.8 to 3.11.

  • 1.1.3(Oct 26, 2022)

    We're happy to announce the 1.1.3 release.

    This bugfix release only includes fixes for compatibility with the latest SciPy release >= 1.9.2 and wheels for Python 3.11. Note that support for 32-bit Python on Windows has been dropped in this release. This is due to the fact that SciPy 1.9.2 also dropped the support for that platform. Windows users are advised to install the 64-bit version of Python instead.

    You can see the changelog here: https://scikit-learn.org/dev/whats_new/v1.1.html#version-1-1-3

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 1.1.2(Aug 5, 2022)

    We're happy to announce the 1.1.2 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/dev/whats_new/v1.1.html#version-1-1-2

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 1.1.1(May 19, 2022)

    We're happy to announce the 1.1.1 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/dev/whats_new/v1.1.html#version-1-1-1

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 1.1.0(May 12, 2022)

    We're happy to announce the 1.1.0 release.

    You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_1_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v1.1.html#changes-1-1

    This version supports Python versions 3.8 to 3.10.

  • 1.0.2(Dec 25, 2021)

    We're happy to announce the 1.0.2 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/dev/whats_new/v1.0.html#version-1-0-2

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 1.0.1(Oct 25, 2021)

    We're happy to announce the 1.0.1 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/dev/whats_new/v1.0.html#version-1-0-1

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 1.0(Sep 24, 2021)

    We're happy to announce the 1.0 release. You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_0_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v1.0.html#changes-1-0

    This version supports Python versions 3.7 to 3.9.

  • 0.24.2(Apr 28, 2021)

    We're happy to announce the 0.24.2 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-2

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 0.24.1(Jan 19, 2021)

    We're happy to announce the 0.24.1 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-1

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 0.24.0(Dec 22, 2020)

    We're happy to announce the 0.24 release. You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_24_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-0

    This version supports Python versions 3.6 to 3.9.

  • 0.23.2(Aug 4, 2020)

    We're happy to announce the 0.23.2 release with several bugfixes:

    You can see the changelog here: https://scikit-learn.org/stable/whats_new/v0.23.html#version-0-23-2

    You can upgrade with pip as usual:

    pip install -U scikit-learn
    

    The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn
    
  • 0.23.1(May 19, 2020)

    We're happy to announce the 0.23.1 release which fixes a few issues affecting many users, namely: K-Means should be faster for small sample sizes, and the representation of third-party estimators was fixed.

    You can check this version out using:

        pip install -U scikit-learn

    You can see the changelog here: https://scikit-learn.org/stable/whats_new/v0.23.html#version-0-23-1

    The conda-forge builds will be available shortly, which you can then install using:

        conda install -c conda-forge scikit-learn

  • 0.23.0(May 12, 2020)

    We're happy to announce the 0.23 release. You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_23_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v0.23.html#version-0-23-0

    This version supports Python versions 3.6 to 3.8.

  • 0.22.2.post1(Mar 4, 2020)

    We're happy to announce the 0.22.2.post1 bugfix release.

    The 0.22.2.post1 release includes a packaging fix for the source distribution but the content of the packages is otherwise identical to the content of the wheels with the 0.22.2 version (without the .post1 suffix).

    Change log under https://scikit-learn.org/stable/whats_new/v0.22.html#changes-0-22-2.

    This version supports Python versions 3.5 to 3.8.

  • 0.22.1(Jan 2, 2020)

  • 0.22(Dec 3, 2019)

  • 0.20.4(Jul 30, 2019)

    Builds on top of Scikit-learn 0.20.3 to fix regressions and other issues released in version 0.20. See change log at https://scikit-learn.org/0.20/whats_new/v0.20.html

  • 0.21.3(Jul 30, 2019)

    A bug fix and documentation release, fixing regressions and other issues released in version 0.21. See change log at https://scikit-learn.org/0.21/whats_new/v0.21.html

  • 0.21.2(May 23, 2019)

  • 0.21.1(May 15, 2019)

  • 0.21.0(May 10, 2019)

  • 0.20.3(Mar 2, 2019)

  • 0.20.2(Dec 20, 2018)

  • 0.20.1(Nov 25, 2018)

  • 0.20.0(Nov 22, 2018)

  • 0.19.2(Nov 22, 2018)

Owner
scikit-learn
Repositories related to the scikit-learn Python machine learning library.
scikit-learn
A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

Feature Forge This library provides a set of tools that can be useful in many machine learning applications (classification, clustering, regression, e

Machinalis 380 Nov 5, 2022
SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.

SciKit-Learn Laboratory This Python package provides command-line utilities to make it easier to run machine learning experiments with scikit-learn. O

ETS 528 Nov 25, 2022
Genetic Programming in Python, with a scikit-learn inspired API

Welcome to gplearn! gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API. While Genetic Programming (GP)

Trevor Stephens 1.3k Jan 3, 2023
Using python and scikit-learn to make stock predictions

MachineLearningStocks in python: a starter project and guide EDIT as of Feb 2021: MachineLearningStocks is no longer actively maintained MachineLearni

Robert Martin 1.3k Dec 29, 2022
A scikit-learn compatible neural network library that wraps PyTorch

A scikit-learn compatible neural network library that wraps PyTorch. Resources Documentation Source Code Examples To see more elaborate examples, look

null 4.9k Dec 31, 2022
Scikit-learn compatible estimation of general graphical models

skggm : Gaussian graphical models using the scikit-learn API In the last decade, learning networks that encode conditional independence relationships

null 213 Jan 2, 2023
scikit-learn inspired API for CRFsuite

sklearn-crfsuite sklearn-crfsuite is a thin CRFsuite (python-crfsuite) wrapper which provides an interface similar to scikit-learn. sklearn_crfsuite.CRF i

null 417 Dec 20, 2022
Genetic feature selection module for scikit-learn

sklearn-genetic Genetic feature selection module for scikit-learn Genetic algorithms mimic the process of natural selection to search for optimal valu

Manuel Calzolari 260 Dec 14, 2022
Use evolutionary algorithms instead of gridsearch in scikit-learn

sklearn-deap Use evolutionary algorithms instead of gridsearch in scikit-learn. This allows you to reduce the time required to find the best parameter

rsteca 709 Jan 3, 2023
SigOpt wrappers for scikit-learn methods

SigOpt + scikit-learn Interfacing This package implements useful interfaces and wrappers for using SigOpt and scikit-learn together Getting Started In

SigOpt 73 Sep 30, 2022
A scikit-learn-compatible module for estimating prediction intervals.

|Anaconda|_ MAPIE - Model Agnostic Prediction Interval Estimator MAPIE allows you to easily estimate prediction intervals using your favourite sklearn

SimAI 584 Dec 27, 2022
Regression Metrics Calculation Made easy for tensorflow2 and scikit-learn

Regression Metrics Installation To install the package from the PyPi repository you can execute the following command: pip install regressionmetrics I

Ashish Patel 11 Dec 16, 2022
A real-time speech emotion recognition application using Scikit-learn and gradio

Speech-Emotion-Recognition-App A real-time speech emotion recognition application using Scikit-learn and gradio. Requirements librosa==0.6.3 numpy sou

Son Tran 6 Oct 4, 2022
Convert scikit-learn models to PyTorch modules

sk2torch sk2torch converts scikit-learn models into PyTorch modules that can be tuned with backpropagation and even compiled as TorchScript. Problems

Alex Nichol 101 Dec 16, 2022
This project uses reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can learn to read tape. The project is dedicated to hero in life great Jesse Livermore.

Reinforcement-trading This project uses Reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can

Deepender Singla 1.4k Dec 22, 2022
Objective of the repository is to learn and build machine learning models using Pytorch. 30DaysofML Using Pytorch

30 Days Of Machine Learning Using Pytorch Objective of the repository is to learn and build machine learning models using Pytorch. List of Algorithms

Mayur 119 Nov 24, 2022