hyperopt-sklearn
Hyperopt-sklearn is Hyperopt-based model selection among machine learning algorithms in scikit-learn.
See how to use hyperopt-sklearn through the examples or the older notebooks. More examples can be found in the Example Usage section of the SciPy paper:
Komer B., Bergstra J., and Eliasmith C. "Hyperopt-Sklearn: automatic hyperparameter configuration for Scikit-learn" Proc. SciPy 2014. http://conference.scipy.org/proceedings/scipy2014/pdfs/komer.pdf
Installation
Installation from a git clone using pip is supported:
git clone git@github.com:hyperopt/hyperopt-sklearn.git
(cd hyperopt-sklearn && pip install -e .)
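To confirm the install worked, a quick import check (an optional sanity step, not part of the project's instructions) can be run from the shell:
python -c "import hpsklearn"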
Usage
If you are familiar with sklearn, adding the hyperparameter search with hyperopt-sklearn is only a one-line change from the standard pipeline.
from hpsklearn import HyperoptEstimator, svc
from sklearn import svm
# Load Data
# ...
if use_hpsklearn:
    estim = HyperoptEstimator(classifier=svc('mySVC'))
else:
    estim = svm.SVC()
estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
# <<show score here>>
Each component comes with a default search space. The search space for each parameter can be changed or set constant by passing in keyword arguments. In the following example the penalty parameter is held constant during the search, and the loss and alpha parameters have their search spaces modified from the default.
from hpsklearn import HyperoptEstimator, sgd
from hyperopt import hp
import numpy as np
sgd_penalty = 'l2'
sgd_loss = hp.pchoice('loss', [(0.50, 'hinge'), (0.25, 'log'), (0.25, 'huber')])
sgd_alpha = hp.loguniform('alpha', low=np.log(1e-5), high=np.log(1))
estim = HyperoptEstimator(classifier=sgd('my_sgd', penalty=sgd_penalty, loss=sgd_loss, alpha=sgd_alpha))
estim.fit(X_train, y_train)
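hyperopt can draw random samples from a search-space expression, which is a quick way to sanity-check the modified spaces above. This is a small illustrative snippet, not part of the original example:
from hyperopt.pyll.stochastic import sample

# Draw a few random values from the spaces defined above; output varies per run
for _ in range(3):
    print(sample(sgd_loss), sample(sgd_alpha))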
Complete example using the Iris dataset:
from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
from sklearn.datasets import load_iris
from hyperopt import tpe
import numpy as np
# Download the data and split into training and test sets
iris = load_iris()
X = iris.data
y = iris.target
test_size = int(0.2 * len(y))
np.random.seed(13)
indices = np.random.permutation(len(X))
X_train = X[indices[:-test_size]]
y_train = y[indices[:-test_size]]
X_test = X[indices[-test_size:]]
y_test = y[indices[-test_size:]]
# Instantiate a HyperoptEstimator with the search space and number of evaluations
estim = HyperoptEstimator(classifier=any_classifier('my_clf'),
preprocessing=any_preprocessing('my_pre'),
algo=tpe.suggest,
max_evals=100,
trial_timeout=120)
# Search the hyperparameter space based on the data
estim.fit(X_train, y_train)
# Show the results
print(estim.score(X_test, y_test))
# 1.0
print(estim.best_model())
# {'learner': ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
# max_depth=3, max_features='log2', max_leaf_nodes=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=2,
# min_weight_fraction_leaf=0.0, n_estimators=13, n_jobs=1,
# oob_score=False, random_state=1, verbose=False,
# warm_start=False), 'preprocs': (), 'ex_preprocs': ()}
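The fitted estimator also supports predict, so the best model found during the search can be applied directly to new data. A small follow-on sketch:
# Label the test set with the best model found by the search
predictions = estim.predict(X_test)
print(predictions[:5])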
Here's an example using MNIST that is more specific about the classifier and preprocessing.
from hpsklearn import HyperoptEstimator, extra_trees
from sklearn.datasets import fetch_openml
from hyperopt import tpe
import numpy as np
# Download the data and split into training and test sets
# fetch_mldata was removed from scikit-learn; fetch_openml is the current loader
digits = fetch_openml('mnist_784', as_frame=False)
X = digits.data
y = digits.target
test_size = int(0.2 * len(y))
np.random.seed(13)
indices = np.random.permutation(len(X))
X_train = X[indices[:-test_size]]
y_train = y[indices[:-test_size]]
X_test = X[indices[-test_size:]]
y_test = y[indices[-test_size:]]
# Instantiate a HyperoptEstimator with the search space and number of evaluations
estim = HyperoptEstimator(classifier=extra_trees('my_clf'),
preprocessing=[],
algo=tpe.suggest,
max_evals=10,
trial_timeout=300)
# Search the hyperparameter space based on the data
estim.fit(X_train, y_train)
# Show the results
print(estim.score(X_test, y_test))
# 0.962785714286
print(estim.best_model())
# {'learner': ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='entropy',
# max_depth=None, max_features=0.959202875857,
# max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, min_samples_leaf=1,
# min_samples_split=2, min_weight_fraction_leaf=0.0,
# n_estimators=20, n_jobs=1, oob_score=False, random_state=3,
# verbose=False, warm_start=False), 'preprocs': (), 'ex_preprocs': ()}
Available Components
Not all of the classifiers/regressors/preprocessing algorithms from sklearn have been implemented yet. A list of those currently available is shown below. If there is something you would like that is not on the list, feel free to make an issue or a pull request! The source code for these components can be found in the hyperopt-sklearn repository.
Classifiers
svc
svc_linear
svc_rbf
svc_poly
svc_sigmoid
liblinear_svc
knn
ada_boost
gradient_boosting
random_forest
extra_trees
decision_tree
sgd
xgboost_classification
multinomial_nb
gaussian_nb
passive_aggressive
linear_discriminant_analysis
quadratic_discriminant_analysis
one_vs_rest
one_vs_one
output_code
For a simple generic search space across many classifiers, use any_classifier. If your data is in a sparse matrix format, use any_sparse_classifier, as sketched below.
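As a rough sketch of the sparse case, text vectorized with scikit-learn's TfidfVectorizer yields a sparse matrix that any_sparse_classifier can consume. The dataset and settings here are illustrative, not from the original examples:
from hpsklearn import HyperoptEstimator, any_sparse_classifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from hyperopt import tpe

# Vectorized text is a scipy sparse matrix
train = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
X_train = TfidfVectorizer().fit_transform(train.data)
y_train = train.target

estim = HyperoptEstimator(classifier=any_sparse_classifier('my_sparse_clf'),
                          preprocessing=[],
                          algo=tpe.suggest,
                          max_evals=10,
                          trial_timeout=60)
estim.fit(X_train, y_train)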
Regressors
svr
svr_linear
svr_rbf
svr_poly
svr_sigmoid
knn_regression
ada_boost_regression
gradient_boosting_regression
random_forest_regression
extra_trees_regression
sgd_regression
xgboost_regression
For a simple generic search space across many regressors, use any_regressor. If your data is in a sparse matrix format, use any_sparse_regressor.
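A regression search follows the same pattern as classification, passing regressor= instead of classifier=. Below is a minimal sketch on the diabetes dataset; the dataset and settings are illustrative:
from hpsklearn import HyperoptEstimator, any_regressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from hyperopt import tpe

# Load a small regression dataset and split it
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

estim = HyperoptEstimator(regressor=any_regressor('my_reg'),
                          preprocessing=[],
                          algo=tpe.suggest,
                          max_evals=10,
                          trial_timeout=60)
estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))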
Preprocessing
pca
one_hot_encoder
standard_scaler
min_max_scaler
normalizer
ts_lagselector
tfidf
rbm
colkmeans
For a simple generic search space across many preprocessing algorithms, use any_preprocessing. If you are working with raw text data, use any_text_preprocessing. Currently only TFIDF is used for text, but more may be added in the future.
Note that the preprocessing parameter in HyperoptEstimator expects a list, since various preprocessing steps can be chained together. The generic search space functions any_preprocessing and any_text_preprocessing already return a list, but the others do not, so they should be wrapped in a list. If you do not want to do any preprocessing, pass in an empty list [].
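For example, to chain a scaler with PCA ahead of a classifier, the steps go into the list in order. This is a sketch using components from the lists above; the names and settings are illustrative:
from hpsklearn import HyperoptEstimator, standard_scaler, pca, svc
from hyperopt import tpe

# Preprocessing steps are applied in list order before the classifier
estim = HyperoptEstimator(classifier=svc('my_svc'),
                          preprocessing=[standard_scaler('my_scaler'), pca('my_pca')],
                          algo=tpe.suggest,
                          max_evals=10,
                          trial_timeout=60)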