Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning

MLJAR

Last update: Jan 2, 2023

Overview

MLJAR Automated Machine Learning for Humans

Documentation: https://supervised.mljar.com/

Source Code: https://github.com/mljar/mljar-supervised

Automated Machine Learning 🚀

mljar-supervised is an Automated Machine Learning Python package that works with tabular data. It is designed to save data scientists time. It abstracts the common way to preprocess data, construct machine learning models, and tune hyper-parameters to find the best model 🏆. It is not a black box: you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).

The mljar-supervised will help you with:

explaining and understanding your data (Automatic Exploratory Data Analysis),
trying many different machine learning models (Algorithm Selection and Hyper-Parameters tuning),
creating Markdown reports from analysis with details about all models (Automatic Documentation),
saving, re-running and loading the analysis and ML models.

It has four built-in modes of work:

Explain mode, which is ideal for explaining and understanding the data, with many data explanations such as decision tree visualizations, linear model coefficients, permutation importances and SHAP explanations,
Perform mode, for building ML pipelines to use in production,
Compete mode, which trains highly-tuned ML models with ensembling and stacking for use in ML competitions,
Optuna mode, which searches for highly-tuned ML models and should be used when performance is the most important and computation time is not limited (available from version 0.10.0).

Of course, you can further customize the details of each mode to meet the requirements.
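
As a rough illustration of such customization, the sketch below collects a few constructor arguments that appear elsewhere on this page (`mode`, `algorithms`, `total_time_limit`, `explain_level`, `results_path`). The values, the directory name `AutoML_custom` and the helper `build_automl` are hypothetical; the import is deferred so the settings can be inspected without the package installed.

```python
# Illustrative settings only; the parameter names are taken from examples on this page.
custom_settings = {
    "mode": "Explain",
    "algorithms": ["Linear", "Random Forest", "Xgboost"],
    "total_time_limit": 600,          # seconds, illustrative value
    "explain_level": 1,               # illustrative value
    "results_path": "AutoML_custom",  # hypothetical directory name
}


def build_automl(settings):
    """Hypothetical helper: create an AutoML object from a settings dict."""
    # Deferred import: mljar-supervised is only needed when this is called.
    from supervised.automl import AutoML

    return AutoML(**settings)
```

Calling `build_automl(custom_settings)` and then `fit(X_train, y_train)` would start from Explain mode with the overridden details.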

What's good in it? 💥

It is using many algorithms: Baseline, Linear, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Networks, and Nearest Neighbors.
It can compute an Ensemble based on the greedy algorithm from the Caruana paper.
It can stack models to build a level 2 ensemble (available in Compete mode or after setting the stack_models parameter).
It can do feature preprocessing, like missing values imputation and converting categoricals. What is more, it can also handle target values preprocessing.
It can do advanced feature engineering, like Golden Features, Features Selection, Text and Time Transformations.
It can tune hyper-parameters with a not-so-random-search algorithm (random search over a defined set of values) and hill climbing to fine-tune final models.
It can compute the Baseline for your data, so you will know if you need Machine Learning or not!
It has extensive explanations. The package trains simple Decision Trees with max_depth <= 5, so you can easily visualize them with the amazing dtreeviz to better understand your data.
mljar-supervised uses simple linear regression and includes its coefficients in the summary report, so you can check which features are used the most in the linear model.
It cares about the explainability of models: for every algorithm, the feature importance is computed based on permutation. Additionally, for every algorithm the SHAP explanations are computed: feature importance, dependence plots, and decision plots (explanations can be switched off with the explain_level parameter).
There is automatic documentation for every ML experiment run with AutoML. mljar-supervised creates Markdown reports from AutoML training, full of ML details, metrics and charts.
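
The tuning idea from the list above (random search over a defined set of values, followed by hill climbing) can be sketched in plain Python. This is not the package's implementation: the search space `SPACE`, the stand-in objective `toy_score` and all function names are assumptions made for illustration.

```python
import random

# Hypothetical search space: each hyper-parameter has a fixed, ordered set of values.
SPACE = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}


def toy_score(params):
    """Stand-in for a validation loss (lower is better); a real run would train a model."""
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2


def random_search(space, n_trials, rng):
    """Sample distinct combinations from the defined value sets."""
    total = 1
    for values in space.values():
        total *= len(values)
    seen, trials = set(), []
    while len(trials) < min(n_trials, total):
        candidate = {name: rng.choice(values) for name, values in space.items()}
        key = tuple(sorted(candidate.items()))
        if key not in seen:
            seen.add(key)
            trials.append(candidate)
    return trials


def neighbours(params, space):
    """Parameter sets that differ by one step in exactly one hyper-parameter."""
    result = []
    for name, values in space.items():
        idx = values.index(params[name])
        for step in (-1, 1):
            if 0 <= idx + step < len(values):
                result.append({**params, name: values[idx + step]})
    return result


def hill_climb(start, space, score):
    """Move to the best neighbour until no neighbour improves the score."""
    best, best_score = start, score(start)
    while True:
        candidate = min(neighbours(best, space), key=score)
        if score(candidate) >= best_score:
            return best, best_score
        best, best_score = candidate, score(candidate)


rng = random.Random(0)
trials = random_search(SPACE, 5, rng)
start = min(trials, key=toy_score)  # best of the random trials
best, best_score = hill_climb(start, SPACE, toy_score)
print(best, best_score)
```

Because the toy objective has a single minimum on this grid, hill climbing ends at `max_depth=6`, `learning_rate=0.1` regardless of which random trial it starts from; a real validation loss gives no such guarantee.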

Automatic Documentation

The AutoML Report

The report from running AutoML contains a table with information about each model's score and the time needed to train it. For each model there is a link that you can click to see the model's details. The performance of all ML models is presented as scatter and box plots, so you can visually inspect which algorithms perform the best 🏆.

The `Decision Tree` Report

An example of a Decision Tree summary with tree visualization. For classification tasks, additional metrics are provided:

confusion matrix
threshold (optimized in the case of binary classification task)
F1 score
Accuracy
Precision, Recall, MCC
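
To make the threshold item concrete, here is a minimal sketch of choosing a decision threshold for a binary classifier. It assumes the threshold is chosen to maximize F1; the source does not say which criterion the package uses, and the function names and toy data are hypothetical.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are labelled as class 1."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob):
    """Try every distinct predicted probability as a threshold and keep the best."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))


# Toy data, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
threshold = best_threshold(y_true, y_prob)
f1 = f1_at_threshold(y_true, y_prob, threshold)
print(threshold, round(f1, 3))
```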

The `LightGBM` Report

The example for LightGBM summary:

Available Modes 📚

The docs present details about the AutoML modes in a table.

Explain

automl = AutoML(mode="Explain")

It is aimed to be used when the user wants to explain and understand the data.

It is using 75%/25% train/test split.
It is using: Baseline, Linear, Decision Tree, Random Forest, Xgboost, Neural Network algorithms and ensemble.
It has full explanations: learning curves, importance plots, and SHAP plots.
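
The importance plots rest on permutation importance: shuffle one feature, re-score the model, and record how much the error grows. The sketch below shows the idea in plain Python with a toy model; it is an illustration under these assumptions, not the package's code.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE after shuffling each column of X."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Toy 'model' that uses only feature 0."""
    return 3.0 * row[0]


X = [[float(i), float((i * 7) % 10)] for i in range(10)]
y = [3.0 * row[0] for row in X]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Feature 1 is ignored by the toy model, so shuffling it changes nothing and its importance is exactly zero, while shuffling feature 0 raises the error.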

Perform

automl = AutoML(mode="Perform")

It should be used when the user wants to train a model that will be used in real-life use cases.

It is using 5-fold CV.
It is using: Linear, Random Forest, LightGBM, Xgboost, CatBoost and Neural Network. It uses ensembling.
It has learning curves and importance plots in reports.
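
For readers unfamiliar with k-fold CV, the following sketch shows how 5 folds partition the sample indices. It does no shuffling or stratification, which a real validation setup may well apply; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits


splits = k_fold_indices(12, 5)
for train_idx, valid_idx in splits:
    print(len(train_idx), valid_idx)
```

Every sample lands in exactly one validation fold, so each model's out-of-fold predictions cover the whole training set.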

Compete

automl = AutoML(mode="Compete")

It should be used for machine learning competitions.

It adapts the validation strategy depending on dataset size and total_time_limit. It can be: train/test split (80/20), 5-fold CV or 10-fold CV.
It is using: Linear, Decision Tree, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Network and Nearest Neighbors. It uses ensemble and stacking.
It has only learning curves in the reports.

Optuna

automl = AutoML(mode="Optuna", optuna_time_budget=3600)

It should be used when the performance is the most important and time is not limited.

It is using 10-fold CV
It is using: Random Forest, Extra Trees, LightGBM, Xgboost, and CatBoost. Those algorithms are tuned by Optuna framework for optuna_time_budget seconds, each. Algorithms are tuned with original data, without advanced feature engineering.
It is using advanced feature engineering, stacking and ensembling. The hyperparameters found for original data are reused with those steps.
It produces learning curves in the reports.

Examples

👉 Binary Classification Example

There is a simple interface available with fit and predict methods.

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
    skipinitialspace=True,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[:-1]], df["income"], test_size=0.25
)

automl = AutoML()
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

AutoML fit will print:

Create directory AutoML_1
AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will optimize for metric: logloss
1_Baseline final logloss 0.5519845471086654 time 0.08 seconds
2_DecisionTree final logloss 0.3655910192804364 time 10.28 seconds
3_Linear final logloss 0.38139916864708445 time 3.19 seconds
4_Default_RandomForest final logloss 0.2975204390214936 time 79.19 seconds
5_Default_Xgboost final logloss 0.2731086827200411 time 5.17 seconds
6_Default_NeuralNetwork final logloss 0.319812276905242 time 21.19 seconds
Ensemble final logloss 0.2731086821194617 time 1.43 seconds
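
The logloss values in the output above can be read against a trivial reference. The sketch below computes binary logloss in plain Python and evaluates a constant prediction equal to the positive-class frequency; treating this as what a Baseline model does is an assumption, and the four-row data set is made up.

```python
import math


def logloss(y_true, y_prob, eps=1e-15):
    """Binary logloss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


y = [1, 0, 0, 0]                       # toy labels
prior = sum(y) / len(y)                # positive-class frequency
baseline = logloss(y, [prior] * len(y))
better = logloss(y, [0.9, 0.1, 0.1, 0.1])
print(round(baseline, 4), round(better, 4))
```

A model is only worth keeping if its logloss is clearly below such a constant-prediction reference.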

the AutoML results in Markdown report
the Xgboost Markdown report, please take a look at amazing dependence plots produced by SHAP package 💖
the Decision Tree Markdown report, please take a look at beautiful tree visualization ✨
the Logistic Regression Markdown report, please take a look at coefficients table, and you can compare the SHAP plots between (Xgboost, Decision Tree and Logistic Regression) ☕

👉 Multi-Class Classification Example

Example code for classifying the optical recognition of handwritten digits dataset. Running this code takes less than 30 minutes and results in a test accuracy of ~98%.

import pandas as pd 
# scikit-learn utilities
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# mljar-supervised package
from supervised.automl import AutoML

# load the data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(digits.data), digits.target, stratify=digits.target, test_size=0.25,
    random_state=123
)

# train models with AutoML
automl = AutoML(mode="Perform")
automl.fit(X_train, y_train)

# compute the accuracy on test data
predictions = automl.predict_all(X_test)
print(predictions.head())
print("Test accuracy:", accuracy_score(y_test, predictions["label"].astype(int)))

👉 Regression Example

Regression example on Boston house prices data. On test data it scores ~ 10.85 mean squared error (MSE).

import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older scikit-learn
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML # mljar-supervised

# Load the data
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(housing.data, columns=housing.feature_names),
    housing.target,
    test_size=0.25,
    random_state=123,
)

# train models with AutoML
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

# compute the MSE on test data
predictions = automl.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))

👉 More Examples

Income classification - it is a binary classification task on census data
Iris classification - it is a multiclass classification on Iris flowers data
House price regression - it is a regression task on Boston houses data

Documentation 📚

For details please check mljar-supervised docs.

Installation 📦

From the PyPI repository:

pip install mljar-supervised

From source code:

git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
python setup.py install

Installation for development

git clone https://github.com/mljar/mljar-supervised.git
virtualenv venv --python=python3.6
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt

Running in Docker:

FROM python:3.7-slim-buster
RUN apt-get update && apt-get -y update
RUN apt-get install -y build-essential python3-pip python3-dev
RUN pip3 -q install pip --upgrade
RUN pip3 install mljar-supervised jupyter
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]

Contributing

To get started take a look at our Contribution Guide for information about our process and where you can fit in!

License 👔

mljar-supervised is provided under the MIT license.

MLJAR ❤️

mljar-supervised is an open-source project created by MLJAR. We care about ease of use in Machine Learning. mljar.com provides a beautiful and simple user interface for building machine learning models.

Comments

model structure differences between ensemble/stacked/ensemble_stacked
Good day! I am reading your manual now but can't tell the model structure differences between ensemble/stacked/ensemble_stacked...

The following pictures show JSON files from the example code, and my questions are listed below. Could you please help to answer them?

The meaning of "repeat" here.

How can I understand the model structure for these three pictures?

ensemble.json

Optuna_extratrees_stacked/framework.json

Ensemble_stacked/ensemble.json

Best regards
opened by Tonywhitemin 29

I have been facing this issue for 2 days. I have no idea what's causing it.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-ea67f5362246> in <module>
      3 from sklearn.model_selection import train_test_split
      4 from sklearn.metrics import mean_squared_error
----> 5 from supervised.automl import AutoML # mljar-supervised

/opt/conda/lib/python3.7/site-packages/supervised/__init__.py in <module>
      1 __version__ = "0.7.15"
      2 
----> 3 from supervised.automl import AutoML

/opt/conda/lib/python3.7/site-packages/supervised/automl.py in <module>
      1 import logging
      2 
----> 3 from supervised.base_automl import BaseAutoML
      4 
      5 from supervised.utils.config import LOG_LEVEL

/opt/conda/lib/python3.7/site-packages/supervised/base_automl.py in <module>
     17 from sklearn.metrics import r2_score, accuracy_score
     18 
---> 19 from supervised.algorithms.registry import AlgorithmsRegistry
     20 from supervised.algorithms.registry import BINARY_CLASSIFICATION
     21 from supervised.algorithms.registry import MULTICLASS_CLASSIFICATION

/opt/conda/lib/python3.7/site-packages/supervised/algorithms/registry.py in <module>
     62 # Import algorithm to be registered
     63 import supervised.algorithms.random_forest
---> 64 import supervised.algorithms.xgboost
     65 import supervised.algorithms.decision_tree
     66 import supervised.algorithms.baseline

/opt/conda/lib/python3.7/site-packages/supervised/algorithms/xgboost.py in <module>
      4 import pandas as pd
      5 import os
----> 6 import xgboost as xgb
      7 
      8 from supervised.algorithms.algorithm import BaseAlgorithm

/opt/conda/lib/python3.7/site-packages/xgboost/__init__.py in <module>
      7 import os
      8 
----> 9 from .core import DMatrix, DeviceQuantileDMatrix, Booster
     10 from .training import train, cv
     11 from . import rabit  # noqa

/opt/conda/lib/python3.7/site-packages/xgboost/core.py in <module>
     17 import scipy.sparse
     18 
---> 19 from .compat import (
     20     STRING_TYPES, DataFrame, py_str,
     21     PANDAS_INSTALLED,

/opt/conda/lib/python3.7/site-packages/xgboost/compat.py in <module>
    106 # cudf
    107 try:
--> 108     from cudf import concat as CUDF_concat
    109 except ImportError:
    110     CUDF_concat = None

/opt/conda/lib/python3.7/site-packages/cudf/__init__.py in <module>
      9 import rmm
     10 
---> 11 from cudf import core, datasets, testing
     12 from cudf._version import get_versions
     13 from cudf.core import (

/opt/conda/lib/python3.7/site-packages/cudf/core/__init__.py in <module>
      1 # Copyright (c) 2018-2019, NVIDIA CORPORATION.
----> 2 from cudf.core import buffer, column
      3 from cudf.core.buffer import Buffer
      4 from cudf.core.dataframe import DataFrame, from_pandas, merge
      5 from cudf.core.index import (

/opt/conda/lib/python3.7/site-packages/cudf/core/column/__init__.py in <module>
      1 # Copyright (c) 2020, NVIDIA CORPORATION.
      2 
----> 3 from cudf.core.column.categorical import CategoricalColumn
      4 from cudf.core.column.column import (
      5     ColumnBase,

/opt/conda/lib/python3.7/site-packages/cudf/core/column/categorical.py in <module>
      6 
      7 import cudf
----> 8 from cudf import _lib as libcudf
      9 from cudf._lib.transform import bools_to_mask
     10 from cudf.core.buffer import Buffer

/opt/conda/lib/python3.7/site-packages/cudf/_lib/__init__.py in <module>
      2 import numpy as np
      3 
----> 4 from . import (
      5     avro,
      6     binaryop,

cudf/_lib/gpuarrow.pyx in init cudf._lib.gpuarrow()

AttributeError: module 'pyarrow.lib' has no attribute 'IpcWriteOptions'

installation

opened by kingabzpro 27

Custom eval metric

Hello,

I was wondering if it's possible to add fully custom eval metrics.

My specific use case is one where I would like to add up values of an arbitrary vector for all predictions that exceed a (percentile) threshold. In general, however, it would be great to have the option to decouple the eval metric used for fitting from the one used for evaluation/tuning.

Ideally, the user would be able to supply a function such as those in sklearn.metrics, accepting target values and predictions and returning a float. Whether to assume minimisation or maximisation (or have an extra parameter) isn't particularly important, imho.

Thoughts?
enhancement

opened by ecod3r 26
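
A metric of the shape requested in this issue (a function taking target values and predictions and returning a float) might look like the sketch below. It sums an arbitrary per-row vector over the predictions above a percentile cutoff. The names are hypothetical, the nearest-rank percentile is one possible choice, and how the vector would stay aligned with the rows of each validation fold is left open. Whether the tuner minimizes or maximizes the value is also an assumption to check; negate the result if needed.

```python
import math


def sum_above_percentile(y_pred, values, percentile=80.0):
    """Sum `values` for rows whose prediction exceeds the given percentile of y_pred."""
    n = len(y_pred)
    # Nearest-rank percentile: the rank-th smallest prediction is the cutoff.
    rank = max(1, math.ceil(percentile * n / 100.0))
    cutoff = sorted(y_pred)[rank - 1]
    return float(sum(v for p, v in zip(y_pred, values) if p > cutoff))


def make_metric(values, percentile=80.0):
    """Build a metric(y_true, y_pred) -> float closure over the per-row vector."""
    def metric(y_true, y_pred):
        # y_true is unused here; it is kept for the sklearn-style signature.
        return sum_above_percentile(y_pred, values, percentile)
    return metric


y_pred = [0.1, 0.9, 0.3, 0.8, 0.5, 0.2, 0.7, 0.4, 0.6, 0.05]
values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
metric = make_metric(values)
score = metric([0] * 10, y_pred)
print(score)
```

Another issue on this page passes a callable through the `eval_metric` argument, so a function of this shape may be the intended entry point.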
Can not load saved model

When I reloaded my model to do prediction, I got the following error:

KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in load(self, path)
    184                 ):
--> 185                     ens = Ensemble.load(path, model_subpath, models_map)
    186                     self._models += [ens]

~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/ensemble.py in load(results_path, model_subpath, models_map)
    436             ensemble.selected_models += [
--> 437                 {"model": models_map[m["model"]], "repeat": m["repeat"]}
    438             ]

KeyError: '15_LightGBM'

During handling of the above exception, another exception occurred:

AutoMLException                           Traceback (most recent call last)
in
----> 1 automl.predict(X_test)

~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/automl.py in predict(self, X)
    346             AutoMLException: Model has not yet been fitted.
    347         """
--> 348         return self._predict(X)
    349 
    350     def predict_proba(self, X):

~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in _predict(self, X)
   1298     def _predict(self, X):
   1299 
-> 1300         predictions = self._base_predict(X)
   1301         # Return predictions
   1302         # If classification task the result is in column 'label'

~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in _base_predict(self, X, model)
   1230         if model is None:
   1231             if self._best_model is None:
-> 1232                 self.load(self.results_path)
   1233             model = self._best_model
   1234 

~/miniconda3/envs/mljar/lib/python3.9/site-packages/supervised/base_automl.py in load(self, path)
    211 
    212         except Exception as e:
--> 213             raise AutoMLException(f"Cannot load AutoML directory. {str(e)}")
    214 
    215     def get_leaderboard(

AutoMLException: Cannot load AutoML directory. '15_LightGBM'

When I tried to refit it, it said: "This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'". I used the same method to train 5 models; the other 3 models are okay, but two had this error.

I used pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev to reinstall your package. I think a bug was introduced when you updated LightGBM.
bug

opened by xuzhang5788 21
Support for r2 metric in Optuna mode

Currently, r2 metric evaluation is not supported in the tuner/optuna/tuner.py file.

if eval_metric.name not in ["auc", "logloss", "rmse", "mae", "mape"]:
    raise AutoMLException(f"Metric {eval_metric.name} is not supported")

When I manually add 'r2' to the list, I encounter the following error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1054, in _fit
    trained = self.train_model(params)
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 356, in train_model
    mf.train(results_path, model_subpath)
  File "/usr/local/lib/python3.8/dist-packages/supervised/model_framework.py", line 185, in train
    self.learner_params = optuna_tuner.optimize(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/tuner.py", line 106, in optimize
    objective = LightgbmObjective(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/lightgbm.py", line 61, in __init__
    self.eval_metric_name = metric_name_mapping[ml_task][self.eval_metric.name]
KeyError: 'r2'

Is this a known limitation, and if so, is there a way to work around it?
enhancement help wanted

opened by Possums 21
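
One reason r2 needs separate handling is its direction: rmse, mae and mape are minimized, while r2 is better when larger. The sketch below computes r2 in plain Python and wraps it as a quantity to minimize. It only illustrates the metric and is not a workaround for the `KeyError` above, since the per-algorithm name mapping shown in the traceback would still need an entry.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def negative_r2(y_true, y_pred):
    """r2 turned into a loss, for optimizers that minimize."""
    return -r2_score(y_true, y_pred)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r2_score(y_true, y_pred)
print(round(r2, 4), round(negative_r2(y_true, y_pred), 4))
```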

unable to load models

Hello, I trained some models and gave a folder in which to save them, but when I try to load the models with the command below, it gives me an error:

automl = AutoML(
  mode="Compete",
  model_time_limit=(15)*60,
  n_jobs=-1,
  results_path="/media/autosk4/",
  explain_level=0,  
  algorithms=["LightGBM","CatBoost"],
  start_random_models=2
)

2021-04-17 09:17:50,775 supervised.exceptions ERROR Cannot load AutoML directory. '1_Default_LightGBM_GoldenFeatures'

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in load(self, path)
    185                 ):
--> 186                     ens = Ensemble.load(path, model_subpath, models_map)
    187                     self._models += [ens]

~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/ensemble.py in load(results_path, model_subpath, models_map)
    436             ensemble.selected_models += [
--> 437                 {"model": models_map[m["model"]], "repeat": m["repeat"]}
    438             ]

KeyError: '1_Default_LightGBM_GoldenFeatures'

During handling of the above exception, another exception occurred:

AutoMLException                           Traceback (most recent call last)
<ipython-input-6-437ae6b31a0f> in <module>
      6                 algorithms=["LightGBM","CatBoost"],start_random_models=2)
      7 
----> 8 predictions = automl.predict(X_test)
      9 
     10 predictions[X_test['momkene_out']!=2]=0

~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/automl.py in predict(self, X)
    344             AutoMLException: Model has not yet been fitted.
    345         """
--> 346         return self._predict(X)
    347 
    348     def predict_proba(self, X):

~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in _predict(self, X)
   1298     def _predict(self, X):
   1299 
-> 1300         predictions = self._base_predict(X)
   1301         # Return predictions
   1302         # If classification task the result is in column 'label'

~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in _base_predict(self, X, model)
   1230         if model is None:
   1231             if self._best_model is None:
-> 1232                 self.load(self.results_path)
   1233             model = self._best_model
   1234 

~/anaconda3/envs/autosk/lib/python3.7/site-packages/supervised/base_automl.py in load(self, path)
    212 
    213         except Exception as e:
--> 214             raise AutoMLException(f"Cannot load AutoML directory. {str(e)}")
    215 
    216     def get_leaderboard(

AutoMLException: Cannot load AutoML directory. '1_Default_LightGBM_GoldenFeatures'

and these are files in 1_default_light_... folder

framework.json		     learner_fold_2_training.log
learner_fold_0.lightgbm      learning_curves.png
learner_fold_0_training.log  predictions_out_of_folds.csv
learner_fold_1.lightgbm      README.html
learner_fold_1_training.log  README.md
learner_fold_2.lightgbm      status.txt

bug

opened by nasergh 18

Custom CV strategy

I have trained an automl model, where the ensemble seems to work well with the test set. Thus, I want to try my own CV scheme in a 'leave one year out' way (removing a year, training on other years, and testing on selected year).

For this, I need to be able to re-train the ensemble like a scikit-learn pipeline. How can I retain the ensemble itself? The '.fit' function does not seem to follow the sklearn estimator convention (taking a numpy array as input).
enhancement

opened by drorhilman 16
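
The 'leave one year out' scheme described in this issue comes down to generating index splits by year. Here is a minimal sketch with a hypothetical function name and toy data. A traceback elsewhere on this page shows that `_fit` accepts a `cv` argument, but whether and how splits like these can be passed in is not covered here.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, year in enumerate(years) if year != held_out]
        test = [i for i, year in enumerate(years) if year == held_out]
        yield held_out, train, test


years = [2018, 2018, 2019, 2020, 2019, 2020]  # one entry per row, toy data
year_splits = list(leave_one_year_out(years))
for held_out, train_idx, test_idx in year_splits:
    print(held_out, train_idx, test_idx)
```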
Saving mljar automl model for future use

Hi, traditionally I have been using the pickle package to save models in a pkl file and re-use them continuously on live data. I see the mljar model has to json and from json methods. Could you please create a small PoC or example with documentation showing how we could re-use it for daily / live data? Thanks. :)
bug enhancement

opened by vivek2319 14
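
Another issue on this page reloads a trained run by pointing a new `AutoML` object at the same `results_path` and calling `predict`. Wrapped as a helper, that pattern might look like the sketch below; the helper name is hypothetical, and the import is deferred so the function can be defined without the package installed.

```python
def predict_with_saved_automl(results_path, X):
    """Hypothetical helper: reuse a previously trained AutoML directory for new data."""
    # Deferred import: mljar-supervised is only needed when this is called.
    from supervised.automl import AutoML

    # No fit() call here: the models are expected to be loaded from results_path.
    automl = AutoML(results_path=results_path)
    return automl.predict(X)
```

For daily or live data, the directory written during training would be kept on disk and this helper called with each new batch.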
Error in kaggle notebook kernel
There is a problem with the pyarrow version in Kaggle notebooks. There is an error:

[Errno 2] No such file or directory: 'AutoML_1/y.parquet'

See kaggle comment for details: https://www.kaggle.com/mt77pp/mljar-autoeda-automl-prediction/comments#1197484
bug
opened by pplonski 13
Ensemble model only using 2 models to ensemble

When I look at the readme.md file of the Ensemble folder, it shows only 2 models out of so many others that it used to ensemble. Is there a reason for this? Also, when I look at the Ensemble_stacked, it shows just 1, "Ensemble" model as the one used for stack_ensemble.

opened by alitirmizi23 11
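
A greedy ensemble in the style of the Caruana paper mentioned in the README can legitimately end up with only a few models. The sketch below is a plain-Python illustration, not the package's code: models are added one at a time, with replacement, only while the averaged prediction improves. Reading the `repeat` field in ensemble.json as the number of times a model was picked is an assumption.

```python
from collections import Counter


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble(predictions, y_true, max_steps=10):
    """Greedy selection with replacement; returns pick counts and the final loss."""
    selected = Counter()
    ensemble_sum = [0.0] * len(y_true)
    best_loss = float("inf")
    for _ in range(max_steps):
        n = sum(selected.values()) + 1
        step_best = None
        for name, preds in predictions.items():
            averaged = [(s + p) / n for s, p in zip(ensemble_sum, preds)]
            loss = mse(y_true, averaged)
            if step_best is None or loss < step_best[1]:
                step_best = (name, loss)
        if step_best[1] >= best_loss:
            break  # no candidate improves the ensemble any further
        name, best_loss = step_best
        selected[name] += 1
        ensemble_sum = [s + p for s, p in zip(ensemble_sum, predictions[name])]
    return selected, best_loss


y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "A": [1.1, 2.1, 3.1, 4.1],  # slightly too high
    "B": [0.8, 1.8, 2.8, 3.8],  # too low
    "C": [2.0, 3.0, 4.0, 5.0],  # much too high, never helps
}
selected, best_loss = greedy_ensemble(predictions, y_true)
print(dict(selected), best_loss)
```

In this toy run model A is picked twice and B once because their errors cancel, while C is never selected. An ensemble report listing only two models would be consistent with this kind of procedure.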
Prediction time is taking longer than expected

Hi! I've banged my head against the wall for a couple of days and can't solve this. Prediction times are taking much longer than is to be expected. Running an AutoML model for a regression task takes upwards of 3 seconds for a single prediction.

I believe this is because the predict method for the AutoML class loads every model that was saved every time you ask for a prediction. It would be much more optimal to load all models on call init, and call their prediction methods without having to load them every time.
enhancement

opened by salomonMuriel 11
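
The suggestion in this issue, load once and predict many times, is a caching pattern. The sketch below shows it with a hypothetical stand-in loader; nothing in it is the package's API. The tracebacks on this page show `_base_predict` calling `load` only when `self._best_model is None`, which suggests that keeping a single `AutoML` object alive between predictions may already avoid repeated loading.

```python
import functools

LOAD_CALLS = {"count": 0}


def slow_load(path):
    """Hypothetical stand-in for reading saved models from disk."""
    LOAD_CALLS["count"] += 1
    return lambda x: 2.0 * x  # toy "model"


@functools.lru_cache(maxsize=None)
def get_model(path):
    """Load the model for a path once and keep it in memory."""
    return slow_load(path)


def predict(path, x):
    """Predict with the cached model instead of reloading it on every call."""
    return get_model(path)(x)


results = [predict("AutoML_1", x) for x in (1.0, 2.0, 3.0)]
print(results, LOAD_CALLS["count"])
```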
ERROR:supervised.exceptions:No models produced.

I have tried installing in various ways on Google Colab, but all of them show the following results. Please help me.

ERROR:supervised.exceptions:No models produced.
Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.
AutoML directory: AutoML_3
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
'Baseline'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 195, in generate_params
    return self.simple_algorithms_params(models_cnt)
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 382, in simple_algorithms_params
    params = self._get_model_params(model_type, seed=i + 1)
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 954, in _get_model_params
    model_info = AlgorithmsRegistry.registry[self._ml_task][model_type]
KeyError: 'Baseline'

Skip simple_algorithms because no parameters were generated.
'Xgboost'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 197, in generate_params
    return self.default_params(models_cnt)
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 431, in default_params
    if self.skip_if_rows_cols_limit(model_type):
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/mljar_tuner.py", line 400, in skip_if_rows_cols_limit
    max_rows_limit = AlgorithmsRegistry.get_max_rows_limit(
  File "/usr/local/lib/python3.8/dist-packages/supervised/algorithms/registry.py", line 51, in get_max_rows_limit
    return AlgorithmsRegistry.registry[ml_task][algorithm_name]["additional"][
KeyError: 'Xgboost'

Skip default_algorithms because no parameters were generated.

AutoMLException                           Traceback (most recent call last)
in
----> 1 automl.fit(X, y)

2 frames
/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py in _fit(self, X, y, sample_weight, cv)
   1049                 if "hill_climbing" in step or step in ["ensemble", "stack"]:
   1050                     if len(self._models) == 0:
-> 1051                         raise AutoMLException(
   1052                             "No models produced. \nPlease check your data or"
   1053                             " submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new."

AutoMLException: No models produced.
Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.

opened by moonjoo98 5
standardize the project using `pipenv`
At present, setting up the development environment is difficult and inefficient.

Various packages are not compatible with the latest version of python(3.11.x)

pip install numba does not work with python 3.11.x (screenshot attached below)

pytest version is also outdated, which no longer works. Updating pytest to latest version solves the problem (5.3.5 to 7.2.0)

We can use package manager like pipenv to track all the dependencies used in the project.

References

pipenv documentation - https://pipenv.pypa.io/en/latest/

I'm using the latest version of python viz 3.11.0

Screenshot of pip install numba
opened by nkilm 0
No matching distribution found for catboost>=0.24.4
Steps to reproduce this error

create virtualenv

install requirements using pip install -r requirements.txt

Following message will be displayed.

Had to install catboost separately using pip install catboost.
opened by nkilm 0
change Arial font in base_automl.py to avoid issues in linux

In base_automl.py there is the following configuration: font-family: Arial

It causes Ubuntu Docker containers running mljar to write the following message extensively:

WARNING | matplotlib.font_manager - findfont: Font family 'Arial' not found.

Can this font be changed to some non-Office font so this message will disappear?
enhancement

opened by yairVanti 6

No Shap outputs

Hi, I'm not seeing any shap outputs when using the following:

# Initialize AutoML in Explain Mode
automl = AutoML(mode="Explain", 
                explain_level=2,
               ml_task='multiclass_classification')
automl.fit(X, y)

This is in spite of shap being properly installed. What I get out of the above code is the following:

AutoML directory: AutoML_7
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline logloss 3.229533 trained in 25.56 seconds
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
2_DecisionTree logloss 2.15877 trained in 59.34 seconds
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
3_Linear logloss 1.707406 trained in 47.68 seconds
* Step default_algorithms will try to check up to 2 models
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
4_Default_NeuralNetwork logloss 4.045366 trained in 7.02 seconds
In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
5_Default_RandomForest logloss 1.858415 trained in 75.39 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 1.288517 trained in 0.56 seconds
AutoML fit time: 226.47 seconds
AutoML best model: Ensemble
AutoML(explain_level=2, ml_task='multiclass_classification')
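The per-model lines in the log above follow a regular `<name> <metric> <score> trained in <seconds> seconds` pattern, so they can be collected into a small leaderboard. A minimal sketch, assuming that line format holds for the whole log; `parse_training_log` and `ModelResult` are hypothetical helper names, not part of mljar-supervised:

```python
import re
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    metric: str
    score: float
    train_seconds: float


# Matches lines such as "3_Linear logloss 1.707406 trained in 47.68 seconds".
_RESULT_LINE = re.compile(
    r"^(?P<name>\S+) (?P<metric>\S+) (?P<score>-?\d+(?:\.\d+)?) "
    r"trained in (?P<seconds>\d+(?:\.\d+)?) seconds$"
)


def parse_training_log(log_text: str) -> List[ModelResult]:
    """Extract per-model results; warnings and other lines are skipped."""
    results = []
    for line in log_text.splitlines():
        match = _RESULT_LINE.match(line.strip())
        if match:
            results.append(ModelResult(
                match["name"], match["metric"],
                float(match["score"]), float(match["seconds"]),
            ))
    return results


# Lines copied from the log above (warnings shortened).
log = """\
2_DecisionTree logloss 2.15877 trained in 59.34 seconds
lbfgs failed to converge (status=1):
3_Linear logloss 1.707406 trained in 47.68 seconds
* Step default_algorithms will try to check up to 2 models
4_Default_NeuralNetwork logloss 4.045366 trained in 7.02 seconds
5_Default_RandomForest logloss 1.858415 trained in 75.39 seconds
Ensemble logloss 1.288517 trained in 0.56 seconds
AutoML fit time: 226.47 seconds
"""

results = parse_training_log(log)
# Lower logloss is better, so the best model has the minimum score.
best = min(results, key=lambda r: r.score)
print(best.name, best.score)  # Ensemble 1.288517
```

This agrees with the `AutoML best model: Ensemble` line reported at the end of the run.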

bug help wanted

opened by dbrami 3

ensemble.json not found error when training in Compete mode with total_time_limit

After training in Compete mode, I get this error when trying to load the model:

automl = AutoML(mode='Compete', results_path=model_path, total_time_limit=24*3600, eval_metric=sign_penalty)
automl_trained = AutoML(results_path=model_path)
automl_predictions = automl_trained.predict(X_test)

FileNotFoundError                         Traceback (most recent call last)
File c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py:199, in BaseAutoML.load(self, path)
    196 if model_subpath.endswith("Ensemble") or model_subpath.endswith(
    197     "Ensemble_Stacked"
    198 ):
--> 199     ens = Ensemble.load(path, model_subpath, models_map)
    200     self._models += [ens]

File c:\ProgramData\Anaconda3\lib\site-packages\supervised\ensemble.py:435, in Ensemble.load(results_path, model_subpath, models_map)
    433 logger.info(f"Loading ensemble from {model_path}")
--> 435 json_desc = json.load(open(os.path.join(model_path, "ensemble.json")))
    437 ensemble = Ensemble(json_desc.get("optimize_metric"), json_desc.get("ml_task"))

FileNotFoundError: [Errno 2] No such file or directory: 'trained_models/Compete_%_change_close_BTCUSDT_spot_15m_custom_loss+2h\\Ensemble\\ensemble.json'

During handling of the above exception, another exception occurred:

AutoMLException                           Traceback (most recent call last)
c:\dev\Python\Mastermind\mastermind\training\LAB_MLJAR_custom_loss.ipynb Cell 15 in <cell line: 2>()
      1 automl_trained = AutoML(results_path=model_path)
----> 2 automl_predictions = automl_trained.predict(X_test)
      3 pd.Series(automl_predictions).describe()

File c:\ProgramData\Anaconda3\lib\site-packages\supervised\automl.py:387, in AutoML.predict(self, X)
...
    223         self.n_classes = self._data_info["n_classes"]
    225 except Exception as e:
--> 226     raise AutoMLException(f"Cannot load AutoML directory. {str(e)}")

AutoMLException: Cannot load AutoML directory. [Errno 2] No such file or directory: 'trained_models/Compete_%_change_close_BTCUSDT_spot_15m_custom_loss+2h\\Ensemble\\ensemble.json'

The errors.md file:

## Error for Ensemble

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Traceback (most recent call last):
  File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 1083, in _fit
    trained = self.ensemble_step(
  File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 401, in ensemble_step
    self.ensemble.fit(oofs, target, sample_weight)
  File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\ensemble.py", line 237, in fit
    if self.metric.improvement(previous=min_score, current=score):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()


Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new

## Error for Ensemble_Stacked

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Traceback (most recent call last):
  File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 1083, in _fit
    trained = self.ensemble_step(
  File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\base_automl.py", line 401, in ensemble_step
    self.ensemble.fit(oofs, target, sample_weight)
  File "c:\ProgramData\Anaconda3\lib\site-packages\supervised\ensemble.py", line 237, in fit
    if self.metric.improvement(previous=min_score, current=score):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()


Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new
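Both entries in errors.md fail at `if self.metric.improvement(previous=min_score, current=score):`, which suggests that a score reached this comparison as an array rather than a single number. If the ensemble step failed for that reason, `ensemble.json` was presumably never written, which would explain the later `FileNotFoundError`. One possible cause, not confirmed in this thread, is that the custom `eval_metric` (`sign_penalty` here) returns per-sample values instead of one aggregated score. A minimal sketch of a guard that fails fast in that case; `ensure_scalar_metric` is a hypothetical helper, and the `(y_true, y_predicted, sample_weight=None)` signature is an assumption about how the custom metric is called:

```python
from functools import wraps


def ensure_scalar_metric(metric):
    """Wrap a custom metric so that it must return a single number."""

    @wraps(metric)
    def wrapper(y_true, y_predicted, sample_weight=None):
        value = metric(y_true, y_predicted, sample_weight)
        try:
            # float() rejects lists and multi-element arrays.
            return float(value)
        except (TypeError, ValueError) as exc:
            raise TypeError(
                f"{metric.__name__} must return a single number, "
                f"got {type(value).__name__}"
            ) from exc

    return wrapper


def per_sample_error(y_true, y_predicted, sample_weight=None):
    # Returns one value per sample: not usable as a model score.
    return [abs(t - p) for t, p in zip(y_true, y_predicted)]


def mean_abs_error(y_true, y_predicted, sample_weight=None):
    # Aggregates to one number, which is what a score comparison needs.
    errors = [abs(t - p) for t, p in zip(y_true, y_predicted)]
    return sum(errors) / len(errors)


safe_metric = ensure_scalar_metric(mean_abs_error)
print(safe_metric([1.0, 2.0], [1.5, 2.5]))  # 0.5
```

Wrapping `per_sample_error` the same way raises a `TypeError` on the first call, which points at the metric itself instead of surfacing later inside the ensemble step.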

opened by Karlheinzniebuhr 4

Releases(v0.11.5)

v0.11.5(Dec 30, 2022)
Bug fixes and updates

#595 replace Boston example dataset with California housing dataset, replace mse metric with squared_error for tree-based algorithms from sklearn

#596 change the import method for dtreeviz package

v0.11.4(Dec 14, 2022)
Fixes

#590 dynamically set font in a report, thanks @yairVanti!

v0.11.3(Aug 16, 2022)

Unpin shap version #551
v0.11.2(Mar 2, 2022)
Enhancements

#523 Add type hints to AutoML class, thank you @DanielR59

#519 save train&validation index to file in train/test split, thanks @filipsPL @MaciekEO

Bug fixes

#496 fix exception in baseline mode, thanks @DanielR59 @moshe-rl

#522 fixed requirements issue, thanks @DanielR59 @MaciekEO

#514 remove warning, thanks @MaciekEO

#511 disable EDA, thanks @MaciekEO

v0.11.0(Sep 6, 2021)
Bug fixes

#463 change multiprocessing to Parallel with loky

#462 handle large data for tree visualization in regression

#419 remove/hide warnings

#411 loosen dependencies for numpy and scipy

0.10.4(Jun 8, 2021)
Enhancements

#81 add scatter plot predicted vs target in regression

#158 add ROC curve for binary classification

#336 add visualization for Optuna results

#352 add support for Colab

#374 update seaborn

#378 set golden features number

#379 switch off boost_on_errors step in Optuna mode

#380 add custom cross validation strategy

#386 add correlation heatmap

#387 add residual plot

#389 add feature importance heatmap

#390 add custom eval metric

#393 update sklearn

Bug fixes

#308 fix error in kaggle kernel

#353, #355, #366, #368, #376, #382, #383, #384 fixes

Docs

#391 add info about hyperparameters optimization methods

Big thank you for the help to: @ecoskian, @xuzhang5788, @xiaobo, @RafaD5, @drorhilman, @strelzoff-erdc, @muxuezi, @tresoldi THANK YOU !!!
0.10.3(Apr 1, 2021)
Enhancements

#343 set seed in Optuna

#344 set eval_metric directly in all algorithms

#350 add estimated train time in Optuna mode

#342 add optuna_verbose param in AutoML()

#354 add KNN in Optuna

#356 add Neural Network in Optuna

#357, #348 use mljar wrapper for Random Forest and Extra Trees

#358 add extra_tree param in LightGBM

#359 switch off feature engineering in Optuna mode - only highly tuned models are produced

#361 list all eval_metric in error message

#362 add accuracy eval_metric

#340 support for r2

Bug fixes

#347 don't include Optuna tuning time in total_time_limit

#360 missing auc scores for training in CatBoost

0.10.2(Mar 17, 2021)

Add support for Python 3.9 (#339) Thanks to @rterbush!
0.10.1(Mar 16, 2021)
Enhancements

#332 We added the Optuna framework for hyperparameter tuning. It can be used by setting mode="Optuna" in AutoML. You can read more details in the blog post: https://mljar.com/blog/automl-optuna/
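As a sketch of the setting described in #332, the constructor arguments can be collected in a plain dictionary and passed to `AutoML`. Only `mode`, `optuna_verbose`, and `results_path` are used, all of which appear in these release notes or in the issues above; the values and the `build_optuna_automl` helper are illustrative assumptions, and the import requires mljar-supervised to be installed:

```python
# Illustrative values; only the parameter names come from this page.
OPTUNA_KWARGS = {
    "mode": "Optuna",                 # enables Optuna-based tuning (#332)
    "optuna_verbose": True,           # added later, in 0.10.3 (#342)
    "results_path": "AutoML_optuna",  # hypothetical output directory
}


def build_optuna_automl(**overrides):
    """Create an AutoML object in Optuna mode (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without the package.
    from supervised.automl import AutoML

    return AutoML(**{**OPTUNA_KWARGS, **overrides})


# Usage, assuming X_train and y_train are already loaded:
# automl = build_optuna_automl()
# automl.fit(X_train, y_train)
```

Keeping the settings in one dictionary makes it easy to override a single value per experiment, for example `build_optuna_automl(results_path="run_2")`.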

0.9.1(Mar 2, 2021)
Enhancements

#179 add need_retrain() method to detect performance decrease

#226 extract rules from decision tree

#310 add support for MAPE

#312 optimize prediction time

#313 set stacking time threshold depending on best model train time

#320 search for model with prediction time constraint

#322 n_jobs as a parameter

#328 disable stacking for small (nrows < 500) datasets

Bug fixes

#214 move directory after training

#246 raise exception when small time limit and no models are trained

#247 proper display for optimize AUC and R2

#306 add mix_encoding argument in AutoML constructor

#308 fix dependencies error in kaggle notebook

#314 bug fix in hill climbing in Perform mode

#323 fix catboost bug with tree limit

#324 #325 bug for feature importance for small data

0.8.8(Feb 3, 2021)

Many small improvements.
0.8.4(Jan 29, 2021)

A lot of small tweaks and improvements :)
0.8.0(Jan 22, 2021)
Enhancements

#300 Add step with k-means additional features

#299 Add Boost On Errors step

#154 Sample weight available

#229 Sort leaderboard (disabled for now for debug purposes)

Bug fixes

#301 Fix storing unique keys in mljar tuner only for trained models

#275 #248 small fixes

0.7.19(Jan 12, 2021)
Bug fixes

#293 Typo in is_scale_needed

#277 Fix problem with unit data

#285 Restricted characters in LightGBM

0.7.18(Jan 11, 2021)
Bug fixes

#292 Remove unused params from CatBoost

0.7.17(Jan 11, 2021)
Enhancements

#291 Disable loo encoding

#290 improve ordering in hill climbing

#287 replace mix_encoding with integer_encoding

0.7.16(Jan 10, 2021)
Bug fixes

#283 Don't use Random Feature model

Enhancements

#284 Check time for features selection

#286 Add R2 score

#288 Improve algorithms order in not_so_random step

0.7.15(Dec 17, 2020)
Enhancements

#274 limit number of iterations in CatBoost

0.7.13(Dec 11, 2020)
Enhancements

#92 add time checks

#270 disable stacking for validation type split and repeats > 1

#271 disable ldb model sort

0.7.12(Dec 8, 2020)
Enhancements

#223 Support for repeated validation

#266 Adjust validation for small datasets

Bug fixes

#265 fix validation warning

#264 fix EDA tests

#261 better error message for missing golden features

Dependencies

#260 update fastparquet to 0.4.1

0.7.11(Dec 3, 2020)
Bug fixes

#258 Fix: can't load AutoML when adjusted validation is used

0.7.10(Dec 1, 2020)
Enhancements

#250 New strategies for categorical encoding

#257 Control algorithm order in not-so-random step

Bug fixes

#255 Fix overwrite in adjusted models

0.7.9(Nov 30, 2020)
Enhancements

#249 Adjust validation type in Compete mode

0.7.8(Nov 27, 2020)
Enhancements

#249 Adjust validation type based on data

#251 add more eval_metrics in regression

#252 add traceback to error reports

Bug fixes

#253 Fix error when text data has missing values in test fold

0.7.7(Nov 26, 2020)
Enhancements

#73 Optimize AUC

Bug fixes

#136 RMSE in Extra Trees and Random Forest

#243 Switch off Xgboost and CatBoost for multiclass with many classes (in extreme cases, switch off Extra Trees and Random Forest)

#245 Fix ordering of prediction columns

0.7.6(Nov 24, 2020)
Enhancements

#240 Change algorithm execution order for default algorithms

Bug fixes:

#236 Wrong labels for target predictions in the case of -1, 1 target

#238 Object of type float32 is not JSON serializable

#239 Value Error: Input contains NaN in numpy training array

0.7.5(Nov 23, 2020)
Bug fixes

#216 Raise exception when all models fail with an error

#234 Fix target with first empty value

0.7.4(Nov 23, 2020)
Enhancements

#184 Change Keras+TF Neural Networks to scikit-learn MLP

#233 Limit stacking number of classes and models

#232 Remove Linear model from Compete mode

#208 Improve importance computation for large number of columns

#205 Remove small learning rates for Xgboost

Bug fixes:

#231 Restricted characters in feature_names in Xgboost

#227 Fix strings in golden_features.json - thank you @SuryaThiru!

#215 Assure at least 20 samples (or k_folds) for each class

Docs update:

#213 Update docs in AutoML - thank you @shahules786!

0.7.3(Sep 21, 2020)
New features ✨

#176 extended EDA - thanks to @shahules786

Bug fixes 🐛

#201 error in golden features sampling

#199 bug for float multi-class labels

#196 add exception for empty data

#195 set threshold for accuracy metric instead of f1

#194 ensemble should be best model if has more than 1 model

#193 fixed predict after model loading

#192 update pyarrow

#191 hide shap warnings

#190 fix in preprocessing

#188 fix type in feature selection - thanks to @uditswaroopa

0.7.2(Sep 15, 2020)
Bug fixes 🐛

#187 fix wrong order in golden features step

#186 fix _get_results_path

#185 fix models loading

#184 exception when drop all features during selection

#182 catch exceptions from model and log to errors.md

#181 remove forbidden characters in EDA

#177 change docstring to google-style

#175 remove tuning_mode parameter from AutoML

