XGBoost + Optuna

Overview

AutoXGB

XGBoost + Optuna: no brainer

  • auto train XGBoost directly from CSV files
  • auto tune XGBoost using Optuna
  • auto serve the best XGBoost model using FastAPI

NOTE: PRs are currently not accepted. If you run into issues or problems, please create an issue.

Installation

Install using pip:

pip install autoxgb
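
Note: autoxgb pins exact versions of its dependencies (for example xgboost==1.5.0, scikit-learn==1.0.1 and pyarrow==6.0.0, as the install log in the Comments below shows), so a fresh virtual environment is the safest place to install it. A minimal sketch:

python -m venv autoxgb-env
source autoxgb-env/bin/activate
pip install autoxgb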

Usage

Training a model using AutoXGB is a piece of cake. All you need is some tabular data.

Parameters

###############################################################################
### required parameters
###############################################################################

# path to training data
train_filename = "data_samples/binary_classification.csv"

# path to output folder to store artifacts
output = "output"

###############################################################################
### optional parameters
###############################################################################

# path to test data. if specified, the model will be evaluated on the test data
# and test_predictions.csv will be saved to the output folder
# if not specified, only OOF predictions will be saved
# test_filename = "test.csv"
test_filename = None

# task: classification or regression
# if not specified, the task will be inferred automatically
# task = "classification"
# task = "regression"
task = None

# an id column
# if not specified, the id column will be generated automatically with the name `id`
# idx = "id"
idx = None

# target columns are list of strings
# if not specified, the target column will be assumed to be named `target`
# and the problem will be treated as one of: binary classification, multiclass classification,
# or single column regression
# targets = ["target"]
# targets = ["target1", "target2"]
targets = ["income"]

# feature columns are a list of strings
# if not specified, all columns except `id`, `targets` & `kfold` columns will be used
# features = ["col1", "col2"]
features = None

# categorical_features is a list of strings
# if not specified, categorical columns will be inferred automatically
# categorical_features = ["col1", "col2"]
categorical_features = None

# use_gpu is a boolean
# if not specified, GPU is not used
# use_gpu = True
# use_gpu = False
use_gpu = True

# number of folds to use for cross-validation
# default is 5
num_folds = 5

# random seed for reproducibility
# default is 42
seed = 42

# number of optuna trials to run
# default is 1000
# num_trials = 1000
num_trials = 100

# time_limit for optuna trials in seconds
# if not specified, timeout is not set and all trials are run
# time_limit = None
time_limit = 360

# if fast is set to True, the hyperparameter tuning will use only one fold
# however, the model will be trained on all folds in the end
# to generate OOF predictions and test predictions
# default is False
# fast = False
fast = False

Python API

To train a new model, you can run:

from autoxgb import AutoXGB


# required parameters:
train_filename = "data_samples/binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now it's time to train the model!
axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()
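
If test_filename is provided, per-row predictions are written to test_predictions.csv inside the output folder (see the parameter notes above). A minimal sketch for loading them, assuming the `output = "output"` setting from this example:

import pandas as pd

# written by AutoXGB when test_filename is set; OOF predictions are saved too
preds = pd.read_csv("output/test_predictions.csv")
print(preds.head())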

CLI

Train the model using the autoxgb train command. The parameters are the same as above.

autoxgb train \
 --train_filename datasets/30train.csv \
 --output outputs/30days \
 --test_filename datasets/30test.csv \
 --use_gpu

You can also serve the trained model using the autoxgb serve command.

autoxgb serve --model_path outputs/mll --host 0.0.0.0 --debug
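
The served model is a FastAPI app, so you can send feature rows to it over HTTP. A hedged sketch using the requests library: the port (uvicorn's default of 8000), the /predict path, and the payload shape are assumptions here, not documented behavior — check the interactive docs page FastAPI serves at /docs for the actual schema.

import requests

# hypothetical endpoint and payload; verify against http://0.0.0.0:8000/docs
response = requests.post(
    "http://0.0.0.0:8000/predict",
    json={"age": 39, "education": "Bachelors"},  # example feature values
)
print(response.json())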

To learn more about a command, run `autoxgb <command> --help`. For example:

autoxgb train --help


usage: autoxgb <command> [<args>] train [-h] --train_filename TRAIN_FILENAME [--test_filename TEST_FILENAME] --output
                                        OUTPUT [--task {classification,regression}] [--idx IDX] [--targets TARGETS]
                                        [--num_folds NUM_FOLDS] [--features FEATURES] [--use_gpu] [--fast]
                                        [--seed SEED] [--time_limit TIME_LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  --train_filename TRAIN_FILENAME
                        Path to training file
  --test_filename TEST_FILENAME
                        Path to test file
  --output OUTPUT       Path to output directory
  --task {classification,regression}
                        User defined task type
  --idx IDX             ID column
  --targets TARGETS     Target column(s). If there are multiple targets, separate by ';'
  --num_folds NUM_FOLDS
                        Number of folds to use
  --features FEATURES   Features to use, separated by ';'
  --use_gpu             Whether to use GPU for training
  --fast                Whether to use fast mode for tuning params. Only one fold will be used if fast mode is set
  --seed SEED           Random seed
  --time_limit TIME_LIMIT
                        Time limit for optimization

Comments
  • module 'pyarrow.lib' has no attribute 'MonthDayNanoIntervalArray'

    Getting this error while using the TPS November data in a Kaggle conda env (my GPU is on).

    https://www.kaggle.com/yogeshkalauni/tps-nov-21-auto-xgboost-error

    Getting the error after installing via pip in a Kaggle kernel:

    Collecting autoxgb
      Downloading autoxgb-0.2.1-py3-none-any.whl (20 kB)
    Collecting scikit-learn==1.0.1
      Downloading scikit_learn-1.0.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.2 MB)
    Requirement already satisfied: optuna==2.10.0 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (2.10.0)
    Collecting pyarrow==6.0.0
      Downloading pyarrow-6.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.5 MB)
    Requirement already satisfied: pydantic==1.8.2 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (1.8.2)
    Collecting loguru==0.5.3
      Downloading loguru-0.5.3-py3-none-any.whl (57 kB)
    Collecting xgboost==1.5.0
      Downloading xgboost-1.5.0-py3-none-manylinux2014_x86_64.whl (173.5 MB)
    Collecting pandas==1.3.4
      Downloading pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
    Requirement already satisfied: fastapi==0.70.0 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (0.70.0)
    Requirement already satisfied: uvicorn==0.15.0 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (0.15.0)
    Collecting numpy==1.21.3
      Downloading numpy-1.21.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
    Collecting joblib==1.1.0
      Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
    Requirement already satisfied: starlette==0.16.0 in /opt/conda/lib/python3.7/site-packages (from fastapi==0.70.0->autoxgb) (0.16.0)
    Requirement already satisfied: scipy!=1.4.0 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (1.7.1)
    Requirement already satisfied: cliff in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (3.9.0)
    Requirement already satisfied: colorlog in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (6.5.0)
    Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (21.0)
    Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (4.62.3)
    Requirement already satisfied: alembic in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (1.7.4)
    Requirement already satisfied: cmaes>=0.8.2 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (0.8.2)
    Requirement already satisfied: sqlalchemy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (1.4.25)
    Requirement already satisfied: PyYAML in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (5.4.1)
    Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas==1.3.4->autoxgb) (2.8.0)
    Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas==1.3.4->autoxgb) (2021.1)
    Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.7/site-packages (from pydantic==1.8.2->autoxgb) (3.10.0.2)
    Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==1.0.1->autoxgb) (2.2.0)
    Requirement already satisfied: anyio<4,>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from starlette==0.16.0->fastapi==0.70.0->autoxgb) (3.3.0)
    Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from uvicorn==0.15.0->autoxgb) (8.0.1)
    Requirement already satisfied: asgiref>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from uvicorn==0.15.0->autoxgb) (3.4.1)
    Requirement already satisfied: h11>=0.8 in /opt/conda/lib/python3.7/site-packages (from uvicorn==0.15.0->autoxgb) (0.12.0)
    Requirement already satisfied: sniffio>=1.1 in /opt/conda/lib/python3.7/site-packages (from anyio<4,>=3.0.0->starlette==0.16.0->fastapi==0.70.0->autoxgb) (1.2.0)
    Requirement already satisfied: idna>=2.8 in /opt/conda/lib/python3.7/site-packages (from anyio<4,>=3.0.0->starlette==0.16.0->fastapi==0.70.0->autoxgb) (2.10)
    Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click>=7.0->uvicorn==0.15.0->autoxgb) (4.8.1)
    Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->optuna==2.10.0->autoxgb) (2.4.7)
    Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas==1.3.4->autoxgb) (1.16.0)
    Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.7/site-packages (from sqlalchemy>=1.1.0->optuna==2.10.0->autoxgb) (1.1.1)
    Requirement already satisfied: Mako in /opt/conda/lib/python3.7/site-packages (from alembic->optuna==2.10.0->autoxgb) (1.1.5)
    Requirement already satisfied: importlib-resources in /opt/conda/lib/python3.7/site-packages (from alembic->optuna==2.10.0->autoxgb) (5.2.2)
    Requirement already satisfied: PrettyTable>=0.7.2 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (2.2.0)
    Requirement already satisfied: cmd2>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (2.2.0)
    Requirement already satisfied: autopage>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (0.4.0)
    Requirement already satisfied: stevedore>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (3.4.0)
    Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (5.6.0)
    Requirement already satisfied: colorama>=0.3.7 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (0.4.4)
    Requirement already satisfied: attrs>=16.3.0 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (21.2.0)
    Requirement already satisfied: pyperclip>=1.6 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (1.8.2)
    Requirement already satisfied: wcwidth>=0.1.7 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (0.2.5)
    Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click>=7.0->uvicorn==0.15.0->autoxgb) (3.5.0)
    Requirement already satisfied: MarkupSafe>=0.9.2 in /opt/conda/lib/python3.7/site-packages (from Mako->alembic->optuna==2.10.0->autoxgb) (2.0.1)
    Installing collected packages: numpy, joblib, xgboost, scikit-learn, pyarrow, pandas, loguru, autoxgb
      Attempting uninstall: numpy
        Found existing installation: numpy 1.19.5
        Uninstalling numpy-1.19.5:
          Successfully uninstalled numpy-1.19.5
      Attempting uninstall: joblib
        Found existing installation: joblib 1.0.1
        Uninstalling joblib-1.0.1:
          Successfully uninstalled joblib-1.0.1
      Attempting uninstall: xgboost
        Found existing installation: xgboost 1.4.2
        Uninstalling xgboost-1.4.2:
          Successfully uninstalled xgboost-1.4.2
      Attempting uninstall: scikit-learn
        Found existing installation: scikit-learn 0.23.2
        Uninstalling scikit-learn-0.23.2:
          Successfully uninstalled scikit-learn-0.23.2
      Attempting uninstall: pyarrow
        Found existing installation: pyarrow 5.0.0
        Uninstalling pyarrow-5.0.0:
          Successfully uninstalled pyarrow-5.0.0
      Attempting uninstall: pandas
        Found existing installation: pandas 1.3.3
        Uninstalling pandas-1.3.3:
          Successfully uninstalled pandas-1.3.3
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    tensorflow-io 0.18.0 requires tensorflow-io-gcs-filesystem==0.18.0, which is not installed.
    explainable-ai-sdk 1.3.2 requires xai-image-widget, which is not installed.
    dask-cudf 21.8.3 requires cupy-cuda114, which is not installed.
    cudf 21.8.3 requires cupy-cuda110, which is not installed.
    beatrix-jupyterlab 3.1.1 requires google-cloud-bigquery-storage, which is not installed.
    yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.3 which is incompatible.
    tfx-bsl 1.3.0 requires absl-py<0.13,>=0.9, but you have absl-py 0.14.0 which is incompatible.
    tfx-bsl 1.3.0 requires numpy<1.20,>=1.16, but you have numpy 1.21.3 which is incompatible.
    tfx-bsl 1.3.0 requires pyarrow<3,>=1, but you have pyarrow 6.0.0 which is incompatible.
    tensorflow 2.6.0 requires numpy~=1.19.2, but you have numpy 1.21.3 which is incompatible.
    tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
    tensorflow 2.6.0 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.2 which is incompatible.
    tensorflow-transform 1.3.0 requires absl-py<0.13,>=0.9, but you have absl-py 0.14.0 which is incompatible.
    tensorflow-transform 1.3.0 requires numpy<1.20,>=1.16, but you have numpy 1.21.3 which is incompatible.
    tensorflow-transform 1.3.0 requires pyarrow<3,>=1, but you have pyarrow 6.0.0 which is incompatible.
    tensorflow-io 0.18.0 requires tensorflow<2.6.0,>=2.5.0, but you have tensorflow 2.6.0 which is incompatible.
    pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.3 which is incompatible.
    numba 0.54.0 requires numpy<1.21,>=1.17, but you have numpy 1.21.3 which is incompatible.
    matrixprofile 1.1.10 requires protobuf==3.11.2, but you have protobuf 3.18.1 which is incompatible.
    hypertools 0.7.0 requires scikit-learn!=0.22,<0.24,>=0.19.1, but you have scikit-learn 1.0.1 which is incompatible.
    dask-cudf 21.8.3 requires dask<=2021.07.1,>=2021.6.0, but you have dask 2021.9.1 which is incompatible.
    dask-cudf 21.8.3 requires pandas<1.3.0dev0,>=1.0, but you have pandas 1.3.4 which is incompatible.
    cudf 21.8.3 requires pandas<1.3.0dev0,>=1.0, but you have pandas 1.3.4 which is incompatible.
    apache-beam 2.32.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.4 which is incompatible.
    apache-beam 2.32.0 requires numpy<1.21.0,>=1.14.3, but you have numpy 1.21.3 which is incompatible.
    apache-beam 2.32.0 requires pyarrow<5.0.0,>=0.15.1, but you have pyarrow 6.0.0 which is incompatible.
    apache-beam 2.32.0 requires typing-extensions<3.8.0,>=3.7.0, but you have typing-extensions 3.10.0.2 which is incompatible.
    Successfully installed autoxgb-0.2.1 joblib-1.1.0 loguru-0.5.3 numpy-1.21.3 pandas-1.3.4 pyarrow-6.0.0 scikit-learn-1.0.1 xgboost-1.5.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    
    from autoxgb import AutoXGB
    
    
    # required parameters:
    train_filename = "../input/tabular-playground-series-nov-2021/train.csv"
    output = "outputt"
    
    # optional parameters
    test_filename = '../input/tabular-playground-series-nov-2021/test.csv'
    task = 'classification'
    idx = None
    targets = ["target"]
    features = None
    categorical_features = None
    use_gpu = True
    num_folds = 5
    seed = 42
    num_trials = 100
    time_limit = 7*60*60
    fast = False
    
    # Now it's time to train the model!
    axgb = AutoXGB(
        train_filename=train_filename,
        output=output,
        test_filename=test_filename,
        task=task,
        idx=idx,
        targets=targets,
        features=features,
        categorical_features=categorical_features,
        use_gpu=use_gpu,
        num_folds=num_folds,
        seed=seed,
        num_trials=num_trials,
        time_limit=time_limit,
        fast=fast,
    )
    axgb.train()
    
    2021-11-01 07:03:06.106 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: outputt
    2021-11-01 07:03:06.108 | WARNING  | autoxgb.autoxgb:__post_init__:49 - No id column specified. Will default to `id`.
    2021-11-01 07:03:06.110 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
    2021-11-01 07:03:22.502 | INFO     | autoxgb.utils:reduce_memory_usage:50 - Mem. usage decreased to 117.30 Mb (74.9% reduction)
    2021-11-01 07:03:22.583 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: binary_classification
    2021-11-01 07:03:38.131 | INFO     | autoxgb.utils:reduce_memory_usage:50 - Mem. usage decreased to 105.06 Mb (74.8% reduction)
    2021-11-01 07:03:38.132 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
    2021-11-01 07:03:38.248 | INFO     | autoxgb.autoxgb:_process_data:170 - Encoding target(s)
    2021-11-01 07:03:38.282 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 0 categorical features.
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    /tmp/ipykernel_38/3565386527.py in <module>
         37     fast=fast,
         38 )
    ---> 39 axgb.train()
    
    /opt/conda/lib/python3.7/site-packages/autoxgb/autoxgb.py in train(self)
        244 
        245     def train(self):
    --> 246         self._process_data()
        247         best_params = train_model(self.model_config)
        248         logger.info("Training complete")
    
    /opt/conda/lib/python3.7/site-packages/autoxgb/autoxgb.py in _process_data(self)
        210                     test_fold[categorical_features] = ord_encoder.transform(test_fold[categorical_features].values)
        211                 categorical_encoders[fold] = ord_encoder
    --> 212             fold_train.to_feather(os.path.join(self.output, f"train_fold_{fold}.feather"))
        213             fold_valid.to_feather(os.path.join(self.output, f"valid_fold_{fold}.feather"))
        214             if self.test_filename is not None:
    
    /opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        205                 else:
        206                     kwargs[new_arg_name] = new_arg_value
    --> 207             return func(*args, **kwargs)
        208 
        209         return cast(F, wrapper)
    
    /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in to_feather(self, path, **kwargs)
       2517         from pandas.io.feather_format import to_feather
       2518 
    -> 2519         to_feather(self, path, **kwargs)
       2520 
       2521     @doc(
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/feather_format.py in to_feather(df, path, storage_options, **kwargs)
         44     """
         45     import_optional_dependency("pyarrow")
    ---> 46     from pyarrow import feather
         47 
         48     if not isinstance(df, DataFrame):
    
    /opt/conda/lib/python3.7/site-packages/pyarrow/feather.py in <module>
         23                          concat_tables, schema)
         24 import pyarrow.lib as ext
    ---> 25 from pyarrow import _feather
         26 from pyarrow._feather import FeatherError  # noqa: F401
         27 from pyarrow.vendored.version import Version
    
    /opt/conda/lib/python3.7/site-packages/pyarrow/_feather.pyx in init pyarrow._feather()
    
    AttributeError: module 'pyarrow.lib' has no attribute 'MonthDayNanoIntervalArray'
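
    A likely explanation (an assumption, not something confirmed in this thread): pip upgraded pyarrow from 5.0.0 to 6.0.0 inside an already-running kernel, so the compiled extension that was imported earlier no longer matches the newly installed Python modules that reference MonthDayNanoIntervalArray. Restarting the notebook kernel after installation, so that pyarrow 6.0.0 is imported cleanly, usually clears this class of error:

    pip install autoxgb   # pulls in pyarrow==6.0.0
    # then restart the kernel before running `import autoxgb`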
    
    opened by Ankitkalauni 4
  • TypeError: __init__() got an unexpected keyword argument 'handle_unknown'

    Hey, I got this error and have no idea where it comes from. My manual XGBoost model works without problems on exactly the same dataset.

    I don't really know how to use GitHub, so sorry if this does not match your standards.

    [Screenshot from 2021-10-31 at 17:34:28 showing the error]
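
    For context (hedged): OrdinalEncoder only accepts a handle_unknown argument from scikit-learn 0.24 onward, and autoxgb pins scikit-learn==1.0.1, so this error suggests an older scikit-learn is being imported instead of the pinned version. A quick check and a possible fix:

    python -c "import sklearn; print(sklearn.__version__)"
    pip install --force-reinstall scikit-learn==1.0.1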
    opened by bigcharless 3
  • AttributeError: dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found

    Hi

    As per the subject, I am getting this error when running locally:

    
    2021-11-01 15:45:04.651 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: output3
    2021-11-01 15:45:04.652 | WARNING  | autoxgb.autoxgb:__post_init__:49 - No id column specified. Will default to `id`.
    2021-11-01 15:45:04.653 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
    2021-11-01 15:45:04.885 | INFO     | autoxgb.utils:reduce_memory_usage:48 - Mem. usage decreased to 2.19 Mb (76.0% reduction)
    2021-11-01 15:45:04.891 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: multi_class_classification
    2021-11-01 15:45:04.892 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
    2021-11-01 15:45:04.922 | INFO     | autoxgb.autoxgb:_process_data:170 - Encoding target(s)
    2021-11-01 15:45:04.931 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 0 categorical features.
    2021-11-01 15:45:05.054 | INFO     | autoxgb.autoxgb:_process_data:236 - Model config: train_filename='train.csv' test_filename=None idx='id' targets=['label'] problem_type=<ProblemType.multi_class_classification: 2> output='output3' features=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'y1', 'z1', 'z2', 'z3', 'z4'] num_folds=5 use_gpu=False seed=42 categorical_features=[] num_trials=100 time_limit=360 fast=False
    2021-11-01 15:45:05.054 | INFO     | autoxgb.autoxgb:_process_data:237 - Saving model config
    2021-11-01 15:45:05.055 | INFO     | autoxgb.autoxgb:_process_data:241 - Saving encoders
    [I 2021-11-01 15:45:05,230] A new study created in RDB with name: autoxgb
    [W 2021-11-01 15:45:05,339] Trial 0 failed because of the following error: AttributeError('dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found')
    Traceback (most recent call last):
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
        value_or_values = func(trial)
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/utils.py", line 172, in optimize
        model.fit(
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py", line 506, in inner_f
        return f(**kwargs)
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py", line 1231, in fit
        train_dmatrix, evals = _wrap_evaluation_matrices(
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py", line 286, in _wrap_evaluation_matrices
        train_dmatrix = create_dmatrix(
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py", line 1245, in <lambda>
        create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py", line 506, in inner_f
        return f(**kwargs)
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py", line 616, in __init__
        handle, feature_names, feature_types = dispatch_data_backend(
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py", line 707, in dispatch_data_backend
        return _from_pandas_df(data, enable_categorical, missing, threads,
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py", line 299, in _from_pandas_df
        return _from_numpy_array(data, missing, nthread, feature_names,
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py", line 179, in _from_numpy_array
        _LIB.XGDMatrixCreateFromDense(
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
        func = self.__getitem__(name)
      File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
        func = self._FuncPtr((name_or_ordinal, self))
    AttributeError: dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    /var/folders/pp/ym01m3sx0hg3my_gzpsdl8680000gp/T/ipykernel_728/1462055845.py in <module>
         16     fast=fast,
         17 )
    ---> 18 axgb.train()
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/autoxgb.py in train(self)
        245     def train(self):
        246         self._process_data()
    --> 247         best_params = train_model(self.model_config)
        248         logger.info("Training complete")
        249         self.predict(best_params)
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/utils.py in train_model(model_config)
        211         load_if_exists=True,
        212     )
    --> 213     study.optimize(optimize_func, n_trials=model_config.num_trials, timeout=model_config.time_limit)
        214     return study.best_params
        215 
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
        398             )
        399 
    --> 400         _optimize(
        401             study=self,
        402             func=func,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
         64     try:
         65         if n_jobs == 1:
    ---> 66             _optimize_sequential(
         67                 study,
         68                 func,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar)
        161 
        162         try:
    --> 163             trial = _run_trial(study, func, catch)
        164         except Exception:
        165             raise
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
        262 
        263     if state == TrialState.FAIL and func_err is not None and not isinstance(func_err, catch):
    --> 264         raise func_err
        265     return trial
        266 
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
        211 
        212     try:
    --> 213         value_or_values = func(trial)
        214     except exceptions.TrialPruned as e:
        215         # TODO(mamu): Handle multi-objective cases.
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/utils.py in optimize(trial, xgb_model, use_predict_proba, eval_metric, model_config)
        170 
        171         else:
    --> 172             model.fit(
        173                 xtrain,
        174                 ytrain,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
        504         for k, arg in zip(sig.parameters, args):
        505             kwargs[k] = arg
    --> 506         return f(**kwargs)
        507 
        508     return inner_f
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
       1229 
       1230         model, feval, params = self._configure_fit(xgb_model, eval_metric, params)
    -> 1231         train_dmatrix, evals = _wrap_evaluation_matrices(
       1232             missing=self.missing,
       1233             X=X,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, enable_categorical, label_transform)
        284 
        285     """
    --> 286     train_dmatrix = create_dmatrix(
        287         data=X,
        288         label=label_transform(y),
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py in <lambda>(**kwargs)
       1243             eval_group=None,
       1244             eval_qid=None,
    -> 1245             create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
       1246             enable_categorical=self.enable_categorical,
       1247             label_transform=label_transform,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
        504         for k, arg in zip(sig.parameters, args):
        505             kwargs[k] = arg
    --> 506         return f(**kwargs)
        507 
        508     return inner_f
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, group, qid, label_lower_bound, label_upper_bound, feature_weights, enable_categorical)
        614             return
        615 
    --> 616         handle, feature_names, feature_types = dispatch_data_backend(
        617             data,
        618             missing=self.missing,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
        705         return _from_tuple(data, missing, threads, feature_names, feature_types)
        706     if _is_pandas_df(data):
    --> 707         return _from_pandas_df(data, enable_categorical, missing, threads,
        708                                feature_names, feature_types)
        709     if _is_pandas_series(data):
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py in _from_pandas_df(data, enable_categorical, missing, nthread, feature_names, feature_types)
        297     data, feature_names, feature_types = _transform_pandas_df(
        298         data, enable_categorical, feature_names, feature_types)
    --> 299     return _from_numpy_array(data, missing, nthread, feature_names,
        300                              feature_types)
        301 
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py in _from_numpy_array(data, missing, nthread, feature_names, feature_types)
        177     config = bytes(json.dumps(args), "utf-8")
        178     _check_call(
    --> 179         _LIB.XGDMatrixCreateFromDense(
        180             _array_interface(data),
        181             config,
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py in __getattr__(self, name)
        384         if name.startswith('__') and name.endswith('__'):
        385             raise AttributeError(name)
    --> 386         func = self.__getitem__(name)
        387         setattr(self, name, func)
        388         return func
    
    ~/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py in __getitem__(self, name_or_ordinal)
        389 
        390     def __getitem__(self, name_or_ordinal):
    --> 391         func = self._FuncPtr((name_or_ordinal, self))
        392         if not isinstance(name_or_ordinal, int):
        393             func.__name__ = name_or_ordinal
    
    AttributeError: dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found
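
    A plausible cause (an assumption based on the traceback): the Python wrapper is xgboost 1.5.0, but the native library it loads does not export XGDMatrixCreateFromDense, i.e. the wrapper and the compiled libxgboost come from different versions. A clean reinstall so that the two match may help:

    pip uninstall -y xgboost
    pip install xgboost==1.5.0   # the version autoxgb pins
    python -c "import xgboost; print(xgboost.__version__)"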
    
    opened by chandrad 2
  • Add single column regression.

    • add example for single-column regression (I think we should name it single-output and multi-output because single/multi "column" is a bit misleading)
    • add dummy test data for multi col regression (exactly same as train data with targets removed)
    • some minor changes in setup.py
    opened by INF800 0
  • Use `suggest_float` instead of `suggest_loguniform`.

    Thank you for your great project! I just found a point where I could contribute, although I'm not sure if this repository accepts pull requests. Please feel free to close the PR if you don't need it.

    Motivation

    • Trial.suggest_loguniform will be deprecated according to https://github.com/optuna/optuna/issues/2939.
    • It recommends Trial.suggest_float(..., log=True) instead of it.

    Changes

    • This PR simply replaces Trial.suggest_loguniform with Trial.suggest_float.
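
    For reference, the substitution looks like this (the parameter name and bounds are illustrative, not taken from this repository):

    # inside an Optuna objective(trial):
    # before: deprecated in Optuna
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-2, 0.25)
    # after: the equivalent call with the unified float API
    learning_rate = trial.suggest_float("learning_rate", 1e-2, 0.25, log=True)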
    opened by toshihikoyanase 0
  • TypeError: argument of type 'method' is not iterable

    Hi, I tried autoxgb but it very quickly returns this error:

    2022-03-08 09:22:21.269 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: output2
    2022-03-08 09:22:21.276 | WARNING  | autoxgb.autoxgb:__post_init__:49 - No id column specified. Will default to `id`.
    2022-03-08 09:22:21.283 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data

    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>()
         37     fast=fast,
         38 )
    ---> 39 axgb.train()

    10 frames
    /usr/local/lib/python3.7/dist-packages/autoxgb/autoxgb.py in train(self)
        244 
        245     def train(self):
    --> 246         self._process_data()
        247         best_params = train_model(self.model_config)
        248         logger.info("Training complete")

    /usr/local/lib/python3.7/dist-packages/autoxgb/autoxgb.py in _process_data(self)
        148     def _process_data(self):
        149         logger.info("Reading training data")
    --> 150         train_df = pd.read_csv(self.train_filename)
        151         train_df = reduce_memory_usage(train_df)
        152         problem_type = self._determine_problem_type(train_df)

    [... pandas read_csv internals: _decorators.py wrapper -> readers.py read_csv -> _read -> TextFileReader.__init__ -> _make_engine -> c_parser_wrapper.py __init__ -> base_parser.py _open_handles -> common.py get_handle ...]

    /usr/local/lib/python3.7/dist-packages/pandas/io/common.py in _is_binary_mode(handle, mode)
        960     # classes that expect bytes
        961     binary_classes = (BufferedIOBase, RawIOBase)
    --> 962     return isinstance(handle, binary_classes) or "b" in getattr(handle, "mode", mode)

    TypeError: argument of type 'method' is not iterable

    I don't know why. Thank you all for your help!

    My code:

    from autoxgb import AutoXGB

    # required parameters:
    train_filename = df.iloc[:round(df.shape[0]*.8)]
    output = "output2"

    # optional parameters
    test_filename = None
    task = None
    idx = df.index
    targets = ["Goal"]
    features = None
    categorical_features = None
    use_gpu = False
    num_folds = 5
    seed = 42
    num_trials = 100
    time_limit = 360
    fast = False

    # Now it's time to train the model!
    axgb = AutoXGB(
        train_filename=train_filename,
        output=output,
        test_filename=test_filename,
        task=task,
        idx=idx,
        targets=targets,
        features=features,
        categorical_features=categorical_features,
        use_gpu=use_gpu,
        num_folds=num_folds,
        seed=seed,
        num_trials=num_trials,
        time_limit=time_limit,
        fast=fast,
    )
    axgb.train()
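
    The traceback shows the cause: AutoXGB expects train_filename to be a path to a CSV file and calls pd.read_csv() on it, but here it receives a DataFrame slice. pandas then evaluates "b" in getattr(handle, "mode", mode); getattr returns the DataFrame's .mode method, and membership testing on a method raises "argument of type 'method' is not iterable". (idx should likewise be a column name, not an index object.) A hedged fix, keeping the other parameters as above: write the split to disk and pass the path.

    # AutoXGB takes file paths, not DataFrames: persist the split, then train
    train_part = df.iloc[:round(df.shape[0] * 0.8)]
    train_part.to_csv("train_split.csv", index=False)  # keeps "Goal" as a column

    axgb = AutoXGB(
        train_filename="train_split.csv",  # a path, as in the README examples
        output="output2",
        test_filename=None,
        task=None,
        idx=None,              # a column *name*; None lets AutoXGB add `id`
        targets=["Goal"],
        features=None,
        categorical_features=None,
        use_gpu=False,
        num_folds=5,
        seed=42,
        num_trials=100,
        time_limit=360,
        fast=False,
    )
    axgb.train()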

    opened by ludomare 1
  • time_limit doesn't stop the training

    I have been working on a TPS competition on Kaggle and tried autoxgb. I set the time limit to 3600*4, but the training didn't stop at 4 hours. It is now at 6.5 hours and still going. Am I doing anything wrong?

    P.S. The first trial took 4 hours to complete.
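
    For context (based on the autoxgb/utils.py frame visible in an earlier traceback, plus documented Optuna behavior): time_limit is passed straight through to Optuna as the timeout of study.optimize(). Optuna checks the timeout only between trials and never interrupts a trial that is already running, and the final all-folds training and prediction AutoXGB performs after tuning is not covered by the limit at all, so a single 4-hour trial can push the total run well past time_limit.

    # from autoxgb/utils.py (seen in a traceback above): the limit bounds only
    # the search loop; it is checked between trials, not inside them
    study.optimize(optimize_func, n_trials=model_config.num_trials, timeout=model_config.time_limit)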

    opened by Kyrpel 0
  • best params and model

    Hi,

    Thanks for building a very useful package. I have two simple questions:

    1. How come only the following params are tuned?

       {'colsample_bytree': 0.18270180565544739, 'early_stopping_rounds': 401, 'learning_rate': 0.013529250923369278, 'max_depth': 6, 'n_estimators': 20000, 'reg_alpha': 0.0019387086612090178, 'reg_lambda': 5.879563892375361e-08, 'subsample': 0.8925701729066172}

       What about gamma and other XGBoost parameters? Are they assumed to be default values?

    2. How do I access the best model from the output directory? I plugged the above best params into my own XGBoost model but didn't get the same result that autoxgb showed. Is there a way to access these models and/or the best model in the output directory, so I can run the model on any data and see the results?

    thank you so much, any help will be greatly appreciated.

    P.S. Are there any docs on how to use the output files? There is a lot of useful info there, but I don't know how to access it smartly.
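
    On reproducing results (hedged): AutoXGB scores its models through its own pipeline — 5-fold cross-validation on its own splits, ordinal-encoded categoricals, memory-reduced dtypes — so plugging the best params into a standalone XGBoost model trained on differently prepared data will not match exactly; parameters that are not tuned, such as gamma, are presumably left at XGBoost's defaults. To see which artifacts are available, listing the output directory is a reasonable first step (the log messages in the issues above mention fold feather files, encoders and a model config):

    import os

    # discover the artifact names AutoXGB actually wrote before loading anything
    print(os.listdir("output"))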

    opened by mchen172 0
Owner

abhishek thakur (Kaggle: www.kaggle.com/abhishek)