[UNMAINTAINED] Automated machine learning for analytics & production

Overview

auto_ml

Automated machine learning for production and analytics

Installation

  • pip install auto_ml

Getting started

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset

df_train, df_test = get_boston_dataset()

column_descriptions = {
    'MEDV': 'output',
    'CHAS': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)

ml_predictor.train(df_train)

ml_predictor.score(df_test, df_test.MEDV)

Show off some more features!

auto_ml is designed for production. Here's an example that includes serializing and loading the trained model, then getting predictions on single dictionaries, roughly the process you'd likely follow to deploy the trained model.

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset
from auto_ml.utils_models import load_ml_model

# Load data
df_train, df_test = get_boston_dataset()

# Tell auto_ml which column is 'output'
# Also note columns that aren't purely numerical
# Examples include ['nlp', 'date', 'categorical', 'ignore']
column_descriptions = {
    'MEDV': 'output',
    'CHAS': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)

ml_predictor.train(df_train)

# Score the model on test data
test_score = ml_predictor.score(df_test, df_test.MEDV)

# auto_ml is specifically tuned for running in production
# It can get predictions on an individual row (passed in as a dictionary)
# A single prediction like this takes ~1 millisecond
# Here we will demonstrate saving the trained model, and loading it again
file_name = ml_predictor.save()

trained_model = load_ml_model(file_name)

# .predict and .predict_proba take in either:
# A pandas DataFrame
# A list of dictionaries
# A single dictionary (optimized for speed in production environments)
predictions = trained_model.predict(df_test)
print(predictions)
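The three input shapes listed in the comments above can all be reduced to a list of row dictionaries. The following is a minimal sketch of that idea, not auto_ml's actual code: `normalize_rows` is a hypothetical helper, and a DataFrame is recognized only by the presence of a `to_dict` method.

```python
def normalize_rows(data):
    """Return a list of row dicts for a single dict, a list of dicts,
    or a DataFrame-like object (hypothetical helper, for illustration)."""
    if isinstance(data, dict):
        # A single row, as you would receive it in a live request
        return [data]
    if hasattr(data, "to_dict"):
        # pandas.DataFrame.to_dict(orient="records") yields one dict per row
        return data.to_dict(orient="records")
    # Anything else is assumed to be an iterable of row dictionaries
    return [dict(row) for row in data]


rows = normalize_rows({"CHAS": 0, "RM": 6.5})
# Each row could then be passed on, e.g. trained_model.predict(row)
```

Keeping the single-dictionary case first mirrors the point made above: it is the shape you are most likely to see in production.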

3rd Party Packages- Deep Learning with TensorFlow & Keras, XGBoost, LightGBM, CatBoost

auto_ml has all of these awesome libraries integrated! Generally, just pass one of them in for model_names:

ml_predictor.train(data, model_names=['DeepLearningClassifier'])

Available options are

  • DeepLearningClassifier and DeepLearningRegressor
  • XGBClassifier and XGBRegressor
  • LGBMClassifier and LGBMRegressor
  • CatBoostClassifier and CatBoostRegressor

All of these projects are ready for production: each has a prediction time in the 1 millisecond range for a single prediction, and each can be serialized to disk and loaded into a new environment after training.

Depending on your machine, they can occasionally be difficult to install, so they are not included in auto_ml's default installation. You are responsible for installing them yourself. auto_ml will run fine without them installed (we check what's installed before choosing which algorithm to use).
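If you want to see for yourself which of these optional packages are importable before choosing model_names, a check along the following lines may help. This is a rough sketch, not how auto_ml does its own check: the mapping from import name to model names is an assumption based on the list above, and `available_model_names` is a hypothetical helper.

```python
import importlib.util

# Import names are the packages' usual ones (an assumption here);
# the model names come from the list of available options above.
OPTIONAL_MODELS = {
    "tensorflow": ["DeepLearningClassifier", "DeepLearningRegressor"],
    "xgboost": ["XGBClassifier", "XGBRegressor"],
    "lightgbm": ["LGBMClassifier", "LGBMRegressor"],
    "catboost": ["CatBoostClassifier", "CatBoostRegressor"],
}


def is_installed(package_name):
    """True if the top-level package can be found without importing it."""
    try:
        return importlib.util.find_spec(package_name) is not None
    except (ImportError, ValueError):
        return False


def available_model_names(is_installed_fn=is_installed):
    """Collect the model names whose backing package is installed."""
    names = []
    for package, model_names in OPTIONAL_MODELS.items():
        if is_installed_fn(package):
            names.extend(model_names)
    return names


print(available_model_names())
```

The deep learning models also need Keras, so a real check would likely look for both packages.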

Feature Responses

Get linear-model-esque interpretations from non-linear models. See the docs for more information and caveats.

Classification

Binary and multiclass classification are both supported. Note that for now, labels must be integers (0 and 1 for binary classification). auto_ml will automatically detect whether it is a binary or multiclass classification problem. You just have to pass in type_of_estimator='classifier':

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)
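Because labels must be integers for now, string labels need to be mapped to integers before training. Here is a minimal sketch of one way to do that; `encode_labels` is a hypothetical helper, not part of auto_ml, and the example labels are made up.

```python
def encode_labels(labels):
    """Map arbitrary hashable labels to consecutive integers starting at 0.

    Returns the encoded list plus the mapping, so predictions can be
    translated back to the original labels later.
    """
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


encoded, mapping = encode_labels(["ham", "spam", "ham"])
print(encoded)  # [0, 1, 0]
print(mapping)  # {'ham': 0, 'spam': 1}
```

Keep the mapping around alongside the saved model so that integer predictions can be turned back into readable labels.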

Feature Learning

Also known as "finally found a way to make this deep learning stuff useful for my business". Deep Learning is great at learning important features from your data. But the way it turns these learned features into a final prediction is relatively basic. Gradient boosting is great at turning features into accurate predictions, but it doesn't do any feature learning.

In auto_ml, you can now automatically use both types of models for what they're great at. If you pass feature_learning=True, fl_data=some_dataframe to .train(), we will do exactly that: train a deep learning model on your fl_data. We won't ask it for predictions (the standard stacking approach). Instead, we'll use its penultimate layer to get its 10 most useful features. Then we'll train a gradient boosted model (or any other model of your choice) on those features plus all the original features.

Across some problems, we've witnessed this lead to a 5% gain in accuracy, while still making predictions in 1-4 milliseconds, depending on model complexity.

ml_predictor.train(df_train, feature_learning=True, fl_data=df_fl_data)

This feature only supports regression and binary classification currently. The rest of auto_ml supports multiclass classification.
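To produce an fl_data set, one option is to hold out part of your rows before training. The sketch below assumes you want fl_data to be disjoint from the rows used for the final model, which is an assumption here rather than something stated above. `split_for_feature_learning` and the 30% fraction are hypothetical.

```python
import random


def split_for_feature_learning(rows, fl_fraction=0.3, seed=42):
    """Shuffle rows and split them into (fl_rows, train_rows).

    The two parts never share a row, so the deep learning model and
    the final model are fit on different data.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fl_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = [{"id": i, "MEDV": float(i)} for i in range(10)]
fl_rows, train_rows = split_for_feature_learning(rows)
# These would then be passed as in the call above:
# ml_predictor.train(train_rows, feature_learning=True, fl_data=fl_rows)
```

The fixed seed only makes the split repeatable; any held-out split would serve the same purpose.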

Categorical Ensembling

Ever wanted to train one model for every store/customer, but didn't want to maintain hundreds of thousands of independent models? With ml_predictor.train_categorical_ensemble(), we will handle that for you. You'll still have just one consistent API, ml_predictor.predict(data), but behind this single API will be one model for each category you included in your training data.

Just tell us which column holds the category you want to split on, and we'll handle the rest. As always, saving the model, loading it in a different environment, and getting speedy predictions live in production is baked right in.

ml_predictor.train_categorical_ensemble(df_train, categorical_column='store_name')
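To make the "one model per category behind a single API" idea concrete, here is a toy sketch. It is not auto_ml's implementation: `MeanModel` and `CategoricalEnsemble` are hypothetical stand-ins, each per-category "model" just predicts the mean target, and the fallback for unseen categories is an illustrative choice.

```python
from collections import defaultdict
from statistics import mean


class MeanModel:
    """Stand-in model that always predicts the mean of its training targets."""

    def fit(self, targets):
        self.value = mean(targets)
        return self

    def predict(self, row):
        return self.value


class CategoricalEnsemble:
    """One model per category value, exposed through a single predict()."""

    def __init__(self, categorical_column, output_column):
        self.categorical_column = categorical_column
        self.output_column = output_column

    def fit(self, rows):
        grouped = defaultdict(list)
        for row in rows:
            grouped[row[self.categorical_column]].append(row[self.output_column])
        self.models = {cat: MeanModel().fit(t) for cat, t in grouped.items()}
        # Fallback for categories never seen in training (illustrative choice)
        self.fallback = MeanModel().fit([row[self.output_column] for row in rows])
        return self

    def predict(self, row):
        category = row.get(self.categorical_column)
        return self.models.get(category, self.fallback).predict(row)


rows = [
    {"store_name": "a", "sales": 10.0},
    {"store_name": "a", "sales": 20.0},
    {"store_name": "b", "sales": 100.0},
]
ensemble = CategoricalEnsemble("store_name", "sales").fit(rows)
print(ensemble.predict({"store_name": "a"}))
```

The caller only ever sees `predict`, while the routing by category stays an internal detail.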

More details available in the docs

http://auto-ml.readthedocs.io/en/latest/

Advice

Before you go any further, try running the code. Load up some data (either a DataFrame, or a list of dictionaries, where each dictionary is a row of data). Make a column_descriptions dictionary that tells us which attribute name in each row represents the value we're trying to predict. Pass all that into auto_ml, and see what happens!

Everything else in these docs assumes you have done at least the above. Start there and everything else will build on top. But this part gets you the output you're probably interested in, without unnecessary complexity.
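If your data is a list of dictionaries, a first column_descriptions dictionary can be drafted with a small helper like the one below. This is only a sketch: `build_column_descriptions` is hypothetical, and treating every string-valued column as 'categorical' is a crude heuristic you should review by hand (some columns may really be 'nlp', 'date', or 'ignore').

```python
def build_column_descriptions(rows, output_column):
    """Draft a column_descriptions dict from a list of row dictionaries.

    The output column is marked 'output'; any other column holding a
    string value is guessed to be 'categorical'. Numeric columns are
    left out, since they need no description.
    """
    descriptions = {output_column: "output"}
    for row in rows:
        for name, value in row.items():
            if name != output_column and isinstance(value, str):
                descriptions[name] = "categorical"
    return descriptions


rows = [
    {"price": 10.0, "store": "a", "size": 3},
    {"price": 12.5, "store": "b", "size": 4},
]
column_descriptions = build_column_descriptions(rows, output_column="price")
print(column_descriptions)  # {'price': 'output', 'store': 'categorical'}
```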

Docs

The full docs are available at https://auto_ml.readthedocs.io. Again though, I'd strongly recommend running this on an actual dataset before referencing the docs any further.

What this project does

Automates the whole machine learning process, making it super easy to use for both analytics and getting real-time predictions in production.

A quick overview of the buzzwords this project automates:

  • Analytics (pass in data, and auto_ml will tell you the relationship of each variable to what it is you're trying to predict).
  • Feature Engineering (particularly around dates, and NLP).
  • Robust Scaling (scaling all values into the range 0 to 1, in a way that is robust to outliers and works with sparse data).
  • Feature Selection (picking only the features that actually prove useful).
  • Data formatting (turning a DataFrame or a list of dictionaries into a sparse matrix, one-hot encoding categorical variables, taking the natural log of y for regression problems, etc).
  • Model Selection (which model works best for your problem- we try roughly a dozen apiece for classification and regression problems, including favorites like XGBoost if it's installed on your machine).
  • Hyperparameter Optimization (what hyperparameters work best for that model).
  • Big Data (feed it lots of data- it's fairly efficient with resources).
  • Unicorns (you could conceivably train it to predict what is a unicorn and what is not).
  • Ice Cream (mmm, tasty...).
  • Hugs (this makes it much easier to do your job, hopefully leaving you more time to hug those you care about).
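As an illustration of the Robust Scaling item in the list above, one common way to scale into the 0 to 1 range without letting outliers dominate is to scale by a pair of percentiles instead of the minimum and maximum, then clip. This is a minimal sketch under that assumption. The 5th/95th percentile choice is hypothetical and not necessarily what auto_ml uses.

```python
def percentile(sorted_values, fraction):
    """Linearly interpolated percentile of a sorted, non-empty list."""
    position = (len(sorted_values) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def robust_scale(values, low=0.05, high=0.95):
    """Scale values to [0, 1] using percentiles, clipping anything outside."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, low), percentile(ordered, high)
    if hi == lo:
        # Constant (or near-constant) column: nothing to scale
        return [0.0 for _ in values]
    return [min(1.0, max(0.0, (v - lo) / (hi - lo))) for v in values]


values = list(range(1, 100)) + [10_000]  # one extreme outlier
scaled = robust_scale(values)
print(scaled[49], scaled[-1])
```

With plain min-max scaling, the single outlier would squash every other value close to 0. Here the outlier is simply clipped to 1.0 and the rest of the column keeps its spread.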

Running the tests

If you've cloned the source code and are making any changes (highly encouraged!), or just want to make sure everything works in your environment, run nosetests -v tests.

CI is also set up, so if you're developing on this, you can just open a PR, and the tests will run automatically on Travis-CI.

The tests are relatively comprehensive, though as with everything with auto_ml, I happily welcome your contributions here!

Comments
  • Comparison with other automatic ML libraries?

    First, thank you very much for the hard work and awesome project. I think it will get a lot of use in my workflow.

    I was surveying the landscape of automatic ML solutions, and found your package along with tpot and auto-sklearn. I am trying to figure out what kind of strengths and weaknesses all these packages have. Would you mind discussing what auto_ml does differently and/or better?

    Thanks again.

    opened by sergeyf 12
  • ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape

    When I train with DeepLearningRegressor with a 5k dataset everything works fine but when I do it on 50k dataset I get this error.

    Caused by op u'dense_1/random_normal/RandomStandardNormal', defined at:
      File "salary_predict.py", line 38, in <module>
        ml_predictor.train(df_train, model_names=['DeepLearningRegressor'])
      File "/home/ubuntu/deeparted/auto_ml/predictor.py", line 471, in train
        self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
      File "/home/ubuntu/deeparted/auto_ml/predictor.py", line 674, in train_ml_estimator
        trained_final_model = self.fit_single_pipeline(X_df, y, estimator_names[0], feature_learning=feature_learning)
      File "/home/ubuntu/deeparted/auto_ml/predictor.py", line 548, in fit_single_pipeline
        ppl.fit(X_df, y)
      File "/home/ubuntu/deeparted/auto_ml/utils_model_training.py", line 88, in fit
        self.model.fit(X_fit, y, callbacks=[early_stopping])
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/wrappers/scikit_learn.py", line 138, in fit
        self.model = self.build_fn(**self.filter_sk_params(self.build_fn))
      File "/home/ubuntu/deeparted/auto_ml/utils_models.py", line 559, in make_deep_learning_model
        model.add(Dense(hidden_layers[0], input_dim=num_cols, kernel_initializer='normal', kernel_regularizer=regularizers.l2(0.01)))
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/models.py", line 433, in add
        layer(x)
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 558, in __call__
        self.build(input_shapes[0])
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/layers/core.py", line 827, in build
        constraint=self.kernel_constraint)
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
        return func(*args, **kwargs)
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 391, in add_weight
        weight = K.variable(initializer(shape), dtype=dtype, name=name)
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/initializers.py", line 75, in __call__
        dtype=dtype, seed=self.seed)
      File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3356, in random_normal
        dtype=dtype, seed=seed)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 76, in random_normal
        shape_tensor, dtype, seed=seed1, seed2=seed2)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 220, in _random_standard_normal
        name=name)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
        op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2514, in create_op
        original_op=self._default_original_op, op_def=op_def)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
        self._traceback = _extract_stack()
    

    ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[47302,1] [[Node: dense_1/random_normal/RandomStandardNormal = RandomStandardNormalT=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5687716, _device="/job:localhost/replica:0/task:0/gpu:0"]]

    TensorFlow version: 1.1.0
    CUDA: 8.0
    cuDNN: 5.1.10

    System config: I'm using a P2 instance (p2.8xlarge): 8 NVIDIA K80 GPUs (192 GB), 64 vCPUs, 732 GiB of host memory.

    Training: batch_size: 50; dataset size: 50k; number of columns: 4 (1 output, 2 categorical, 1 float).

    GitHub issues: https://github.com/tensorflow/tensorflow/issues/4735, https://github.com/tensorflow/tensorflow/issues/1355, and many more on GitHub.

    None of these solved the issue. Can anyone help me with this?

    opened by sameerpallav 12
  • User validation on fl_data

    Do you have an example of using feature learning? I assumed I could just do feature_learning on the training dataset but I get an error like so when running it on the boston dataset:

    ml_predictor.train(df_train, feature_learning=True, fl_data=df_train)

    
    Traceback (most recent call last):
      File "/home/data/.local/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
        return self._engine.get_loc(key)
      File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
      File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
      File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
      File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
    KeyError: 'MEDV'
    
    opened by calz1 11
  • TypeError: cannot perform reduce with flexible type OR AttributeError: 'Predictor' object has no attribute 'grid_search_pipelines'

    Very cool package!

    I am trying out auto_ml with this dataset on SMS spam. I added a header row to the file to give it column names and then do the following:

    import pandas as p  
    import dill  
    from sklearn.model_selection import train_test_split   
    from auto_ml import Predictor 
    
    df = p.read_table('/home/data/auto_ml/sms.txt')
    df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)
    column_descriptions = {
      'spam': 'output'
      , 'text': 'nlp'
    }
    
    ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)
    ml_predictor.train(df_train)
    

    You can see it sort of works, because it reports feature importances, but then it fails with:

    ....
    nlp_text_txt: 0.0373
    nlp_text_free: 0.0441
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/local/lib/python2.7/dist-packages/auto_ml/predictor.py", line 597, in train
        if len(self.grid_search_pipelines) > 1:
    AttributeError: 'Predictor' object has no attribute 'grid_search_pipelines'

    Originally I was trying: ml_predictor.train(df_train,ml_for_analytics=True)

    and got:

    test_score = ml_predictor.score(df_test, df_test.spam)
    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/local/lib/python2.7/dist-packages/auto_ml/predictor.py", line 1014, in score
        score, probas = self._scorer.score(self.trained_pipeline, X_test, y_test, advanced_scoring=advanced_scoring)
      File "/usr/local/lib/python2.7/dist-packages/auto_ml/utils_scoring.py", line 268, in score
        score = self.scoring_func(y, predictions)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 1884, in brier_score_loss
        pos_label = y_true.max()
      File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 26, in _amax
        return umr_maximum(a, axis, None, out, keepdims)
    TypeError: cannot perform reduce with flexible type

    opened by calz1 11
  • error during LGBM predict_proba

    Hi all,

    After long hours of training my model with LightGBM, I ran predict_proba and at first ran into the data_rate_limit in Jupyter. I then raised that limit and had to train the model again, but this time I ran into another error:

    Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

    Can someone help me, please? Thanks.

    opened by vkocaman 10
  • AttributeError: 'XGBRegressor' object has no attribute 'get_fscore'

    Testing out auto_ml with XGBoost and ran into this issue. This is against a fresh clone of the XGBoost repository so it looks like their API changed.

    predictor.train(x_train, verbose=True, model_names=['XGBRegressor'])

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in _get_xgb_feat_importances(self, clf)
        890             # xgb.XGBClassifier.fit() or xgb.XGBRegressor().fit()
    --> 891             fscore = clf.booster().get_fscore()
        892         except:
    
    TypeError: 'str' object is not callable
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    <ipython-input-37-eafdc24b187b> in <module>()
    ----> 1 predictor.train(x_train, verbose=True, model_names=['XGBRegressor'])
    
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in train(self, raw_training_data, user_input_func, optimize_final_model, write_gs_param_results_to_file, perform_feature_selection, verbose, X_test, y_test, ml_for_analytics, take_log_of_y, model_names, perform_feature_scaling, calibrate_final_model, _scorer, scoring, verify_features, training_params, grid_search_params, compare_all_models, cv, feature_learning, fl_data)
        469 
        470         # This is our main logic for how we train the final model
    --> 471         self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
        472 
        473         # Calibrate the probability predictions from our final model
    
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in train_ml_estimator(self, estimator_names, scoring, X_df, y, feature_learning)
        672         # Use Case 1: Super straightforward: just train a single, non-optimized model
        673         if len(estimator_names) == 1 and self.optimize_final_model != True:
    --> 674             trained_final_model = self.fit_single_pipeline(X_df, y, estimator_names[0], feature_learning=feature_learning)
        675 
        676         # Use Case 2: Compare a bunch of models, but don't optimize any of them
    
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in fit_single_pipeline(self, X_df, y, model_name, feature_learning)
        554 
        555         self.trained_final_model = ppl
    --> 556         self.print_results(model_name)
        557 
        558         return ppl
    
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in print_results(self, model_name)
        578 
        579         elif self.ml_for_analytics and model_name in ['RandomForestClassifier', 'RandomForestRegressor', 'XGBClassifier', 'XGBRegressor', 'GradientBoostingRegressor', 'GradientBoostingClassifier', 'LGBMRegressor', 'LGBMClassifier']:
    --> 580             self._print_ml_analytics_results_random_forest()
        581 
        582 
    
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in _print_ml_analytics_results_random_forest(self)
        938         # XGB's Classifier has a proper .feature_importances_ property, while the XGBRegressor does not.
        939         if final_model_obj.model_name in ['XGBRegressor', 'XGBClassifier']:
    --> 940             self._get_xgb_feat_importances(final_model_obj.model)
        941 
        942         else:
    
    /home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in _get_xgb_feat_importances(self, clf)
        893             # Handles case when clf has been created by calling xgb.train.
        894             # Thus, clf is an instance of xgb.Booster.
    --> 895             fscore = clf.get_fscore()
        896 
        897         trained_feature_names = self._get_trained_feature_names()
    
    AttributeError: 'XGBRegressor' object has no attribute 'get_fscore'
    
    opened by volker48 9
  • Error on install - Windows 10

    I have progressed through the install, although I got stuck for a while because I did not have Visual C++ 14 installed. I now get the following error at the end of the install. Can you please help? What more info do you need?

    Command "c:\users\username\appdata\local\programs\python\python35-32\python.exe -u -c "import setuptools, tokenize;file='C:\Users\username\AppData\Local\Temp\pip-build-j_5l4z6_\scipy\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\username\AppData\Local\Temp\pip-5r95bpz0-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\username\AppData\Local\Temp\pip-build-j_5l4z6_\scipy\

    opened by bitsam 9
  • 'FinalModelATC' object has no attribute 'feature_ranges'

    I'm trying to run your "Getting Started" example on the numerai training data and getting the following error:

    AttributeError                            Traceback (most recent call last)
    <ipython-input-39-aab5c9ba7e0f> in <module>()
          6 # Can pass in type_of_estimator='regressor' as well
          7 
    ----> 8 ml_predictor.train(df_dict)
          9 # Wait for the machine to learn all the complex and beautiful patterns in your data...
         10 
    
    /Users/alex/anaconda/envs/dsi/lib/python2.7/site-packages/auto_ml/predictor.pyc in train(***failed resolving arguments***)
        553 
        554 
    --> 555         self.perform_grid_search_by_model_names(estimator_names, scoring, X_df, y)
        556 
        557         # If we ran GridSearchCV, we will have to pick the best model
    
    /Users/alex/anaconda/envs/dsi/lib/python2.7/site-packages/auto_ml/predictor.pyc in perform_grid_search_by_model_names(self, estimator_names, scoring, X_df, y)
        671 
        672             if self.ml_for_analytics and model_name in ('LogisticRegression', 'RidgeClassifier', 'LinearRegression', 'Ridge'):
    --> 673                 self._print_ml_analytics_results_regression()
        674             elif self.ml_for_analytics and model_name in ['RandomForestClassifier', 'RandomForestRegressor', 'XGBClassifier', 'XGBRegressor', 'GradientBoostingRegressor', 'GradientBoostingClassifier']:
        675                 self._print_ml_analytics_results_random_forest()
    
    /Users/alex/anaconda/envs/dsi/lib/python2.7/site-packages/auto_ml/predictor.pyc in _print_ml_analytics_results_regression(self)
        770             trained_coefficients = self.trained_pipeline.named_steps['final_model'].model.coef_
        771 
    --> 772         feature_ranges = self.trained_pipeline.named_steps['final_model'].feature_ranges
        773 
        774         # TODO(PRESTON): readability. Can probably do this in a single zip statement.
    
    AttributeError: 'FinalModelATC' object has no attribute 'feature_ranges'
    

    Are you familiar with this type of issue?

    opened by akodate 9
  • far future: take in dataframes or other sparse data structures directly

    right now taking in python dictionaries is awesome for its flexibility and ease of development, but is killing us on memory, even if it is a super sparse data structure.

    one workaround we could do for this is described in https://github.com/ClimbsRocks/auto_ml/issues/40, though that feels fairly hacky. taking in a DataFrame seems much more obvious.

    opened by ClimbsRocks 9
  • Fix XGBoost error

    It appears that the current XGBoost package that is installed with pip does not have the feature_importance_ attribute. Therefore if you install the xgboost package using pip install xgboost you will be unable to conduct feature extraction from the XGBClassifier or the XGBRegressor object.

    I made a workaround after trying to check for feature_importance_ because if the newest version of XGBoost is installed from source then feature_importance_ works fine so it will likely exist in future versions. But currently the version available by pip install xgboost does not provide the attribute.

    opened by a-holm 7
  • Got an unexpected keyword argument 'max_iter' in SGDClassifier

    Fail to run the example in README.

    from auto_ml import Predictor
    from auto_ml.utils import get_boston_dataset
    
    df_train, df_test = get_boston_dataset()
    
    column_descriptions = {
        'MEDV': 'output'
        , 'CHAS': 'categorical'
    }
    
    ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
    
    ml_predictor.train(df_train)
    
    ml_predictor.score(df_test, df_test.MEDV)
    

    And here is the error message.

    ➜ python ./automl_demo.py
    Using TensorFlow backend.
    Welcome to auto_ml! We're about to go through and make sense of your data using machine learning, and give you a production-ready pipeline to get predictions with.
    
    If you have any issues, or new feature ideas, let us know at https://github.com/ClimbsRocks/auto_ml
    Now using the model training_params that you passed in:
    {}
    After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
    {'presort': False, 'warm_start': True, 'learning_rate': 0.1}
    Traceback (most recent call last):
      File "./automl_demo.py", line 13, in <module>
        ml_predictor.train(df_train)
      File "/usr/local/lib/python2.7/site-packages/auto_ml/predictor.py", line 611, in train
        X_df = self.fit_transformation_pipeline(X_df, y, estimator_names)
      File "/usr/local/lib/python2.7/site-packages/auto_ml/predictor.py", line 834, in fit_transformation_pipeline
        ppl = self._construct_pipeline(model_name=model_names[0], keep_cat_features=self.keep_cat_features)
      File "/usr/local/lib/python2.7/site-packages/auto_ml/predictor.py", line 206, in _construct_pipeline
        final_model = utils_models.get_model_from_name(model_name, training_params=params)
      File "/usr/local/lib/python2.7/site-packages/auto_ml/utils_models.py", line 129, in get_model_from_name
        'SGDClassifier': SGDClassifier(max_iter=1000, tol=0.001),
    TypeError: __init__() got an unexpected keyword argument 'max_iter'
    
    opened by tobegit3hub 7
  • get bad score running the sample code

    1. I configured everything, ran the whole script, and got a negative score on the Boston dataset. Is a bad score normal here because this is just a sample?

    2. Is the default to use only gradient boosting for classification and regression, rather than automatically choosing the best model for training and prediction?

    opened by Aun0124 0
  • pip install automl gets stuck after installing multiprocess-0.70.7

    The following is the last snippet in the pip install logs before the installation gets stuck indefinitely:

    Collecting multiprocess>=0.70.7
      Using cached multiprocess-0.70.11-py3-none-any.whl (98 kB)
      Using cached multiprocess-0.70.10.zip (2.4 MB)
      Using cached multiprocess-0.70.9.tar.gz (1.6 MB)
      Using cached multiprocess-0.70.8.tar.gz (1.6 MB)
      Using cached multiprocess-0.70.7.tar.gz (1.4 MB)

    Even without using the cached copies, the installation gets stuck at this point.

    Update: One possible reason for this error could be that \sklearn_deap2-0.2.2-py3.8\evolutionary_search\cv.py incorrectly tries to import check_scoring in the following manner:

    from sklearn.metrics.scorer import check_scoring

    instead of this:

    from sklearn.metrics import check_scoring

    opened by akshatpv 2
  • docs: fix simple typo, puncutation -> punctuation

    There is a small typo in docs/source/formatting_data.rst.

    Should read punctuation rather than puncutation.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 1
  • Update DataFrameVectorizer.py

    DeprecationWarning: The module is deprecated in version 0.21 and removed in version 0.23. This module was removed in the latest scikit-learn version. Please remove this module.

    opened by karthikreddykuna 1
Releases(v2.7.0)
  • v2.7.0(Sep 12, 2017)

    Ensembling's back for its alpha release, evolutionary algorithms are doing our hyperparameter search now, and we've handled a bunch of dependency updates and made a bunch of smaller performance tweaks.

  • v2.4.0(Jul 14, 2017)

    Using quantile regression, we can now return prediction intervals.

    Another minor change is adding in a column of absolute changes for feature_responses

  • v2.3.5(Jul 9, 2017)

  • v2.2.1(Jun 13, 2017)

    Avoids double training deep learning models, changes how we sort and order features for analytics reporting, and adds a new _all_small_categories category to categorical ensembling.

  • v2.2.0(Jun 6, 2017)

  • 2.1.5(May 18, 2017)

  • 2.1.2(May 3, 2017)

  • 2.1(Apr 19, 2017)

    Feature learning and categorical ensembling are really cool features that each get us 2-5% accuracy gains!

    For full info, check the docs.

  • v2.0.0(Apr 4, 2017)

    Enough incremental improvements have added up that we're now ready to mark a 2.0 release!

    Part of the progress also means deprecating a few unused features that were adding unnecessary complexity and preventing us from implementing new features like ensembling properly.

    New changes for the 2.0 release:

    • Refactored and cleaned up code. Ensembling should now be much easier to add in, and in a way that's fast enough to be used in production (getting predictions from 10 models should take less than 10x as long as getting predictions from 1 model)
    • Deprecated compute_power
    • Deprecated several methods for grid searching over transformation_pipeline hyperparameters (different methods for feature selection, whether or not to do feature scaling, etc.). We just directly made a decision to prioritize the final model hyperparameter search.
    • Deprecated the current implementation of ensembling. It was implemented in such a way that it was not quick enough to make predictions in prod, and thus, did not meet the primary use cases of this project. Part of removing it allows us to reimplement ensembling in a way that is prod-ready.
    • Deprecated X_test and y_test, except for working with calibrate_final_model.
    • Added better documentation on features that were in silent alpha release previously.
    • Improved test coverage!
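One plausible reason an ensemble of 10 models can cost less than 10x a single model is that the feature transformation runs once per row and its output is shared by every model. The sketch below illustrates that pattern only. The `transform` function, the stand-in models, and the column names are hypothetical, and this is not auto_ml's actual ensembling code.

```python
def transform(row):
    """Hypothetical shared feature transformation, run once per row."""
    return [row["RM"], row["LSTAT"]]


def ensemble_predict(models, row):
    """Average the predictions of several models for a single dictionary."""
    features = transform(row)  # computed once, reused by every model
    predictions = [model(features) for model in models]
    return sum(predictions) / len(predictions)


# Stand-ins for trained models: each maps a feature list to a number
models = [
    lambda f: 4.0 * f[0],
    lambda f: 4.0 * f[0] + f[1],
    lambda f: 4.0 * f[0] - f[1],
]

prediction = ensemble_predict(models, {"RM": 6.0, "LSTAT": 4.0})
print(prediction)
```

With this structure, only the per-model prediction step scales with the number of models, while the transformation cost stays fixed.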

    Major changes since the 1.0 release:

    • Integrations for deep learning (using TensorFlow and Keras)
    • Integration of Microsoft's LightGBM, which appears to be an improvement over XGBoost
    • Quite a bit more user logging, warning, and input validation/input cleaning
    • Quite a few edge case bug fixes and minor performance improvements
    • Fully automated test suite with decent test coverage!
    • Better documentation
    • Support for pandas DataFrames, which are much more space-efficient than lists of dictionaries
  • v1.12.2(Mar 16, 2017)

    This will be our final release before v2.

    Includes many recent changes: deep learning with Keras/TensorFlow, more efficient hyperparameter optimization, Microsoft's LightGBM, more advanced logging for scoring, and quite a few minor usability improvements (like improved logging when input is not as expected).

  • v1.3(Oct 11, 2016)

Owner
Preston Parry
Rock Climber, Biker, Community Builder, Teacher, data scientist & machine learning geek