[UNMAINTAINED] Automated machine learning for analytics & production

Preston Parry

Last update: Jan 2, 2023

Related tags

Deep Learning python data-science machine-learning deep-learning analytics tensorflow scikit-learn keras artificial-intelligence xgboost hyperparameter-optimization lightgbm machine-learning-library deeplearning production-ready feature-engineering machine-learning-pipelines automl gradient-boosting automated-machine-learning

Overview

auto_ml

Automated machine learning for production and analytics

Installation

pip install auto_ml

Getting started

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset

df_train, df_test = get_boston_dataset()

column_descriptions = {
    'MEDV': 'output',
    'CHAS': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)

ml_predictor.train(df_train)

ml_predictor.score(df_test, df_test.MEDV)

Show off some more features!

auto_ml is designed for production. Here's an example that includes serializing and loading the trained model, then getting predictions on single dictionaries, roughly the process you'd likely follow to deploy the trained model.

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset
from auto_ml.utils_models import load_ml_model

# Load data
df_train, df_test = get_boston_dataset()

# Tell auto_ml which column is 'output'
# Also note columns that aren't purely numerical
# Examples include ['nlp', 'date', 'categorical', 'ignore']
column_descriptions = {
    'MEDV': 'output',
    'CHAS': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)

ml_predictor.train(df_train)

# Score the model on test data
test_score = ml_predictor.score(df_test, df_test.MEDV)

# auto_ml is specifically tuned for running in production
# It can get predictions on an individual row (passed in as a dictionary)
# A single prediction like this takes ~1 millisecond
# Here we will demonstrate saving the trained model, and loading it again
file_name = ml_predictor.save()

trained_model = load_ml_model(file_name)

# .predict and .predict_proba take in either:
# A pandas DataFrame
# A list of dictionaries
# A single dictionary (optimized for speed in production environments)
predictions = trained_model.predict(df_test)
print(predictions)
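The save, load, and predict-on-one-row flow can also be sketched without auto_ml installed. This is only an illustration under stated assumptions: StandInModel, its JSON file format, and load_stand_in_model are hypothetical names invented for this sketch and are not part of auto_ml, which has its own save() and load_ml_model(). It shows how one predict entry point can accept either a single dictionary or a list of dictionaries.

```python
import json
import os
import tempfile


class StandInModel:
    """Hypothetical stand-in for a trained predictor (not auto_ml's class)."""

    def __init__(self, weights, intercept=0.0):
        self.weights = dict(weights)
        self.intercept = intercept

    def _predict_row(self, row):
        # Missing features are treated as 0, so sparse rows are fine.
        return self.intercept + sum(
            weight * row.get(name, 0.0) for name, weight in self.weights.items()
        )

    def predict(self, data):
        # A single dictionary returns one number; a list returns a list.
        if isinstance(data, dict):
            return self._predict_row(data)
        return [self._predict_row(row) for row in data]

    def save(self, file_name):
        # Assumption: plain JSON is enough for this toy model.
        with open(file_name, "w") as handle:
            json.dump({"weights": self.weights, "intercept": self.intercept}, handle)
        return file_name


def load_stand_in_model(file_name):
    with open(file_name) as handle:
        params = json.load(handle)
    return StandInModel(params["weights"], params["intercept"])


model = StandInModel({"RM": 2.0, "CRIM": -0.5}, intercept=1.0)
file_name = model.save(os.path.join(tempfile.mkdtemp(), "stand_in_model.json"))
loaded = load_stand_in_model(file_name)

# One row passed as a dictionary
single = loaded.predict({"RM": 6.0, "CRIM": 2.0})
# Several rows passed as a list of dictionaries
batch = loaded.predict([{"RM": 6.0}, {"RM": 5.0, "CRIM": 4.0}])
```

In a live service, the load step would typically run once at startup, and only the single-dictionary predict call would run per request.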

3rd Party Packages- Deep Learning with TensorFlow & Keras, XGBoost, LightGBM, CatBoost

auto_ml has all of these awesome libraries integrated! Generally, just pass one of them in for model_names:

ml_predictor.train(data, model_names=['DeepLearningClassifier'])

Available options are

DeepLearningClassifier and DeepLearningRegressor
XGBClassifier and XGBRegressor
LGBMClassifier and LGBMRegressor
CatBoostClassifier and CatBoostRegressor

All of these projects are ready for production: each makes a single prediction in roughly 1 millisecond, and each can be serialized to disk and loaded into a new environment after training.

Depending on your machine, they can occasionally be difficult to install, so they are not included in auto_ml's default installation. You are responsible for installing them yourself. auto_ml will run fine without them installed (we check what's installed before choosing which algorithm to use).
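If you want to check for yourself which optional packages are importable before choosing model_names, a minimal sketch with the standard library could look like the following. The package-to-model mapping is an assumption made for this example (in particular, the deep learning models are tied to tensorflow here, although they also need Keras), and available_model_names is a hypothetical helper, not an auto_ml function.

```python
import importlib.util

# Assumed mapping from importable package name to the model names listed above.
OPTIONAL_BACKENDS = {
    "tensorflow": ["DeepLearningClassifier", "DeepLearningRegressor"],
    "xgboost": ["XGBClassifier", "XGBRegressor"],
    "lightgbm": ["LGBMClassifier", "LGBMRegressor"],
    "catboost": ["CatBoostClassifier", "CatBoostRegressor"],
}


def available_model_names(backends=OPTIONAL_BACKENDS):
    """Return the model names whose backing package can be imported."""
    names = []
    for package, model_names in backends.items():
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(package) is not None:
            names.extend(model_names)
    return names


print(available_model_names())
```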

Feature Responses

Get linear-model-esque interpretations from non-linear models. See the docs for more information and caveats.

Classification

Binary and multiclass classification are both supported. Note that for now, labels must be integers (0 and 1 for binary classification). auto_ml will automatically detect whether it is a binary or multiclass classification problem; you just have to pass in type_of_estimator='classifier':

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)
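Because labels must be integers, string labels such as 'spam' and 'ham' need to be converted before training. The sketch below shows one way to do that for a list of dictionaries. encode_labels is a hypothetical helper written for this example, not part of auto_ml, and the rows are made up.

```python
def encode_labels(rows, label_column):
    """Return copies of rows with integer labels, plus the label-to-integer mapping."""
    classes = sorted({row[label_column] for row in rows})
    mapping = {label: index for index, label in enumerate(classes)}
    # Build new dictionaries so the caller's rows are left untouched.
    encoded = [{**row, label_column: mapping[row[label_column]]} for row in rows]
    return encoded, mapping


rows = [
    {"text": "win a prize now", "spam": "spam"},
    {"text": "see you at lunch", "spam": "ham"},
    {"text": "free entry today", "spam": "spam"},
]
encoded_rows, label_mapping = encode_labels(rows, "spam")
```

Keep the mapping around so that integer predictions can be translated back into the original labels.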

Feature Learning

Also known as "finally found a way to make this deep learning stuff useful for my business". Deep Learning is great at learning important features from your data. But the way it turns these learned features into a final prediction is relatively basic. Gradient boosting is great at turning features into accurate predictions, but it doesn't do any feature learning.

In auto_ml, you can now automatically use both types of models for what they're great at. If you pass feature_learning=True, fl_data=some_dataframe to .train(), we will do exactly that: train a deep learning model on your fl_data. We won't ask it for predictions (the standard stacking approach). Instead, we'll use its penultimate layer to get its 10 most useful features. Then we'll train a gradient boosted model (or any other model of your choice) on those features plus all the original features.

Across some problems, we've witnessed this lead to a 5% gain in accuracy, while still making predictions in 1-4 milliseconds, depending on model complexity.

ml_predictor.train(df_train, feature_learning=True, fl_data=df_fl_data)

This feature only supports regression and binary classification currently. The rest of auto_ml supports multiclass classification.
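To make the idea concrete, here is a toy sketch of "penultimate-layer activations plus the original features". It is not auto_ml's implementation: the single hidden layer, the made-up weights, and the helper names are all assumptions for illustration, and a real network would learn its weights from fl_data.

```python
def relu(value):
    return max(0.0, value)


def penultimate_features(features, hidden_weights, hidden_biases):
    """Activations of the last hidden layer, i.e. the learned features."""
    return [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]


def add_learned_features(features, hidden_weights, hidden_biases):
    """Original features followed by the learned ones, ready for a second model."""
    return list(features) + penultimate_features(features, hidden_weights, hidden_biases)


# Made-up weights for a hidden layer with two units; a real network would learn these.
hidden_weights = [[0.5, -1.0], [1.0, 1.0]]
hidden_biases = [0.0, 0.5]

augmented = add_learned_features([1.0, 2.0], hidden_weights, hidden_biases)
```

The second model (gradient boosting, for example) would then be trained on rows shaped like augmented rather than on the network's final prediction.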

Categorical Ensembling

Ever wanted to train one model for every store/customer, but didn't want to maintain hundreds of thousands of independent models? With ml_predictor.train_categorical_ensemble(), we will handle that for you. You'll still have just one consistent API, ml_predictor.predict(data), but behind this single API will be one model for each category you included in your training data.

Just tell us which column holds the category you want to split on, and we'll handle the rest. As always, saving the model, loading it in a different environment, and getting speedy predictions live in production is baked right in.

ml_predictor.train_categorical_ensemble(df_train, categorical_column='store_name')
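The "one model per category behind a single predict call" idea can be sketched in plain Python. Everything here is an assumption for illustration: MeanModel is a trivial stand-in for a real estimator, CategoricalEnsembleSketch is a hypothetical class, and the fallback to a model trained on all rows for unseen categories is a design choice of this sketch rather than a description of auto_ml's behavior.

```python
from collections import defaultdict
from statistics import mean


class MeanModel:
    """Trivial stand-in estimator: always predicts the mean of its training targets."""

    def fit(self, targets):
        self.value = mean(targets)
        return self

    def predict(self, row):
        return self.value


class CategoricalEnsembleSketch:
    def __init__(self, categorical_column, output_column):
        self.categorical_column = categorical_column
        self.output_column = output_column
        self.models = {}
        self.fallback = None

    def train(self, rows):
        # Group the targets by category, then fit one model per group.
        grouped = defaultdict(list)
        for row in rows:
            grouped[row[self.categorical_column]].append(row[self.output_column])
        self.models = {cat: MeanModel().fit(ys) for cat, ys in grouped.items()}
        # Assumption: unseen categories fall back to a model trained on all rows.
        self.fallback = MeanModel().fit([row[self.output_column] for row in rows])
        return self

    def predict(self, row):
        model = self.models.get(row.get(self.categorical_column), self.fallback)
        return model.predict(row)


rows = [
    {"store_name": "A", "sales": 10},
    {"store_name": "A", "sales": 20},
    {"store_name": "B", "sales": 100},
]
ensemble = CategoricalEnsembleSketch("store_name", "sales").train(rows)
prediction_a = ensemble.predict({"store_name": "A"})
prediction_unseen = ensemble.predict({"store_name": "C"})
```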

More details available in the docs

http://auto-ml.readthedocs.io/en/latest/

Advice

Before you go any further, try running the code. Load up some data (either a DataFrame, or a list of dictionaries, where each dictionary is a row of data). Make a column_descriptions dictionary that tells us which attribute name in each row represents the value we're trying to predict. Pass all that into auto_ml, and see what happens!

Everything else in these docs assumes you have done at least the above. Start there and everything else will build on top. But this part gets you the output you're probably interested in, without unnecessary complexity.
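As a starting point, the data and column_descriptions described above could be put together like this for a list of dictionaries. The rows are made up, and build_column_descriptions is a hypothetical helper for this sketch (it simply flags string-valued columns as 'categorical'); it is not part of auto_ml.

```python
def build_column_descriptions(rows, output_column):
    """Mark the output column, and flag string-valued columns as categorical."""
    descriptions = {output_column: "output"}
    for row in rows:
        for name, value in row.items():
            if name != output_column and isinstance(value, str):
                descriptions[name] = "categorical"
    return descriptions


# Each dictionary is one row of data.
rows = [
    {"MEDV": 24.0, "CHAS": "0", "RM": 6.5},
    {"MEDV": 21.6, "CHAS": "1", "RM": 6.4},
]
column_descriptions = build_column_descriptions(rows, "MEDV")
print(column_descriptions)
```

The resulting dictionary has the same shape as the one passed to Predictor in the Getting started example.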

Docs

The full docs are available at https://auto_ml.readthedocs.io. Again though, I'd strongly recommend running this on an actual dataset before referencing the docs any further.

What this project does

auto_ml automates the whole machine learning process, making it super easy to use both for analytics and for getting real-time predictions in production.

A quick overview of buzzwords. This project automates:

Analytics (pass in data, and auto_ml will tell you the relationship of each variable to what it is you're trying to predict).
Feature Engineering (particularly around dates, and NLP).
Robust Scaling (turning all values into their scaled versions between the range of 0 and 1, in a way that is robust to outliers, and works with sparse data).
Feature Selection (picking only the features that actually prove useful).
Data formatting (turning a DataFrame or a list of dictionaries into a sparse matrix, one-hot encoding categorical variables, taking the natural log of y for regression problems, etc).
Model Selection (which model works best for your problem- we try roughly a dozen apiece for classification and regression problems, including favorites like XGBoost if it's installed on your machine).
Hyperparameter Optimization (what hyperparameters work best for that model).
Big Data (feed it lots of data- it's fairly efficient with resources).
Unicorns (you could conceivably train it to predict what is a unicorn and what is not).
Ice Cream (mmm, tasty...).
Hugs (this makes it much easier to do your job, hopefully leaving you more time to hug those you care about).
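Two of the steps in the list above, robust scaling and one-hot encoding, can be illustrated with a short standalone sketch. This is not auto_ml's code: the 5th and 95th percentile cutoffs, the 'column=value' naming, and both function names are assumptions chosen for this example.

```python
def robust_scale(values, lower_pct=5, upper_pct=95):
    """Scale values into [0, 1] using percentiles instead of min and max."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[last * lower_pct // 100]
    high = ordered[last * upper_pct // 100]
    span = (high - low) or 1.0  # avoid dividing by zero for constant columns
    # Anything beyond the chosen percentiles is clipped to the ends of the range.
    return [min(1.0, max(0.0, (value - low) / span)) for value in values]


def one_hot_encode(rows, categorical_columns):
    """Replace each categorical value with a 'column=value' indicator set to 1."""
    encoded_rows = []
    for row in rows:
        encoded = {}
        for name, value in row.items():
            if name in categorical_columns:
                encoded["{}={}".format(name, value)] = 1
            else:
                encoded[name] = value
        encoded_rows.append(encoded)
    return encoded_rows


# One extreme outlier does not squash the other values toward 0.
scaled = robust_scale(list(range(100)) + [10000])
encoded = one_hot_encode([{"CHAS": 1, "RM": 6.5}, {"CHAS": 0, "RM": 5.9}], {"CHAS"})
```

With plain min-max scaling, the outlier of 10000 would push every other value below 0.01. Here the bulk of the data still spreads across the full range.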

Running the tests

If you've cloned the source code and are making any changes (highly encouraged!), or just want to make sure everything works in your environment, run nosetests -v tests.

CI is also set up, so if you're developing on this, you can just open a PR, and the tests will run automatically on Travis-CI.

The tests are relatively comprehensive, though as with everything with auto_ml, I happily welcome your contributions here!

Comments

Comparison with other automatic ML libraries?

First, thank you very much for the hard work and awesome project. I think it will get a lot of use in my workflow.

I was surveying the landscape of automatic ML solutions, and found your package along with tpot and auto-sklearn. I am trying to figure out what kind of strengths and weaknesses all these packages have. Would you mind discussing what auto_ml does differently and/or better?

Thanks again.

opened by sergeyf 12

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape

When I train with DeepLearningRegressor with a 5k dataset everything works fine but when I do it on 50k dataset I get this error.

Caused by op u'dense_1/random_normal/RandomStandardNormal', defined at:
  File "salary_predict.py", line 38, in <module>
    ml_predictor.train(df_train, model_names=['DeepLearningRegressor'])
  File "/home/ubuntu/deeparted/auto_ml/predictor.py", line 471, in train
    self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
  File "/home/ubuntu/deeparted/auto_ml/predictor.py", line 674, in train_ml_estimator
    trained_final_model = self.fit_single_pipeline(X_df, y, estimator_names[0], feature_learning=feature_learning)
  File "/home/ubuntu/deeparted/auto_ml/predictor.py", line 548, in fit_single_pipeline
    ppl.fit(X_df, y)
  File "/home/ubuntu/deeparted/auto_ml/utils_model_training.py", line 88, in fit
    self.model.fit(X_fit, y, callbacks=[early_stopping])
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/wrappers/scikit_learn.py", line 138, in fit
    self.model = self.build_fn(**self.filter_sk_params(self.build_fn))
  File "/home/ubuntu/deeparted/auto_ml/utils_models.py", line 559, in make_deep_learning_model
    model.add(Dense(hidden_layers[0], input_dim=num_cols, kernel_initializer='normal', kernel_regularizer=regularizers.l2(0.01)))
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/models.py", line 433, in add
    layer(x)
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 558, in __call__
    self.build(input_shapes[0])
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/layers/core.py", line 827, in build
    constraint=self.kernel_constraint)
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 391, in add_weight
    weight = K.variable(initializer(shape), dtype=dtype, name=name)
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/initializers.py", line 75, in __call__
    dtype=dtype, seed=self.seed)
  File "/home/ubuntu/deeparted/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3356, in random_normal
    dtype=dtype, seed=seed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 76, in random_normal
    shape_tensor, dtype, seed=seed1, seed2=seed2)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 220, in _random_standard_normal
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2514, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[47302,1] [[Node: dense_1/random_normal/RandomStandardNormal = RandomStandardNormalT=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5687716, _device="/job:localhost/replica:0/task:0/gpu:0"]]

TensorFlow version: 1.1.0, CUDA: 8.0, cuDNN: 5.1.10

System config: I'm using a P2 instance (p2.8xlarge): 8 NVIDIA K80 GPUs (192 GB), 64 vCPUs, 732 GiB of host memory

Training: batch_size: 50, dataset size: 50k, number of columns: 4 (1 output, 2 categorical, 1 float)

Github Issues: https://github.com/tensorflow/tensorflow/issues/4735 https://github.com/tensorflow/tensorflow/issues/1355 and many more on github

None of this solved the issue. Can anyone help me on this.

opened by sameerpallav 12

User validation on fl_data

Do you have an example of using feature learning? I assumed I could just do feature_learning on the training dataset but I get an error like so when running it on the boston dataset:

ml_predictor.train(df_train, feature_learning=True, fl_data=df_train)


Traceback (most recent call last):
  File "/home/data/.local/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'MEDV'

opened by calz1 11

TypeError: cannot perform reduce with flexible type OR AttributeError: 'Predictor' object has no attribute 'grid_search_pipelines'

Very cool package!

I am trying out auto_ml with this dataset on SMS spam. I added a header row to the file to give it column names and then do the following:

import pandas as p
import dill
from sklearn.model_selection import train_test_split
from auto_ml import Predictor

df = p.read_table('/home/data/auto_ml/sms.txt')
df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)

column_descriptions = {
    'spam': 'output'
    , 'text': 'nlp'
}

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)
ml_predictor.train(df_train)

You can see it sort of works because it is telling me about feature importance but then gives :

....
nlp_text_txt: 0.0373
nlp_text_free: 0.0441
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/auto_ml/predictor.py", line 597, in train
    if len(self.grid_search_pipelines) > 1:
AttributeError: 'Predictor' object has no attribute 'grid_search_pipelines'

Originally I was trying: ml_predictor.train(df_train,ml_for_analytics=True)

and got:

test_score = ml_predictor.score(df_test, df_test.spam)
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/auto_ml/predictor.py", line 1014, in score
    score, probas = self._scorer.score(self.trained_pipeline, X_test, y_test, advanced_scoring=advanced_scoring)
  File "/usr/local/lib/python2.7/dist-packages/auto_ml/utils_scoring.py", line 268, in score
    score = self.scoring_func(y, predictions)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 1884, in brier_score_loss
    pos_label = y_true.max()
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
TypeError: cannot perform reduce with flexible type

opened by calz1 11

error during LGBM predict_proba

Hi all..

After long hours of training my model with LightGBM, I ran predict_proba and at first hit the data_rate_limit in Jupyter. I raised that limit and had to train the model again, but this time I ran into another error:

Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

can someone help me please? thanks

opened by vkocaman 10

AttributeError: 'XGBRegressor' object has no attribute 'get_fscore'

Testing out auto_ml with XGBoost and ran into this issue. This is against a fresh clone of the XGBoost repository so it looks like their API changed.

predictor.train(x_train, verbose=True, model_names=['XGBRegressor'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in _get_xgb_feat_importances(self, clf)
    890             # xgb.XGBClassifier.fit() or xgb.XGBRegressor().fit()
--> 891             fscore = clf.booster().get_fscore()
    892         except:

TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-37-eafdc24b187b> in <module>()
----> 1 predictor.train(x_train, verbose=True, model_names=['XGBRegressor'])

/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in train(self, raw_training_data, user_input_func, optimize_final_model, write_gs_param_results_to_file, perform_feature_selection, verbose, X_test, y_test, ml_for_analytics, take_log_of_y, model_names, perform_feature_scaling, calibrate_final_model, _scorer, scoring, verify_features, training_params, grid_search_params, compare_all_models, cv, feature_learning, fl_data)
    469 
    470         # This is our main logic for how we train the final model
--> 471         self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
    472 
    473         # Calibrate the probability predictions from our final model

/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in train_ml_estimator(self, estimator_names, scoring, X_df, y, feature_learning)
    672         # Use Case 1: Super straightforward: just train a single, non-optimized model
    673         if len(estimator_names) == 1 and self.optimize_final_model != True:
--> 674             trained_final_model = self.fit_single_pipeline(X_df, y, estimator_names[0], feature_learning=feature_learning)
    675 
    676         # Use Case 2: Compare a bunch of models, but don't optimize any of them

/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in fit_single_pipeline(self, X_df, y, model_name, feature_learning)
    554 
    555         self.trained_final_model = ppl
--> 556         self.print_results(model_name)
    557 
    558         return ppl

/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in print_results(self, model_name)
    578 
    579         elif self.ml_for_analytics and model_name in ['RandomForestClassifier', 'RandomForestRegressor', 'XGBClassifier', 'XGBRegressor', 'GradientBoostingRegressor', 'GradientBoostingClassifier', 'LGBMRegressor', 'LGBMClassifier']:
--> 580             self._print_ml_analytics_results_random_forest()
    581 
    582 

/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in _print_ml_analytics_results_random_forest(self)
    938         # XGB's Classifier has a proper .feature_importances_ property, while the XGBRegressor does not.
    939         if final_model_obj.model_name in ['XGBRegressor', 'XGBClassifier']:
--> 940             self._get_xgb_feat_importances(final_model_obj.model)
    941 
    942         else:

/home/ubuntu/venv/lib/python3.5/site-packages/auto_ml/predictor.py in _get_xgb_feat_importances(self, clf)
    893             # Handles case when clf has been created by calling xgb.train.
    894             # Thus, clf is an instance of xgb.Booster.
--> 895             fscore = clf.get_fscore()
    896 
    897         trained_feature_names = self._get_trained_feature_names()

AttributeError: 'XGBRegressor' object has no attribute 'get_fscore'

opened by volker48 9

Error on install - Windows 10

I have progressed through the install. although I got stuck with not having visual C++ 14 installed. I now get the following error at the end of the install. can you please help. What more info do you need.

Command "c:\users\username\appdata\local\programs\python\python35-32\python.exe -u -c "import setuptools, tokenize;file='C:\Users\username\AppData\Local\Temp\pip-build-j_5l4z6_\scipy\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\username\AppData\Local\Temp\pip-5r95bpz0-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\username\AppData\Local\Temp\pip-build-j_5l4z6_\scipy\

opened by bitsam 9

'FinalModelATC' object has no attribute 'feature_ranges'

I'm trying to run your "Getting Started" example on the numerai training data and getting the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-39-aab5c9ba7e0f> in <module>()
      6 # Can pass in type_of_estimator='regressor' as well
      7 
----> 8 ml_predictor.train(df_dict)
      9 # Wait for the machine to learn all the complex and beautiful patterns in your data...
     10 

/Users/alex/anaconda/envs/dsi/lib/python2.7/site-packages/auto_ml/predictor.pyc in train(***failed resolving arguments***)
    553 
    554 
--> 555         self.perform_grid_search_by_model_names(estimator_names, scoring, X_df, y)
    556 
    557         # If we ran GridSearchCV, we will have to pick the best model

/Users/alex/anaconda/envs/dsi/lib/python2.7/site-packages/auto_ml/predictor.pyc in perform_grid_search_by_model_names(self, estimator_names, scoring, X_df, y)
    671 
    672             if self.ml_for_analytics and model_name in ('LogisticRegression', 'RidgeClassifier', 'LinearRegression', 'Ridge'):
--> 673                 self._print_ml_analytics_results_regression()
    674             elif self.ml_for_analytics and model_name in ['RandomForestClassifier', 'RandomForestRegressor', 'XGBClassifier', 'XGBRegressor', 'GradientBoostingRegressor', 'GradientBoostingClassifier']:
    675                 self._print_ml_analytics_results_random_forest()

/Users/alex/anaconda/envs/dsi/lib/python2.7/site-packages/auto_ml/predictor.pyc in _print_ml_analytics_results_regression(self)
    770             trained_coefficients = self.trained_pipeline.named_steps['final_model'].model.coef_
    771 
--> 772         feature_ranges = self.trained_pipeline.named_steps['final_model'].feature_ranges
    773 
    774         # TODO(PRESTON): readability. Can probably do this in a single zip statement.

AttributeError: 'FinalModelATC' object has no attribute 'feature_ranges'

Are you familiar with this type of issue?

opened by akodate 9

far future: take in dataframes or other sparse data structures directly

right now taking in python dictionaries is awesome for its flexibility and ease of development, but is killing us on memory, even if it is a super sparse data structure.

one workaround we could do for this is described in https://github.com/ClimbsRocks/auto_ml/issues/40, though that feels fairly hacky. taking in a DataFrame seems much more obvious.

opened by ClimbsRocks 9

Fix XGBoost error

It appears that the current XGBoost package that is installed with pip does not have the feature_importances_ attribute. Therefore, if you install the xgboost package using pip install xgboost, you will be unable to conduct feature extraction from the XGBClassifier or the XGBRegressor object.

I made a workaround after trying to check for feature_importances_, because if the newest version of XGBoost is installed from source then feature_importances_ works fine, so it will likely exist in future versions. But currently the version available by pip install xgboost does not provide the attribute.
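A version-tolerant lookup along these lines can be sketched as follows. This is not the code from the pull request: get_feature_importances and the two stand-in classes are hypothetical, the "f0, f1, ..." feature names are an assumption, and the sketch assumes a fitted model that exposes either the scikit-learn style feature_importances_ attribute or a get_booster() method whose result has get_fscore().

```python
def get_feature_importances(model):
    """Return {feature_name: score} using whichever API the fitted model exposes."""
    # Preferred: the scikit-learn style attribute, when the installed version has it.
    importances = getattr(model, "feature_importances_", None)
    if importances is not None:
        return {"f{}".format(i): score for i, score in enumerate(importances)}
    # Fallback: ask the underlying booster for per-feature split counts.
    get_booster = getattr(model, "get_booster", None)
    booster = get_booster() if callable(get_booster) else model
    return dict(booster.get_fscore())


# Stand-in objects so the sketch runs without XGBoost installed.
class FakeBooster:
    def get_fscore(self):
        return {"f0": 12, "f2": 3}


class ModelWithoutAttribute:
    def get_booster(self):
        return FakeBooster()


class ModelWithAttribute:
    feature_importances_ = [0.8, 0.0, 0.2]


from_attribute = get_feature_importances(ModelWithAttribute())
from_booster = get_feature_importances(ModelWithoutAttribute())
```

Note that the two paths return different kinds of scores (normalized importances versus split counts), so they should not be compared across models.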

opened by a-holm 7

Got an unexpected keyword argument 'max_iter' in SGDClassifier

Fail to run the example in README.

from auto_ml import Predictor
from auto_ml.utils import get_boston_dataset

df_train, df_test = get_boston_dataset()

column_descriptions = {
    'MEDV': 'output'
    , 'CHAS': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)

ml_predictor.train(df_train)

ml_predictor.score(df_test, df_test.MEDV)

And here is the error message.

➜ python ./automl_demo.py
Using TensorFlow backend.
Welcome to auto_ml! We're about to go through and make sense of your data using machine learning, and give you a production-ready pipeline to get predictions with.

If you have any issues, or new feature ideas, let us know at https://github.com/ClimbsRocks/auto_ml
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'presort': False, 'warm_start': True, 'learning_rate': 0.1}
Traceback (most recent call last):
  File "./automl_demo.py", line 13, in <module>
    ml_predictor.train(df_train)
  File "/usr/local/lib/python2.7/site-packages/auto_ml/predictor.py", line 611, in train
    X_df = self.fit_transformation_pipeline(X_df, y, estimator_names)
  File "/usr/local/lib/python2.7/site-packages/auto_ml/predictor.py", line 834, in fit_transformation_pipeline
    ppl = self._construct_pipeline(model_name=model_names[0], keep_cat_features=self.keep_cat_features)
  File "/usr/local/lib/python2.7/site-packages/auto_ml/predictor.py", line 206, in _construct_pipeline
    final_model = utils_models.get_model_from_name(model_name, training_params=params)
  File "/usr/local/lib/python2.7/site-packages/auto_ml/utils_models.py", line 129, in get_model_from_name
    'SGDClassifier': SGDClassifier(max_iter=1000, tol=0.001),
TypeError: __init__() got an unexpected keyword argument 'max_iter'

opened by tobegit3hub 7

get bad score running the sample code

I configured everything, ran the whole script, and got a negative score on the Boston dataset. Is a bad score normal because this is only a sample?

Is the default to use only gradient boosting for classification and regression, rather than automatically choosing the best model for training and prediction?

opened by Aun0124 0

pip install automl gets stuck after installing multiprocess-0.70.7

The following is the last snippet in the pip install logs before the installation gets stuck indefinitely:

Collecting multiprocess>=0.70.7
  Using cached multiprocess-0.70.11-py3-none-any.whl (98 kB)
  Using cached multiprocess-0.70.10.zip (2.4 MB)
  Using cached multiprocess-0.70.9.tar.gz (1.6 MB)
  Using cached multiprocess-0.70.8.tar.gz (1.6 MB)
  Using cached multiprocess-0.70.7.tar.gz (1.4 MB)

Even without using the cached copies, the installation gets stuck at this point.

Update: One possible reason for this error could be that \sklearn_deap2-0.2.2-py3.8\evolutionary_search\cv.py incorrectly tries to import check_scoring in the following manner:

from sklearn.metrics.scorer import check_scoring

instead of this:

from sklearn.metrics import check_scoring
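Code that has to run against both old and new scikit-learn versions could try the newer import path first and fall back to the older one. The sketch below also falls back to None when scikit-learn is not installed at all, which is an assumption added so the snippet runs anywhere.

```python
try:
    # Import path suggested above for current scikit-learn versions.
    from sklearn.metrics import check_scoring
except ImportError:
    try:
        # Older location used by evolutionary_search/cv.py.
        from sklearn.metrics.scorer import check_scoring
    except ImportError:
        # scikit-learn is not installed (or exposes neither path).
        check_scoring = None

if check_scoring is None:
    print("check_scoring is unavailable in this environment")
else:
    print("check_scoring imported from", check_scoring.__module__)
```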

opened by akshatpv 2

docs: fix simple typo, puncutation -> punctuation

There is a small typo in docs/source/formatting_data.rst.

Should read punctuation rather than puncutation.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 1

Update DataFrameVectorizer.py

DeprecationWarning: The module is deprecated in version 0.21 and removed in version 0.23. This module was removed in the latest scikit-learn version. please remove this module.

opened by karthikreddykuna 1

Releases(v2.7.0)

v2.7.0(Sep 12, 2017)

Ensembling's back for its alpha release, evolutionary algorithms are doing our hyperparameter search now, and we've handled a bunch of dependency updates and a bunch of smaller performance tweaks.
v2.4.1(Jul 19, 2017)

v2.4.0(Jul 14, 2017)

Using quantile regression, we can now return prediction intervals.

Another minor change is adding in a column of absolute changes for feature_responses
v2.3.5(Jul 9, 2017)

LightGBM and sklearn's gbm now use warm_starting or iterative training to find the best number of trees
v2.2.1(Jun 13, 2017)

Avoids double training deep learning models, changes how we sort and order features for analytics reporting, and adds a new _all_small_categories category to categorical ensembling.
v2.2.0(Jun 6, 2017)

Feature responses allows linear-model-like interpretations for non-linear models.
2.1.5(May 18, 2017)

Avoids mutating input DF.

Standardizes examples and tests to use load_ml_model().
2.1.2(May 3, 2017)

Some bugfixes
2.1(Apr 19, 2017)

Feature learning and categorical ensembling are really cool features that each get us 2-5% accuracy gains!

For full info, check the docs.
v2.0.0(Apr 4, 2017)
Enough incremental improvements have added up that we're now ready to mark a 2.0 release!

Part of the progress also means deprecating a few unused features that were adding unnecessary complexity and preventing us from implementing new features like ensembling properly.

New changes for the 2.0 release:

Refactored and cleaned up code. Ensembling should now be much easier to add in, and in a way that's fast enough to be used in production (getting predictions from 10 models should take less than 10x as long as getting predictions from 1 model)

Deprecated compute_power

Deprecated several methods for grid searching over transformation_pipeline hyperparameters (different methods for feature selection, whether or not to do feature scaling, etc.). We just directly made a decision to prioritize the final model hyperparameter search.

Deprecated the current implementation of ensembling. It was implemented in such a way that it was not quick enough to make predictions in prod, and thus, did not meet the primary use cases of this project. Part of removing it allows us to reimplement ensembling in a way that is prod-ready.

Deprecated X_test and y_test, except for working with calibrate_final_model.

Added better documentation on features that were in silent alpha release previously.

Improved test coverage!

Major changes since the 1.0 release:

Integrations for deep learning (using TensorFlow and Keras)

Integration of Microsoft's LightGBM, which appears to be a possibly better version of XGBoost

Quite a bit more user logging, warning, and input validation/input cleaning

Quite a few edge case bug fixes and minor performance improvements

Fully automated test suite with decent test coverage!

Better documentation

Support for pandas DataFrames- much more space efficient than lists of dictionaries

auto_ml-2.0.0-py2.py3-none-any.whl(47.43 KB)
auto_ml-2.0.0.tar.gz(41.64 KB)
v1.12.2(Mar 16, 2017)

This will be our final release before v2.

Includes many recent changes- Deep Learning with Keras/TensorFlow, more efficient hyperparameter optimization, Microsoft's LightGBM, more advanced logging for scoring, and quite a few minor usability improvements (like improved logging when input is not as expected).
v1.3(Oct 11, 2016)

As of the 1.3 release, we now support taking in Pandas DataFrames, in addition to a list of dictionaries.

This is much more memory efficient, allowing us to now train subpredictors in parallel.

There's also better input validation and message logging to the users.
auto_ml-1.3-py2.py3-none-any.whl(29.95 KB)
auto_ml-1.3.tar.gz(31.70 KB)

Owner

Preston Parry

Parris, the automated infrastructure setup tool for machine learning algorithms.

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services.

Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Apache Spark - A unified analytics engine for large-scale data processing

X-modaler is a versatile and high-performance codebase for cross-modal analytics.

The DL Streamer Pipeline Zoo is a catalog of optimized media and media analytics pipelines.

Scripts of Machine Learning Algorithms from Scratch. Implementations of machine learning models and algorithms using nothing but NumPy with a focus on accessibility. Aims to cover everything from basic to advance.

This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Developed By Google!

Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

TorchX is a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline.

Source code for "Progressive Transformers for End-to-End Sign Language Production" (ECCV 2020)

Contra is a lightweight, production ready Tensorflow alternative for solving time series prediction challenges with AI

Reinforcement Learning for Automated Trading

AutoPentest-DRL: Automated Penetration Testing Using Deep Reinforcement Learning