Provide an input CSV and a target field to predict, generate a model + code to run it.

Max Woolf

Last update: Jan 4, 2023


Overview

automl-gs

Give an input CSV file and a target field you want to predict to automl-gs, and get a trained high-performing machine learning or deep learning model plus native Python code pipelines allowing you to integrate that model into any prediction workflow. No black box: you can see exactly how the data is processed, how the model is constructed, and you can make tweaks as necessary.

automl-gs is an AutoML tool which, unlike Microsoft's NNI, Uber's Ludwig, and TPOT, offers a zero code/model definition interface for getting an optimized model and data transformation pipeline in multiple popular ML/DL frameworks, with minimal Python dependencies (pandas + scikit-learn + your framework of choice). automl-gs is designed for citizen data scientists and engineers without a deep statistical background, under the philosophy that you don't need to know any modern data preprocessing and machine learning engineering techniques to create a powerful prediction workflow.

Nowadays, the cost of computing many different models and hyperparameters is much lower than the opportunity cost of a data scientist's time. automl-gs is a Python 3 module designed to abstract away the common approaches to transforming tabular data, architecting machine learning/deep learning models, and performing random hyperparameter searches to identify the best-performing model. This allows data scientists and researchers to better utilize their time on model performance optimization.

Generates native Python code; no platform lock-in, and no need to use automl-gs after the model script is created.
Train model configurations super-fast for free using a TPU and TensorFlow in Google Colaboratory. (in Beta: you can access the Colaboratory notebook here).
Handles messy datasets that normally require manual intervention, such as datetime/categorical encoding and spaced/parenthesized column names.
Each part of the generated model pipeline is its own function w/ docstrings, making it much easier to integrate into production workflows.
Extremely detailed metrics reporting for every trial stored in a tidy CSV, allowing you to identify and visualize model strengths and weaknesses.
Correct serialization of data pipeline encoders on disk (i.e. no pickled Python objects!)
Retrain the generated model on new data without making any code/pipeline changes.
Quit the hyperparameter search at any time, as the results are saved after each trial.
Training progress bars with ETAs for both the overall experiment and per-epoch during the experiment.
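
The point above about avoiding pickled objects can be illustrated with a minimal sketch. It assumes a simple categorical encoder whose entire state is a list of classes; the function names and the file name `sex_encoder.json` are hypothetical and are not taken from the code automl-gs generates.

```python
import json
import os
import tempfile


def fit_label_encoder(values):
    """Return the encoder state: a sorted list of the distinct categories."""
    return sorted(set(values))


def save_encoder(classes, path):
    """Persist the encoder state as plain JSON (human-readable, no pickle)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"classes": classes}, f)


def load_encoder(path):
    """Load the encoder state back from JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["classes"]


def transform(values, classes):
    """Map each value to its integer index; unseen values map to -1."""
    index = {c: i for i, c in enumerate(classes)}
    return [index.get(v, -1) for v in values]


if __name__ == "__main__":
    classes = fit_label_encoder(["male", "female", "female"])
    path = os.path.join(tempfile.mkdtemp(), "sex_encoder.json")  # hypothetical name
    save_encoder(classes, path)
    print(transform(["female", "male", "other"], load_encoder(path)))
```

A JSON file like this can be inspected, diffed, and loaded without executing arbitrary code, which is the usual argument against pickling encoders.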

The models generated by automl-gs are intended to give a very strong baseline for solving a given problem; they're not the end-all-be-all that often accompanies the AutoML hype, but the resulting code is easily tweakable to improve from the baseline.

You can view the hyperparameters and their values here, and the metrics that can be optimized here. Some of the more controversial design decisions for the generated models are noted in DESIGN.md.

Framework Support

Currently automl-gs supports the generation of models for regression and classification problems using the following Python frameworks:

TensorFlow (via tf.keras) | tensorflow
XGBoost (w/ histogram binning) | xgboost

To be implemented:

Catboost | catboost
LightGBM | lightgbm

How to Use

automl-gs can be installed via pip:

pip3 install automl_gs

You will also need to install the corresponding ML/DL framework (e.g. tensorflow/tensorflow-gpu for TensorFlow, xgboost for xgboost, etc.)

After that, you can run it directly from the command line. For example, with the famous Titanic dataset:

automl_gs titanic.csv Survived

If you want to use a different framework or configure the training, you can do it with flags:

automl_gs titanic.csv Survived --framework xgboost --num_trials 1000

You may also invoke automl-gs directly from Python. (e.g. via a Jupyter Notebook)

from automl_gs import automl_grid_search

automl_grid_search('titanic.csv', 'Survived')

The output of the automl-gs training is:

A timestamped folder (e.g. automl_tensorflow_20190317_020434) which contains:
- model.py: The generated model file.
- pipeline.py: The generated pipeline file.
- requirements.txt: The generated requirements file.
- /encoders: A folder containing JSON-serialized encoder files
- /metadata: A folder containing training statistics + other cool stuff not yet implemented!
- The model itself (format depends on framework)
automl_results.csv: A CSV containing the training results after each epoch and the hyperparameters used to train at that time.
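
Because every trial is logged to automl_results.csv, the best-scoring row can be picked out with a few lines of standard-library Python. This is a sketch only: the column names `trial_id`, `epoch`, and `accuracy` in the sample are assumptions for illustration, so check the header of your own results file for the actual names.

```python
import csv
import io


def best_row(csv_file, metric, maximize=True):
    """Return the row (as a dict) with the best value of `metric`."""
    rows = [r for r in csv.DictReader(csv_file) if r.get(metric) not in (None, "")]
    if not rows:
        raise ValueError(f"no rows with a value for {metric!r}")
    pick = max if maximize else min
    return pick(rows, key=lambda r: float(r[metric]))


if __name__ == "__main__":
    # Hypothetical file contents; real column names may differ.
    sample = io.StringIO(
        "trial_id,epoch,accuracy\n"
        "a,1,0.71\n"
        "a,2,0.78\n"
        "b,1,0.80\n"
    )
    print(best_row(sample, "accuracy"))
```

With a real file, open it with `open("automl_results.csv", newline="")` and pass the file object along with the name of the metric column you care about; use `maximize=False` for loss-style metrics.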

Once the training is done, you can run the generated files from the command line within the generated folder above.

To predict:

python3 model.py -d data.csv -m predict

To retrain the model on new data:

python3 model.py -d data.csv -m train
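
The two commands above differ only in the `-m` flag. As a rough illustration of that interface (not the actual contents of the generated model.py), a script could parse the same two flags with `argparse`. The long option names `--data` and `--mode` and the default mode are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the -d/-m flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Predict with or retrain a model.")
    parser.add_argument("-d", "--data", required=True, help="path to the input CSV")
    parser.add_argument(
        "-m", "--mode", choices=["train", "predict"], default="predict",
        help="whether to retrain the model or generate predictions",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["-d", "data.csv", "-m", "predict"])
    print(args.data, args.mode)
```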

CLI Arguments/Function Parameters

You can view these at any time by running automl_gs -h in the command line.

csv_path: Path to the CSV file (must be in the current directory) [Required]
target_field: Target field to predict [Required]
target_metric: Target metric to optimize [Default: Automatically determined depending on problem type]
framework: Machine learning framework to use [Default: 'tensorflow']
model_name: Name of the model (if you want to train models with different names) [Default: 'automl']
num_trials: Number of trials / different hyperparameter combos to test. [Default: 100]
split: Train-validation split when training the models [Default: 0.7]
num_epochs: Number of epochs / passes through the data when training the models. [Default: 20]
col_types: Dictionary of fields:data types to use to override automl-gs's guesses. (only when using in Python) [Default: {}]
gpu: For non-Tensorflow frameworks and Pascal-or-later GPUs, boolean to determine whether to use GPU-optimized training methods (TensorFlow can detect it automatically) [Default: False]
tpu_address: For TensorFlow, hardware address of the TPU on the system. [Default: None]
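
When scripting many runs, it can help to build the command line from the parameters listed above. The sketch below assumes every option follows the `--name value` pattern shown earlier for `--framework` and `--num_trials`; confirm the exact flags with `automl_gs -h`. `build_command` is a hypothetical helper, not part of automl-gs.

```python
import shlex


def build_command(csv_path, target_field, **options):
    """Return an argv list such as ['automl_gs', 'titanic.csv', 'Survived', ...]."""
    argv = ["automl_gs", csv_path, target_field]
    for name, value in options.items():
        if value is None:
            continue  # skip unset options so the tool's default applies
        argv += [f"--{name}", str(value)]
    return argv


if __name__ == "__main__":
    cmd = build_command("titanic.csv", "Survived", framework="xgboost", num_trials=1000)
    print(shlex.join(cmd))
    # To actually run it (requires automl_gs to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting concerns.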

Examples

For a quick Hello World on how to use automl-gs, see this Jupyter Notebook.

Due to the size of some examples w/ generated code and accompanying data visualizations, they are maintained in a separate repository. (and also explain why there are two distinct "levels" in the example viz above!)

How automl-gs Works

TL;DR: automl-gs generates raw Python code using Jinja templates and trains a model using the generated code in a subprocess: repeat using different hyperparameters until done and save the best model.
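
The generate-then-run idea can be sketched with the standard library alone. Here `string.Template` stands in for Jinja, and the "training script" merely prints a made-up score. The template text, the `learning_rate` hyperparameter, and the helper names are all illustrative assumptions rather than automl-gs internals.

```python
import os
import subprocess
import sys
import tempfile
from string import Template

# Stand-in for a Jinja template: a tiny script parameterized by one hyperparameter.
SCRIPT_TEMPLATE = Template(
    "learning_rate = $learning_rate\n"
    "# A real script would train a model here; this prints a fake score.\n"
    "print(1.0 - abs(learning_rate - 0.1))\n"
)


def render_script(params, directory):
    """Write the rendered script to disk and return its path."""
    path = os.path.join(directory, "model.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SCRIPT_TEMPLATE.substitute(params))
    return path


def run_trial(params):
    """Generate a script for `params`, run it in a subprocess, return its score."""
    with tempfile.TemporaryDirectory() as tmp:
        script = render_script(params, tmp)
        result = subprocess.run(
            [sys.executable, script], capture_output=True, text=True, check=True
        )
    return float(result.stdout.strip())


if __name__ == "__main__":
    for lr in (0.01, 0.1, 0.5):
        print(lr, run_trial({"learning_rate": lr}))
```

Running each trial in its own process keeps the generated script independent of the tool that produced it, which is what makes the script usable on its own afterwards.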

automl-gs loads a given CSV and infers the data type of each column to be fed into the model. Then it tries an ETL strategy for each column field as determined by the hyperparameters; for example, a Datetime field has its hour and dayofweek binary-encoded by default, but hyperparameters may dictate the encoding of month and year as additional model fields. ETL strategies are optimized for frameworks; TensorFlow for example will use text embeddings, while other frameworks will use CountVectorizers to encode text (when training, TensorFlow will also use a shared text encoder via Keras's functional API). automl-gs then creates a statistical model with the specified framework. Both the model ETL functions and model construction functions are saved as a generated Python script.
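
As an illustration of the default datetime handling described above, the sketch below derives the hour and day of week from a timestamp and one-hot encodes each. Treating "binary-encoded" as one-hot vectors is an assumption here, and `encode_datetime` is a hypothetical helper, not a function from the generated pipeline.

```python
from datetime import datetime


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Encode a timestamp string as one-hot hour (24) + day of week (7)."""
    dt = datetime.strptime(value, fmt)
    # weekday(): Monday is 0 and Sunday is 6.
    return one_hot(dt.hour, 24) + one_hot(dt.weekday(), 7)


if __name__ == "__main__":
    features = encode_datetime("2019-03-17 02:04:34")
    # Print the vector length and the positions of the hour and weekday bits.
    print(len(features), features[:24].index(1), features[24:].index(1))
```

Adding month or year, as the hyperparameters may dictate, would simply append further one-hot blocks to the same feature vector.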

automl-gs then runs the generated training script as if it was a typical user. Once the model is trained, automl-gs saves the training results in its own CSV, along with all the hyperparameters used to train the model. automl-gs then repeats the task with another set of hyperparameters, until the specified number of trials is hit or the user kills the script.

The best model Python script is kept after each trial, which can then easily be integrated into other scripts, or run directly to get the prediction results on a new dataset.
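
The overall search loop described in this section can be sketched as follows. The search space, the toy `evaluate` function, and the CSV columns are invented for illustration. The point is the shape of the loop: sample hyperparameters at random, score them, append the result to a CSV after every trial so the search can be interrupted, and keep track of the best configuration.

```python
import csv
import io
import random

# Hypothetical search space; automl-gs defines its own hyperparameters.
SEARCH_SPACE = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [1, 2, 3],
}


def sample_params(rng):
    """Pick one value at random for each hyperparameter."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def evaluate(params):
    """Toy stand-in for training a model and returning a validation score."""
    return params["learning_rate"] * 5 + params["num_layers"] * 0.1


def random_search(num_trials, results_file, seed=0):
    """Run `num_trials` random trials, logging each one as soon as it finishes."""
    rng = random.Random(seed)
    writer = csv.DictWriter(results_file, fieldnames=["trial", *SEARCH_SPACE, "score"])
    writer.writeheader()
    best = None
    for trial in range(1, num_trials + 1):
        params = sample_params(rng)
        score = evaluate(params)
        writer.writerow({"trial": trial, **params, "score": score})
        results_file.flush()  # results survive if the search is stopped early
        if best is None or score > best[1]:
            best = (params, score)
    return best


if __name__ == "__main__":
    log = io.StringIO()
    best_params, best_score = random_search(20, log)
    print(best_params, best_score)
```

In practice `results_file` would be a file opened on disk, and `evaluate` would be replaced by generating and running a training script for the sampled hyperparameters.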

Helpful Notes

It is the user's responsibility to ensure the input dataset is high-quality. No model hyperparameter search will provide good results on flawed/unbalanced datasets. Relatedly, hyperparameter optimization may provide optimistic predictions on the validation set, which may not necessarily match the model performance in the real world.
A neural network approach alone may not necessarily be the best approach. Try using xgboost. The results may surprise you!
automl-gs is only attempting to solve tabular data problems. If you have a more complicated problem to solve (e.g. predicting a sequence of outputs), I recommend using Microsoft's NNI or Uber's Ludwig as noted in the introduction.

Known Issues

Issues when using Anaconda (#8). Use an installed Python if possible.
Issues when using Windows (#13)
Issues when a field name in the input dataset starts with a number (#18)
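
The last issue arises because a column name such as `0` is not a valid Python identifier, so a generated variable name derived from it fails to parse. One possible workaround, offered as a sketch that has not been verified against automl-gs itself, is to rename such columns before training. `safe_name` is a hypothetical helper and the `col_` prefix is an arbitrary choice.

```python
import keyword
import re


def safe_name(column, prefix="col_"):
    """Turn an arbitrary column name into a valid Python identifier."""
    # Replace every character that is not a letter, digit, or underscore.
    name = re.sub(r"\W", "_", column.strip())
    # Identifiers cannot be empty, start with a digit, or be a reserved keyword.
    if not name or name[0].isdigit() or keyword.iskeyword(name):
        name = prefix + name
    return name


if __name__ == "__main__":
    for original in ["0", "concave points_mean", "Unnamed: 32", "class"]:
        print(repr(original), "->", safe_name(original))
```

With pandas, the helper can be applied to a whole frame via `df.rename(columns=safe_name)` before writing the CSV that is passed to automl-gs.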

Future Work

Feature development will continue on automl-gs as long as there is interest in the package.

Top Priority

Add more frameworks
Results visualization (via plotnine)
Holiday support for datetimes
Remove redundant generated code
Native distributed/high level automation support (Polyaxon/Kubernetes, Airflow)
Image field support (both as a CSV column field, and a special flow mode to take advantage of hyperparameter tuning)
PyTorch model code generation.

Elsework

Generate script given an explicit set of hyperparameters
More hyperparameters.
Bayesian hyperparameter search for standalone version.
Support for generating model code for R/Julia
Tool for generating a Flask/Starlette REST API from a trained model script
Allow passing an explicit, pre-defined test set CSV.

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

The code generated by automl-gs is unlicensed; the owner of the generated code can decide the license.

Comments

YAMLLoadWarning disrupting progress bar

While trying out the example Titanic dataset in a conda environment, I encountered the following warning so frequently that it disrupts the tqdm progress bar.

/anaconda3/envs/automl-gs/lib/python3.6/site-packages/automl_gs/utils_automl.py:270:
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default 
Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  metrics = yaml.load(f)

opened by remykarem 3

fix #13

This initial commit enables automl-gs to work on win10. I can test on win8 and linux, but don't have access to a Mac to make sure everything still works normally there.

opened by evan-burke 3

FileNotFoundError

Hi,

Just trying to work through your example colab notebook. I work through the cells, upload the titanic.csv, and get

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-3-9f452c025bdd> in <module>()
      2                    target_field='origin',
      3                    model_name='tpu',
----> 4                    tpu_address = tpu_address)

5 frames

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File tpu_train/metadata/results.csv does not exist: 'tpu_train/metadata/results.csv'

opened by shawngraham 1

xgboost: GPU support
xgboost supports GPUs by setting gpu_hist instead of hist, and the code is prepared for that. Two problems:

Unlike TensorFlow, xgboost does not have a way to automatically determine if a GPU is present.

GPU hist training requires a Pascal GPU at minimum; the GPUs in Colaboratory notebooks are K80s (Kepler), which do not qualify.

Will keep at CPU support for now but there has to be a better solution.
opened by minimaxir 1
Fixed some typos and grammar
Changes:

parathesized ==> parenthesized

JSON-seralized ==> JSON-serialized

necessairly ==> necessarily (2)

hyperameter ==> hyperparameter
opened by mikeshatch 0
Ensure col_types works as expected

I started playing with a slightly messy file I had available, and noticed that the col_types argument didn't seem to work to allow me to ignore/override some columns. Found what seems to be the issue and it works for me locally.

Thanks!

opened by drien 0
SyntaxError: invalid decimal literal produced by automl

Wanted to try automl_gs, but I get this error and can't figure out why.

File "C:\Users\XXX\automl_train\model.py", line 3, in <module>
    from pipeline import *
File "C:\Users\XXX\automl_train\pipeline.py", line 29
    0_enc = df['0']
    ^
SyntaxError: invalid decimal literal

Any ideas about that?

opened by Haifischfutter 0

Google Colab - automl_train/metadata/results.csv does not exist

Input:

from automl_gs import automl_grid_search

automl_grid_search("data.csv", "diagnosis")

Output:

Solving a binary_classification problem, maximizing accuracy using tensorflow.

Modeling with field specifications:
id: ignore
radius_mean: numeric
texture_mean: numeric
perimeter_mean: numeric
area_mean: numeric
smoothness_mean: numeric
compactness_mean: numeric
concavity_mean: numeric
concave points_mean: numeric
symmetry_mean: numeric
fractal_dimension_mean: numeric
radius_se: numeric
texture_se: numeric
perimeter_se: numeric
area_se: numeric
smoothness_se: numeric
compactness_se: numeric
concavity_se: numeric
concave points_se: numeric
symmetry_se: numeric
fractal_dimension_se: numeric
radius_worst: numeric
texture_worst: numeric
perimeter_worst: numeric
area_worst: numeric
smoothness_worst: numeric
compactness_worst: numeric
concavity_worst: numeric
concave points_worst: numeric
symmetry_worst: numeric
fractal_dimension_worst: numeric
Unnamed: 32: numeric
0%
0/100 [00:04<?, ?trial/s]
0%
0/20 [00:00<?, ?epoch/s]
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-37-308e97508c91> in <module>()
      1 from automl_gs import automl_grid_search
      2 
----> 3 automl_grid_search("data.csv", "diagnosis")

5 frames
/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File automl_train/metadata/results.csv does not exist: 'automl_train/metadata/results.csv'

opened by batmanscode 3

Fix duplicates kwarg in pd.cut

As of pandas 0.23, if the 'bins' array may contain duplicate edges, you have to pass the 'duplicates' kwarg (e.g. 'drop') to pd.cut. Otherwise, it will throw an error.

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html

opened by kitkatk 0
Using automl-gs for bin packing

Is there any way to use a tool like automl-gs for bin-packing problems? I've seen a way to model it as linear optimization where you do a cross join of all objects and all potential bins, set a constraint of each object being selected once, and then optimizing the bins however desired. Cross joins can end up being needlessly heavy though, so I'm wondering if there is a way to model that sort of problem such that you could use pre-optimized and continually developed tools like this one.

opened by jdotjdot 0
AttributeError: 'float' object has no attribute 'lower'

While trying out automl_gs with Jupyter, I got a file-not-found error:

FileNotFoundError: [Errno 2] File b'automl_train/metadata/results.csv' does not exist: b'automl_train/metadata/results.csv'

When I run it from the terminal instead, it returns the following before the missing-file error:

AttributeError: 'float' object has no attribute 'lower'

Searching in StackOverflow, I found that the problem is how pandas converts inputs to python datatypes.

https://stackoverflow.com/questions/34724246/attributeerror-float-object-has-no-attribute-lower/34724771

Is it possible to prevent this behaviour using automl_gs ?

opened by AlexandraMakarova 0

Releases(v0.2.1)

v0.2.1(Apr 5, 2019)

Resolves Windows support (hopefully) and YAML warnings.

Thanks to all the PRs from the contributors!
v0.2(Mar 26, 2019)


Owner

Max Woolf

Data Scientist @buzzfeed. Plotter of pretty charts.

