An AutoML survey focusing on practical systems.

AutoGOAL

Last update: Aug 14, 2022

AutoML Survey

An (in-progress) AutoML survey focusing on practical systems.

This project is a community effort to build and maintain an up-to-date, beginner-friendly introduction to AutoML, focusing on practical systems. AutoML is a big field that continues to grow daily, so we cannot hope to provide a comprehensive description of every interesting idea or approach available. Instead, we decided to focus on practical AutoML systems and spread outwards from there into the methodologies and theoretical concepts that power them. Our intuition is that, even though many interesting ideas are still in the research stage, the most mature and battle-tested concepts are those that have been successfully applied to construct practical AutoML systems.

To this end, we are building a database of qualitative criteria for all AutoML systems we've heard of. We define an AutoML system as a software project that can be used by non-experts in machine learning to build effective ML pipelines on at least some common domains and tasks. It doesn't matter whether it is open-source or commercial, or whether it is a library, an application with a GUI, or a cloud service. What matters is that it is intended to be used in practice, as opposed to, say, a reference implementation of a novel AutoML strategy in a Jupyter Notebook.

Features of an AutoML system

For each system we are creating a system card that describes what are, in our opinion, its most relevant features, from both the scientific and the engineering points of view. System cards use a YAML-based definition. Most of the features are self-explanatory.

💡 Check data/systems/_template.yml for a starting template.

Basic information

Basic characteristics of the system as a software product.

name (str): Name of the system.
description (str): A short (2-4 sentences) description of the system.
website (str): The URL of the main website or documentation.
open_source (bool): Whether the system is open-source.
institutions (list[str]): List of businesses or academic institutions that directly support the development of the system, and/or hold intellectual property over it.
repository (str): If it's open-source, link of a public source code repository, otherwise null.
license (str): If it's open-source, a license key, otherwise null.
references (list[str]): List of links to relevant papers, sorted by relevance rather than date. DOIs or other persistent identifiers are preferred, but links to arxiv.org or other repositories are also accepted.
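
As a rough illustration, the basic information of a card could be filled in as below. This is only a sketch: ExampleAutoML, its institution, and all URLs are made-up placeholders, the fields are assumed to sit at the top level of the file, and the license value assumes an SPDX-style identifier. data/systems/_template.yml remains the authoritative layout.

```yaml
# Hypothetical system card fragment; every value is a placeholder.
name: ExampleAutoML
description: >
  ExampleAutoML is a made-up library used here for illustration only.
  It searches over pipelines for tabular data.
website: https://example.org/example-automl
open_source: true
institutions:
  - Example University
# For a closed-source system, repository and license would both be null.
repository: https://example.org/example-automl/source
license: MIT
references:
  # Most relevant first, not most recent first.
  - https://example.org/papers/example-automl
```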

User interfaces

Characteristics describing how the users interact with the system.

cli (bool): Whether the system has a command line interface.
gui (bool): Whether the system has a graphical user interface.
http (bool): Whether the system can be used from an HTTP RESTful API.
library (bool): Whether the system can be linked as a code library.
programming_languages (list[str]): List of programming languages in which the system can be used, i.e., it is either natively coded in that language or there are maintained bindings (as opposed to using language X's standard way to call code from language Y).
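
For a hypothetical system that ships as a Python library with a command line tool, the interface fields might read as follows. The lowercase spelling of the language name is an assumption; check existing files in data/systems for the convention actually used.

```yaml
# Hypothetical values for a library-plus-CLI system.
cli: true
gui: false
http: false
library: true
programming_languages:
  - python
```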

Domains

Characteristics describing the domains in which the system can be applied, which roughly correspond to the types of input data that the system can handle.

domains (list[str]): Domains in which the system can be deployed. Valid values are:
- images
- nlp
- tabular
- time_series
multi_domain (bool): Whether the system supports multiple domains for a single workflow, e.g., by allowing multiple inputs of different types simultaneously
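
A sketch of the domain fields for a hypothetical system that handles tabular and text data, but only one kind of input per workflow:

```yaml
# Hypothetical values; domain names come from the list of valid values above.
domains:
  - tabular
  - nlp
multi_domain: false
```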

Techniques

Characteristics describing the actual models and techniques used in the system, and the underlying ML libraries where those techniques are implemented.

techniques (list[str]): List of high-level techniques that are available in the system, broadly classified according to model families. Valid values are:
- linear_models
- trees
- bayesian
- kernel_machines
- graphical_models
- mlp
- cnn
- rnn
- pretrained
- ensembles
- ad_hoc ( 📝 indicates non-ML algorithms, e.g., tokenizers)
distillation (bool): Whether the system supports model distillation
ml_libraries (list[str]): List of ML libraries that support the system, i.e., where the techniques are actually implemented, if any. This is a free-form list of strings. Some examples are:
- scikit-learn
- keras
- pytorch
- nltk
- spacy
- transformers
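
For example, a hypothetical system built on classical models from scikit-learn could declare its techniques like this (the values are illustrative only):

```yaml
# Hypothetical values; technique names come from the list of valid values above.
techniques:
  - linear_models
  - trees
  - ensembles
distillation: false
ml_libraries:
  - scikit-learn
```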

Tasks

Characteristics describing the types of tasks, or problems, in which the system can be applied, which roughly correspond to the types of outputs supported.

tasks (list[str]): List of high-level tasks the system can perform automatically. Valid values are:
- classification
- structured_prediction
- structured_generation
- unstructured_generation
- regression
- clustering
- imputation
- segmentation
- feature_preprocessing
- feature_selection
- data_augmentation
- dimensionality_reduction
- data_preprocessing ( 📝 domain-agnostic data preprocessing such as normalization and scaling)
- domain_preprocessing ( 📝 refers to domain-specific preprocessing, e.g., stemming)
multi_task (bool): Whether the system supports multiple tasks in a single workflow, e.g., by allowing multiple output heads from the same neural network.
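
A minimal sketch of the task fields, assuming a hypothetical system that solves supervised problems one task at a time and automates some of the surrounding preprocessing:

```yaml
# Hypothetical values; task names come from the list of valid values above.
tasks:
  - classification
  - regression
  - feature_selection
  - data_preprocessing
multi_task: false
```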

Search strategies

Characteristics describing the optimization/search strategies used for model search and/or hyperparameter tuning.

search_strategies (list[str]): List of high-level search strategies that are available in the system. Valid values are:
- random
- evolutionary
- gradient_descent
- hill_climbing
- bayesian
- grid
- hyperband
- reinforcement_learning
- constructive
- monte_carlo
meta_learning (list[str]): If the system includes meta-learning, list of broadly classified techniques used. Valid values are:
- portfolio
- warm_start
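
The search fields of a hypothetical system that combines Bayesian optimization with Hyperband and warm-starts from previous runs might look like this:

```yaml
# Hypothetical values; both lists use the valid values given above.
search_strategies:
  - bayesian
  - hyperband
meta_learning:
  - warm_start
```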

Search space

Characteristics describing the search space, the types of hyperparameters that can be optimized, and the types of ML pipelines that can be represented in this space.

search_space: High-level characteristics of the hyperparameter search space.
- hierarchical (bool): If there are hyperparameters that only make sense conditioned on others.
- probabilistic (bool): If the hyperparameter space has an associated probabilistic model.
- differentiable (bool): If the hyperparameter space can be used for gradient descent.
- automatic (bool): If the global structure of the hyperparameter space is inferred automatically from, e.g., type annotations or the model's documentation, as opposed to explicitly defined by the developers or the user.
- hyperparameters (list[str]): Types of hyperparameters that can be optimized. Valid values are:
  - continuous
  - discrete
  - categorical
  - conditional
- pipelines: Types of pipelines that can be discovered by the AutoML process. Each of the following keys is boolean.
  - single (bool): A single estimator (or model in general)
  - fixed (bool): A fixed pipeline with several, but predefined, steps
  - linear (bool): A variable-length pipeline where each step feeds on the immediately previous output
  - graph (bool): An arbitrarily graph-shaped pipeline where each step can feed on any of the previous outputs
- robust (bool): Whether the search space contains potentially invalid pipelines that are only discovered when evaluated, e.g., allowing a dense-only estimator to precede a sparse transformer.
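
Because search_space is the most deeply nested field, a sketch of its shape may help. The nesting below is inferred from the field list above (a mapping with a nested pipelines mapping) and the values are arbitrary; data/systems/_template.yml is the reference for the exact layout.

```yaml
# Hypothetical values; the nesting is inferred from the field list above.
search_space:
  hierarchical: true
  probabilistic: false
  differentiable: false
  automatic: false
  hyperparameters:
    - continuous
    - discrete
    - categorical
    - conditional
  pipelines:
    single: true
    fixed: true
    linear: false
    graph: false
  robust: false
```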

Software architecture

Other characteristics describing general features of the system as a software product.

extensible (bool): Whether the system is designed to be extensible, in the sense that a user can easily add a single new type of model, search algorithm, etc., without modifying any part of the system.
accessible (bool): Whether the models obtained from the AutoML process can be freely inspected by the user up to the level of individual parameters (e.g., neural network weights).
portable (bool): Whether the models obtained can be exported out of the AutoML system, either in a standard format or, at least, in a format native to the underlying ML library, such that they can be deployed on another platform without depending on the AutoML system itself.
computational_resources: Computational resources that, if available, can be leveraged by the system.
- gpu (bool): Whether the system supports GPUs.
- tpu (bool): Whether the system supports TPUs.
- cluster (bool): Whether the system supports cluster-based parallelism.
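
Finally, the architecture fields for a hypothetical CPU-only system that can distribute work across a cluster could be written as follows (values are illustrative, and computational_resources is assumed to be a nested mapping as the field list suggests):

```yaml
# Hypothetical values for a CPU-only, cluster-capable system.
extensible: true
accessible: true
portable: true
computational_resources:
  gpu: false
  tpu: false
  cluster: true
```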

How to contribute

If you are an author or a user of any practical AutoML system that roughly fits the previous criteria, we would love to have your contributions. You can add new systems, add information for existing ones, or fix anything that is incorrect.

To do this, either create a new or modify an existing file in data/systems. Once done, you can run make check to ensure that the modifications are valid with respect to the schema defined in scripts/models.py. If you need to add new fields, or new values to any of the enumerations defined, feel free to modify the corresponding schema as well (and modify both data/systems/_template.yml and this README).

Once validated, you can open a pull request.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
