pure-predict: Machine learning prediction in pure Python

pure-predict speeds up and slims down machine learning prediction applications. It is a foundational tool for serverless inference or small batch prediction with popular machine learning frameworks like scikit-learn and fasttext. It implements the predict methods of these frameworks in pure Python.

Primary Use Cases

The primary use case for pure-predict is the following scenario:

  1. A model is trained in an environment without strong container footprint constraints. Perhaps a long running "offline" job on one or many machines where installing a number of python packages from PyPI is not at all problematic.
  2. At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one "record" (one "row" in a numpy array or one string of text to classify) per request or a mini-batch of records per request.
  3. Preferred infrastructure for the prediction service is either serverless (AWS Lambda) or a container service where the memory footprint of the container is constrained.
  4. The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (10s to 100s of MBs).

In this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single-record predictions per request, the numpy and scipy functionality in the prediction methods of popular machine learning frameworks works against the application in terms of latency, underperforming pure Python in some cases.

Check out the blog post for more information on the motivation and use cases of pure-predict.

Package Details

pure-predict is a Python package for machine learning prediction distributed under the Apache 2.0 software license. It contains multiple subpackages which mirror their open source counterparts (scikit-learn, fasttext, etc.). Each subpackage has utilities to convert a fitted machine learning model into a custom object whose prediction methods mirror their native counterparts but are implemented in pure Python. Additionally, all relevant model artifacts needed for prediction are converted to pure Python.

A pure-predict model object can then be pickled and later unpickled without any 3rd party dependencies other than pure-predict.

This eliminates the need to have large dependency packages installed in order to make predictions with fitted machine learning models using popular open source packages for training models. These dependencies (numpy, scipy, scikit-learn, fasttext, etc.) are large in size and not always necessary to make fast and accurate predictions. Additionally, they rely on C extensions that may not be ideal for serverless applications with a python runtime.
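
To make "prediction in pure Python" concrete, here is a minimal sketch of a binary linear classifier's predict step using only the standard library. It is an illustration, not the package's actual implementation: the function names and the fitted parameters below are hypothetical.

```python
import math


def predict_proba_binary(coef, intercept, record):
    """Return P(class 1) for one record using only the standard library."""
    # Linear score: dot product of coefficients and features, plus intercept.
    score = sum(c * x for c, x in zip(coef, record)) + intercept
    # The logistic (sigmoid) function maps the score to a probability.
    return 1.0 / (1.0 + math.exp(-score))


def predict_binary(coef, intercept, record, classes=(0, 1)):
    """Pick a class label by thresholding the probability at 0.5."""
    proba = predict_proba_binary(coef, intercept, record)
    return classes[1] if proba >= 0.5 else classes[0]


# Hypothetical fitted parameters, stored as plain Python lists and floats.
coef = [0.8, -1.2, 0.5]
intercept = -0.1

print(predict_binary(coef, intercept, [1.0, 0.2, 0.4]))
```

Because the parameters are plain lists and floats, an object holding them can be pickled and unpickled without numpy or scipy being installed.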

Quick Start Example

In a Python environment with scikit-learn and its dependencies installed:

import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from pure_sklearn.map import convert_estimator

# fit sklearn estimator
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier()
clf.fit(X, y)

# convert to pure python estimator
clf_pure_predict = convert_estimator(clf)
with open("model.pkl", "wb") as f:
    pickle.dump(clf_pure_predict, f)

# make prediction with sklearn estimator
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]

In a Python environment with only pure-predict installed:

import pickle

# load pickled model
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)

# make prediction with pure-predict object
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]
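
For the serverless scenario described under Primary Use Cases, the unpickled model could be wrapped in a request handler along these lines. This is a sketch under stated assumptions: the request shape `{"records": [...]}`, the `load_model` helper and the optional `model` argument are all hypothetical, and only the `(event, context)` handler convention comes from AWS Lambda.

```python
import json
import pickle

_MODEL = None  # cached so a warm process does not unpickle on every request


def load_model(path="model.pkl"):
    """Unpickle the model once and reuse it on later calls."""
    global _MODEL
    if _MODEL is None:
        with open(path, "rb") as f:
            _MODEL = pickle.load(f)
    return _MODEL


def handler(event, context=None, model=None):
    """Predict for a request shaped like {"records": [[...], ...]} (assumed shape)."""
    # The optional model argument is a hypothetical hook for local testing.
    clf = model if model is not None else load_model()
    predictions = clf.predict(event["records"])
    # Convert to a plain list so the response is JSON serializable.
    body = json.dumps({"predictions": list(predictions)})
    return {"statusCode": 200, "body": body}
```

Loading the model outside the per-request path keeps single-record latency low, which is the access pattern this package targets.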

Subpackages

pure_sklearn

Prediction in pure python for a subset of scikit-learn estimators and transformers.

  • estimators
    • linear models - supports the majority of linear models for classification
    • trees - decision trees, random forests, gradient boosting and xgboost
    • naive bayes - a number of popular naive bayes classifiers
    • svm - linear SVC
  • transformers
    • preprocessing - normalization and onehot/ordinal encoders
    • impute - simple imputation
    • feature extraction - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization
    • pipeline - pipelines and feature unions

Sparse data - supports a custom pure Python sparse data object - sparse data is handled as would be expected by the relevant transformers and estimators
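
One common way to represent a sparse row in pure Python is a dictionary from column index to non-zero value. The sketch below illustrates that idea only; it is not the package's actual sparse object, and `to_sparse` and `sparse_dot` are hypothetical names.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}


def sparse_dot(sparse_row, weights):
    """Dot product that visits only the stored (non-zero) entries."""
    return sum(v * weights[i] for i, v in sparse_row.items())


row = [0, 0, 3.0, 0, 1.5]
weights = [0.1, 0.2, 0.3, 0.4, 0.5]

sparse_row = to_sparse(row)
print(sparse_row)
print(sparse_dot(sparse_row, weights))
```

For text features such as tfidf vectors, most entries are zero, so iterating over stored entries only keeps the work proportional to the number of non-zero values rather than the vocabulary size.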

pure_fasttext

Prediction in pure python for fasttext.

  • supervised - predicts labels for supervised models; no support for quantized models (blocked by this issue)
  • unsupervised - lookup of word or sentence embeddings given input text

Installation

Dependencies

pure-predict requires:

Dependency Notes

  • pure_sklearn has been tested with scikit-learn versions >= 0.20 -- certain functionality may work with lower versions but is not guaranteed. Some functionality is explicitly not supported for certain scikit-learn versions and exceptions will be raised as appropriate.
  • xgboost requires version >= 0.82 for support with pure_sklearn.
  • pure-predict is not supported with Python 2.
  • fasttext versions <= 0.9.1 have been tested.

User Installation

The easiest way to install pure-predict is with pip:

pip install --upgrade pure-predict

You can also download the source code:

git clone https://github.com/Ibotta/pure-predict.git

Testing

With pytest installed, you can run tests locally:

pytest pure-predict

Examples

The package contains examples on how to use pure-predict in practice.

Calls for Contributors

Contributions to pure-predict are welcome from anyone. Specific calls for contribution are as follows:

  1. Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.
  2. Adding more pure_sklearn estimators. The scikit-learn package is extensive and only partially covered by pure_sklearn. Regression tasks in particular are missing from pure_sklearn. Clustering, dimensionality reduction, nearest neighbors, feature selection, non-linear SVM, and more are also omitted and would be good candidates for extending pure_sklearn.
  3. General efficiency. There is likely low hanging fruit for improving the efficiency of the numpy and scipy functionality that has been ported to pure-predict.
  4. Threading could be considered to improve performance -- particularly for making predictions with multiple records.
  5. A public AWS lambda layer containing pure-predict.

Background

The project was started at Ibotta Inc. on the machine learning team and open sourced in 2020. It is currently maintained by the machine learning team at Ibotta.

Acknowledgements

Thanks to David Mitchell and Andrew Tilley for internal review before open source. Thanks to James Foley for logo artwork.

IbottaML
Comments
  • error when using using convert_estimator with MultiOutputRegressor

    Describe the bug: ValueError: Cannot find 'pure_sklearn' counterpart for MultiOutputRegressor

    To Reproduce: Convert a pickled MultiOutputRegressor model to pure_sklearn using:

        with open(filename, "rb") as f:
            model = pickle.load(f)
        model = convert_estimator(model)

    Expected behavior: Expected conversion to pure_sklearn version

    bug 
    opened by nrchade 1
  • Cannot find 'pure_sklearn' counterpart for ColumnTransformer

    convert_estimator raises a ValueError when I use it on a sklearn pipeline:

        pipe = convert_estimator(pipe)

    I used the code above on a pipeline and got the error shown in the attached image.

    bug 
    opened by chethan-avyay 1
  • [Question] Easy differentiation between RndmForest Scikit-learn and pure-predict upon loading

    Hi,

    great package! I really like the simplicity and the speed improvement.

    I have users that would like to be able to use both "versions" of the classifier and I would like to find an easy way to differentiate between the two directly after loading them.

    So far I have come up with this check, based on the object's description:

        import pickle

        with open(path_to_sav, "rb") as f:
            clf = pickle.load(f)
        if "pure" in str(clf):
            print(clf)
    
    Output:
    <pure_sklearn.ensemble._forest.RandomForestClassifierPure object at 0x00000201490EEF98>
    

    Is there a better way to do this?
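
One possible alternative to matching on the string representation is to inspect the module in which the object's class is defined. The sketch below assumes, based on the output shown above, that pure-predict classes live under the `pure_sklearn` or `pure_fasttext` top-level packages; `is_pure_predict` is a hypothetical helper, not part of the package.

```python
def is_pure_predict(obj):
    """Return True if obj's class is defined in a pure-predict subpackage."""
    # type(obj).__module__ is e.g. "pure_sklearn.ensemble._forest".
    top_level = type(obj).__module__.split(".")[0]
    return top_level in ("pure_sklearn", "pure_fasttext")
```

Unlike a substring test on `str(clf)`, this does not depend on how the object prints itself, so a custom `__repr__` or a path that happens to contain "pure" would not affect the result.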

    enhancement 
    opened by JensBlack 1
  • [Question] Cythonize the package for faster speed?

    This is already quite a fast package. Can we go further and move all the pure Python code to Cython? I am new to Cython and I am not sure whether that would increase the speed. What do you think?

    opened by greenspray9 1
  • Question about pure predict

    Hi, I have a question about pure-predict. I am still new to the field and am looking for a way to speed up prediction in my program (Python 3.7 under Windows). I saved the model in an HDF5 file. The laptop the program runs on has only 8 GB of RAM.

    Call of the predict function:

        ippreds = model.predict(np.expand_dims(ipframe, axis=0))[0]

    For me the prediction takes an average of 13-15 milliseconds:

        [INFO] Start Predict..Sun Apr 5 18:15:12 2020
        [INFO] Finish Predict..Sun Apr 5 18:15:12 2020
        Time Total: 0:00:00.155961

        [INFO] Start Predict..Sun Apr 5 18:15:12 2020
        [INFO] Finish Predict..Sun Apr 5 18:15:12 2020
        Time Total: 0:00:00.152110

        [INFO] Start Predict..Sun Apr 5 18:15:12 2020
        [INFO] Finish Predict..Sun Apr 5 18:15:13 2020
        Time Total: 0:00:00.148105

    Could I improve the result with pure-predict?

    Many thanks in advance Anja

    opened by staebchen0 1
  • Update min numpy version

    Description

    Updates the minimum numpy version for tests

    Motivation and Context

    This is breaking xgboost tests due to failed pandas install

    How Has This Been Tested?

    Travis build passed

    Types of changes

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have added reviewers to the PR.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by denver1117 0
  • Apply formatting linter

    Apply linter

    Applies black linter

    Description

    Applies one-time in-place file linting

    Motivation and Context

    Code style conformation

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have added reviewers to the PR.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by denver1117 0
  • Add downloads status badge

    Description

    Add new status badge for total lifetime downloads

    Motivation and Context

    This is useful to track and display

    How Has This Been Tested?

    Looks good in README

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [ ] I have added reviewers to the PR.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by denver1117 0
  • Drop py35

    Drop py35 Support

    Description

    Drops support for python 3.5

    Motivation and Context

    Getting Travis errors for package compatibility with python 3.5

    How Has This Been Tested?

    Types of changes

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [ ] I have added reviewers to the PR.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by denver1117 0
  • Pin fasttext

    Description

    This PR pins fasttext at or below 0.9.1. Newer versions are failing tests.

    Motivation and Context

    Keeps builds passing until the new version compatibility can be addressed.

    Types of changes

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [ ] I have added reviewers to the PR.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by denver1117 0
  • Add diagram image and blog link to README

    Description

    Adds the diagram image from the blog post and a link to the blog post to the README

    Motivation and Context

    More information and context in the README

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [ ] I have added reviewers to the PR.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by denver1117 0
  • Error when predict with converted model built with CountVectorizer(binary=True)

    Describe the bug An error is raised when making an inference with a converted sklearn model built with CountVectorizer(binary=True). It's ok if binary=False

    To Reproduce

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from pure_sklearn.map import convert_estimator
    
    vectorizer = CountVectorizer(binary=True)
    model = LogisticRegression(random_state=0)
    pipeline = Pipeline([
        ('vect', vectorizer),
        ('clf', model)
    ])
    
    X_train = ['one text', 'two text', 'three text']
    y_train = ['1', '2', '3']
    pipeline.fit(X_train, y_train)
    converted = convert_estimator(pipeline)
    converted.predict(['four'])
    

    It's ok if a vectorizer is created with binary=False.

    Expected behavior There shouldn't be any errors.

    bug 
    opened by phongvis 0
  • Allow sparse input for naive bayes classifier

    I tried converting a pipeline to pure_sklearn. The pipeline consists of a TfidfVectorizer and MultinomialNB. The output of TfidfVectorizer is a sparse array, which is the input to MultinomialNB. However, the naive bayes predict method does not support a sparse array as input (X), as defined in the code below, and thus throws an error.

    https://github.com/Ibotta/pure-predict/blob/c3431b79af4df9794c9f99246fa359a6c72a10ee/pure_sklearn/naive_bayes.py#L25

    Possible solution I'm not sure why the code above is necessary to reject sparse input. However I tried changing to allow sparse and tested it. I don't encounter any issue as the estimator works as expected.

    X = check_array(X, handle_sparse="allow")

    Is this the right way?

    I've created a test method under test_pipeline to test this scenario. I can submit a PR if you want to review.

    My dev environment:

    Package       Version
    fasttext      0.9.2
    numpy         1.21.4
    pandas        1.3.4
    pure-predict  0.0.4
    pytest        6.2.5
    scikit-learn  1.0.1
    scipy         1.7.2

    enhancement 
    opened by cyan198 0
  • Sklearn object having a preprocessor function cannot be converted

    The pure_sklearn.map.convert_estimator function crashes when converting a sklearn object that holds a function-valued parameter (a CountVectorizer with a preprocessor parameter in my case).

    ValueError: Object contains invalid type: <class 'function'>

    To Reproduce

    from pure_sklearn.map import convert_estimator
    from sklearn.feature_extraction.text import CountVectorizer
    
    vectorizer = CountVectorizer(preprocessor=lambda x:x)
    vectorizer.fit(["aaaa", "bbbb", "cccc"])
    
    convert_estimator(vectorizer)
    

    Additional context: The bug can be fixed by adding types.FunctionType to the TYPES tuple in the pure_sklearn.utils module. I suggest replacing:

    TYPES = (int, float, str, bool, type)
    

    with:

    from types import FunctionType
    
    TYPES = (int, float, str, bool, type, FunctionType)
    
    bug 
    opened by ogunoz 1