pure-predict: Machine learning prediction in pure Python
pure-predict speeds up and slims down machine learning prediction applications. It is a foundational tool for serverless inference or small batch prediction with popular machine learning frameworks like scikit-learn and fasttext. It implements the predict methods of these frameworks in pure Python.
Primary Use Cases
The primary use case for pure-predict is the following scenario:
- A model is trained in an environment without strong container footprint constraints. Perhaps a long running "offline" job on one or many machines where installing a number of python packages from PyPI is not at all problematic.
- At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one "record" (one "row" in a numpy array or one string of text to classify) per request or a mini-batch of records per request.
- Preferred infrastructure for the prediction service is either serverless (AWS Lambda) or a container service where the memory footprint of the container is constrained.
- The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (10s to 100s of MBs).
In this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single-record predictions per request, the numpy and scipy functionality in the prediction methods of popular machine learning frameworks works against the application in terms of latency, underperforming pure Python in some cases.
Check out the blog post for more information on the motivation and use cases of pure-predict.
Package Details
pure-predict is a Python package for machine learning prediction distributed under the Apache 2.0 software license. It contains multiple subpackages which mirror their open source counterparts (scikit-learn, fasttext, etc.). Each subpackage has utilities to convert a fitted machine learning model into a custom object containing prediction methods that mirror their native counterparts, but converted to pure Python. Additionally, all relevant model artifacts needed for prediction are converted to pure Python.
A pure-predict model object can then be pickled and later unpickled without any third-party dependencies other than pure-predict.
This eliminates the need to have large dependency packages installed in order to make predictions with fitted machine learning models using popular open source packages for training models. These dependencies (numpy, scipy, scikit-learn, fasttext, etc.) are large in size and not always necessary to make fast and accurate predictions. Additionally, they rely on C extensions that may not be ideal for serverless applications with a Python runtime.
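To make the idea of "prediction methods converted to pure Python" concrete, here is a minimal sketch of what a converted multinomial linear classifier could look like once its coefficients are stored as plain lists. The class name `PureLogisticRegression` and its attributes are hypothetical and chosen for illustration; this is not the actual pure-predict implementation.

```python
import math


class PureLogisticRegression:
    """Hypothetical stand-in for a converted estimator (illustration only)."""

    def __init__(self, coef, intercept, classes):
        # Model artifacts are held as plain Python lists of floats,
        # one row of coefficients and one intercept per class.
        self.coef = [[float(w) for w in row] for row in coef]
        self.intercept = [float(b) for b in intercept]
        self.classes = list(classes)

    def decision_function(self, X):
        # One score per class for each record: dot(coef_row, record) + intercept.
        return [
            [sum(w * x for w, x in zip(row, record)) + b
             for row, b in zip(self.coef, self.intercept)]
            for record in X
        ]

    def predict_proba(self, X):
        # Softmax over the class scores, shifted by the max for numerical stability.
        probs = []
        for scores in self.decision_function(X):
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            probs.append([e / total for e in exps])
        return probs

    def predict(self, X):
        # Pick the class with the highest probability for each record.
        return [
            self.classes[max(range(len(p)), key=p.__getitem__)]
            for p in self.predict_proba(X)
        ]


model = PureLogisticRegression(
    coef=[[1.0, -1.0], [-1.0, 1.0]], intercept=[0.0, 0.0], classes=["a", "b"]
)
print(model.predict([[2.0, 0.5], [0.0, 3.0]]))
```

Because an object like this holds only built-in types, it can be pickled and loaded in an environment that has none of the training-time dependencies installed.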
Quick Start Example
In a Python environment with scikit-learn and its dependencies installed:
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from pure_sklearn.map import convert_estimator
# fit sklearn estimator
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier()
clf.fit(X, y)
# convert to pure python estimator
clf_pure_predict = convert_estimator(clf)
with open("model.pkl", "wb") as f:
    pickle.dump(clf_pure_predict, f)
# make prediction with sklearn estimator
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]
In a Python environment with only pure-predict installed:
import pickle
# load pickled model
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)
# make prediction with pure-predict object
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]
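For the serverless use case described earlier, the loading step above might be wrapped in a Lambda-style handler. The following is a rough sketch under several assumptions: the event carries a JSON string under `"body"` with a `"records"` list, the names `load_model` and `make_handler` are hypothetical helpers, and the model's `predict` returns JSON-serializable values. Only unpickle model files you trust.

```python
import json
import pickle


def load_model(path):
    # Load the pickled pure-predict object once, at cold start.
    with open(path, "rb") as f:
        return pickle.load(f)


def make_handler(model):
    # Build a handler bound to an already-loaded model so the
    # unpickling cost is not paid on every request.
    def handler(event, context):
        payload = json.loads(event["body"])
        records = payload["records"]  # assumed request shape
        predictions = list(model.predict(records))
        return {
            "statusCode": 200,
            "body": json.dumps({"predictions": predictions}),
        }

    return handler


# In a deployment (assumed file name from the example above):
# handler = make_handler(load_model("model.pkl"))
```

Binding the model at module import time keeps per-request work limited to parsing the payload and calling `predict`, which suits the one-record or mini-batch access pattern.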
Subpackages
pure_sklearn
Prediction in pure Python for a subset of scikit-learn estimators and transformers.
- estimators
  - linear models - supports the majority of linear models for classification
  - trees - decision trees, random forests, gradient boosting and xgboost
  - naive bayes - a number of popular naive bayes classifiers
  - svm - linear SVC
- transformers
  - preprocessing - normalization and onehot/ordinal encoders
  - impute - simple imputation
  - feature extraction - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization
  - pipeline - pipelines and feature unions
Sparse data - supports a custom pure Python sparse data object. Sparse data is handled as would be expected by the relevant transformers and estimators.
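As a rough illustration of why a pure Python sparse representation is workable, the sketch below stores a row as a dictionary from column index to non-zero value and computes a dot product against dense weights. The function names and the dict-based layout are assumptions made for this example; they are not the sparse object that pure_sklearn actually defines.

```python
def to_sparse_row(dense_row):
    # Keep only the non-zero entries, keyed by column index.
    return {j: v for j, v in enumerate(dense_row) if v != 0}


def sparse_dot(sparse_row, weights):
    # Only the stored (non-zero) entries contribute to the dot product.
    return sum(v * weights[j] for j, v in sparse_row.items())


row = to_sparse_row([0, 2.0, 0, 0, 1.5])
print(row)
print(sparse_dot(row, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

For text features such as tfidf vectors, where most columns are zero, work proportional to the number of non-zero entries is what keeps single-record prediction cheap.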
pure_fasttext
Prediction in pure Python for fasttext.
- supervised - predicts labels for supervised models; no support for quantized models (blocked by this issue)
- unsupervised - lookup of word or sentence embeddings given input text
Installation
Dependencies
pure-predict requires:
- Python (>= 3.6)
Dependency Notes
- pure_sklearn has been tested with scikit-learn versions >= 0.20 -- certain functionality may work with lower versions but is not guaranteed. Some functionality is explicitly not supported for certain scikit-learn versions and exceptions will be raised as appropriate.
- xgboost requires version >= 0.82 for support with pure_sklearn.
- pure-predict is not supported with Python 2.
- fasttext versions <= 0.9.1 have been tested.
User Installation
The easiest way to install pure-predict is with pip:
pip install --upgrade pure-predict
You can also download the source code:
git clone https://github.com/Ibotta/pure-predict.git
Testing
With pytest installed, you can run tests locally:
pytest pure-predict
Examples
The package contains examples on how to use pure-predict in practice.
Calls for Contributors
Contributions to pure-predict are welcome from anyone. Specific calls for contribution are as follows:
- Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.
- Adding more pure_sklearn estimators. The scikit-learn package is extensive and only partially covered by pure_sklearn. Regression tasks in particular are missing from pure_sklearn. Clustering, dimensionality reduction, nearest neighbors, feature selection, non-linear SVM, and more are also omitted and would be good candidates for extending pure_sklearn.
- General efficiency. There is likely low-hanging fruit for improving the efficiency of the numpy and scipy functionality that has been ported to pure-predict.
- Threading could be considered to improve performance -- particularly for making predictions with multiple records.
- A public AWS Lambda layer containing pure-predict.
Background
The project was started at Ibotta Inc. on the machine learning team and open sourced in 2020. It is currently maintained by the machine learning team at Ibotta.
Acknowledgements
Thanks to David Mitchell and Andrew Tilley for internal review before open source. Thanks to James Foley for logo artwork.