naked
is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply copy/paste wherever you wish.
This is simpler than deploying an API endpoint or loading a serialized model. The jury is still out on whether this is sane or not. Of course, I'm not the first one to have done this; see sklearn-porter, for instance.
Installation
pip install git+https://github.com/MaxHalford/naked
Examples
sklearn.linear_model.LinearRegression
First, we fit a model.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
lin_reg = LinearRegression()
lin_reg.fit(X, y)
Then, we strip it.
import naked
print(naked.strip(lin_reg))
Which produces the following output.
def linear_regression(x):
    coef_ = [1.0000000000000002, 1.9999999999999991]
    intercept_ = 3.0000000000000018
    return intercept_ + sum(xi * wi for xi, wi in zip(x, coef_))
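That function has no dependency on scikit-learn: you can paste it anywhere and call it on a list of feature values. As a quick sanity check (a sketch, not part of naked's output), you can compare it with the original model:

import numpy as np

# The generated function, pasted verbatim from above
def linear_regression(x):
    coef_ = [1.0000000000000002, 1.9999999999999991]
    intercept_ = 3.0000000000000018
    return intercept_ + sum(xi * wi for xi, wi in zip(x, coef_))

# It should agree with the fitted scikit-learn model from the first snippet
assert np.allclose(linear_regression([3, 5]), lin_reg.predict([[3, 5]])[0])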
sklearn.pipeline.Pipeline
import naked
from sklearn import linear_model
from sklearn import feature_extraction
from sklearn import pipeline
from sklearn import preprocessing
model = pipeline.make_pipeline(
    feature_extraction.text.TfidfVectorizer(),
    preprocessing.Normalizer(),
    linear_model.LogisticRegression(solver='liblinear')
)
docs = ['Sad', 'Angry', 'Happy', 'Joyful']
is_positive = [False, False, True, True]
model.fit(docs, is_positive)
print(naked.strip(model))
This produces the following output.
def tfidf_vectorizer(x):
    lowercase = True
    norm = 'l2'
    vocabulary_ = {'sad': 3, 'angry': 0, 'happy': 1, 'joyful': 2}
    idf_ = [1.916290731874155, 1.916290731874155, 1.916290731874155, 1.916290731874155]
    import re
    if lowercase:
        x = x.lower()
    # Tokenize
    x = re.findall(r"(?u)\b\w\w+\b", x)
    x = [xi for xi in x if len(xi) > 1]
    # Count term frequencies
    from collections import Counter
    tf = Counter(x)
    total = sum(tf.values())
    # Compute the TF-IDF of each tokenized term
    tfidf = [0] * len(vocabulary_)
    for term, freq in tf.items():
        try:
            index = vocabulary_[term]
        except KeyError:
            continue
        tfidf[index] = freq * idf_[index] / total
    # Apply normalization
    if norm == 'l2':
        norm_val = sum(xi ** 2 for xi in tfidf) ** .5
    return [v / norm_val for v in tfidf]
def normalizer(x):
    norm = 'l2'
    if norm == 'l2':
        norm_val = sum(xi ** 2 for xi in x) ** .5
    elif norm == 'l1':
        norm_val = sum(abs(xi) for xi in x)
    elif norm == 'max':
        norm_val = max(abs(xi) for xi in x)
    return [xi / norm_val for xi in x]
def logistic_regression(x):
    coef_ = [[-0.40105811611957726, 0.40105811611957726, 0.40105811611957726, -0.40105811611957726]]
    intercept_ = [0.0]
    import math
    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]
    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]
    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]
def pipeline(x):
    x = tfidf_vectorizer(x)
    x = normalizer(x)
    x = logistic_regression(x)
    return x
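As before, this is plain Python with no scikit-learn dependency. A quick way to check it against the original pipeline (a sketch, assuming model from the snippet above is still in scope) is to exec the generated source and compare probabilities:

import numpy as np

# Execute the generated source so that `pipeline` and its helpers are defined
exec(naked.strip(model))

# The stripped pipeline should agree with scikit-learn's predict_proba
assert np.allclose(pipeline('Happy'), model.predict_proba(['Happy'])[0])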
FAQ
What models are supported?
>>> import naked
>>> print(naked.AVAILABLE)
sklearn
    LinearRegression
    LogisticRegression
    Normalizer
    StandardScaler
    TfidfVectorizer
Will this work for all library versions?
Not by design. A release of naked is meant to support a given library from a particular version upwards. If naked turns out not to work with a newer version of a library, a new version of naked should be released to handle it. You may refer to the pyproject.toml file to see which library versions are supported.
How can I trust this is correct?
This package is really easy to unit test. One simply has to compare the outputs of the model with its "naked" version and check that the outputs are identical. Check out the test_naked.py
file if you're curious.
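For instance, a test along these lines (a sketch, not the actual contents of test_naked.py) captures the idea:

import numpy as np
from sklearn.linear_model import LinearRegression

import naked

def test_linear_regression_matches_sklearn():
    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3
    model = LinearRegression().fit(X, y)

    # Run the generated source in its own namespace to get the stripped function
    namespace = {}
    exec(naked.strip(model), namespace)
    stripped = namespace['linear_regression']

    for x in X:
        assert np.allclose(stripped(list(x)), model.predict([x])[0])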
How should I handle feature names?
Let's take the example of a multi-class logistic regression trained on the wine dataset.
from sklearn import datasets
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing
dataset = datasets.load_wine()
X = dataset.data
y = dataset.target
model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)
model.fit(X, y)
By default, the strip function produces a function that takes as input a list of feature values. Instead, let's say we want to evaluate the function on a dictionary of features, thus associating each feature value with a name.
x = dict(zip(dataset.feature_names, X[0]))
print(x)
{'alcohol': 14.23,
'malic_acid': 1.71,
'ash': 2.43,
'alcalinity_of_ash': 15.6,
'magnesium': 127.0,
'total_phenols': 2.8,
'flavanoids': 3.06,
'nonflavanoid_phenols': 0.28,
'proanthocyanins': 2.29,
'color_intensity': 5.64,
'hue': 1.04,
'od280/od315_of_diluted_wines': 3.92,
'proline': 1065.0}
Passing the feature names to the strip function will add a function that maps the features to a list.
naked.strip(model, input_names=dataset.feature_names)
def handle_input_names(x):
    names = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
    return [x[name] for name in names]
def standard_scaler(x):
    mean_ = [13.000617977528083, 2.336348314606741, 2.3665168539325854, 19.49494382022472, 99.74157303370787, 2.295112359550562, 2.0292696629213474, 0.36185393258426973, 1.5908988764044953, 5.058089882022473, 0.9574494382022468, 2.6116853932584254, 746.8932584269663]
    var_ = [0.6553597304633259, 1.241004080924126, 0.07484180027774268, 11.090030614821362, 202.84332786264366, 0.3894890323191514, 0.9921135115515715, 0.015401619113748266, 0.32575424820098453, 5.344255847629093, 0.05195144969069561, 0.5012544628203511, 98609.60096578706]
    with_mean = True
    with_std = True
    def scale(x, m, v):
        if with_mean:
            x -= m
        if with_std:
            x /= v ** .5
        return x
    return [scale(xi, m, v) for xi, m, v in zip(x, mean_, var_)]
def logistic_regression(x):
    coef_ = [[0.8101347947338147, 0.20382073148760085, 0.47221241678911957, -0.8447843882542064, 0.04952904623674445, 0.21372479616642068, 0.6478750705319883, -0.19982499112990385, 0.13833867563545404, 0.17160966151451867, 0.13090887117218597, 0.7259506896985365, 1.07895948707047], [-1.0103233753629153, -0.44045952703036084, -0.8480739967718842, 0.5835732316278703, -0.09770602368275362, 0.027527982220605866, 0.35399157401383297, 0.21278279386396404, 0.2633610495737497, -1.0412707677956505, 0.6825215991118386, 0.05287634940648419, -1.1407929345327175], [0.20018858062910203, 0.23663879554275832, 0.37586157998276365, 0.26121115662633365, 0.048176977446007865, -0.2412527783870254, -1.0018666445458222, -0.012957802734061021, -0.40169972520920566, 0.8696611062811332, -0.8134304702840255, -0.7788270391050198, 0.061833447462247046]]
    intercept_ = [0.41229358315867787, 0.7048164121833935, -1.1171099953420585]
    import math
    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]
    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]
    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]
def pipeline(x):
    x = handle_input_names(x)
    x = standard_scaler(x)
    x = logistic_regression(x)
    return x
What about output names?
You can also specify the output_names parameter to associate each output value with a name. Of course, this doesn't work for cases where a single value is produced, such as single-target regression.
naked.strip(model, input_names=dataset.feature_names, output_names=dataset.target_names)
def handle_input_names(x):
    names = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
    return [x[name] for name in names]
def standard_scaler(x):
    mean_ = [13.000617977528083, 2.336348314606741, 2.3665168539325854, 19.49494382022472, 99.74157303370787, 2.295112359550562, 2.0292696629213474, 0.36185393258426973, 1.5908988764044953, 5.058089882022473, 0.9574494382022468, 2.6116853932584254, 746.8932584269663]
    var_ = [0.6553597304633259, 1.241004080924126, 0.07484180027774268, 11.090030614821362, 202.84332786264366, 0.3894890323191514, 0.9921135115515715, 0.015401619113748266, 0.32575424820098453, 5.344255847629093, 0.05195144969069561, 0.5012544628203511, 98609.60096578706]
    with_mean = True
    with_std = True
    def scale(x, m, v):
        if with_mean:
            x -= m
        if with_std:
            x /= v ** .5
        return x
    return [scale(xi, m, v) for xi, m, v in zip(x, mean_, var_)]
def logistic_regression(x):
    coef_ = [[0.8101347947338147, 0.20382073148760085, 0.47221241678911957, -0.8447843882542064, 0.04952904623674445, 0.21372479616642068, 0.6478750705319883, -0.19982499112990385, 0.13833867563545404, 0.17160966151451867, 0.13090887117218597, 0.7259506896985365, 1.07895948707047], [-1.0103233753629153, -0.44045952703036084, -0.8480739967718842, 0.5835732316278703, -0.09770602368275362, 0.027527982220605866, 0.35399157401383297, 0.21278279386396404, 0.2633610495737497, -1.0412707677956505, 0.6825215991118386, 0.05287634940648419, -1.1407929345327175], [0.20018858062910203, 0.23663879554275832, 0.37586157998276365, 0.26121115662633365, 0.048176977446007865, -0.2412527783870254, -1.0018666445458222, -0.012957802734061021, -0.40169972520920566, 0.8696611062811332, -0.8134304702840255, -0.7788270391050198, 0.061833447462247046]]
    intercept_ = [0.41229358315867787, 0.7048164121833935, -1.1171099953420585]
    import math
    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]
    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]
    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]
def handle_output_names(x):
    names = ['class_0', 'class_1', 'class_2']
    return dict(zip(names, x))
def pipeline(x):
    x = handle_input_names(x)
    x = standard_scaler(x)
    x = logistic_regression(x)
    x = handle_output_names(x)
    return x
As you can see, by specifying input_names as well as output_names, we obtain a pipeline of functions which takes a dictionary as input and produces a dictionary as output.
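As a quick illustration (a sketch, assuming model and dataset from the snippets above are still in scope), the stripped pipeline can be exec-ed and evaluated on the feature dictionary:

# Execute the generated source so that `pipeline` and its helpers are defined
exec(naked.strip(model, input_names=dataset.feature_names, output_names=dataset.target_names))

x = dict(zip(dataset.feature_names, dataset.data[0]))
print(pipeline(x))  # a dict mapping each class name to its predicted probability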
Development workflow
git clone https://github.com/MaxHalford/naked
cd naked
poetry install
poetry shell
pytest
Things to do
- Implement more models. For instance, it should be quite straightforward to support LightGBM.
- Remove useless branching conditions. Parameters are currently handled via if statements. Ideally, the if statements would be removed so that only the code that will actually run is kept.
License
MIT