naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

Related tags

Deep Learning naked
Overview

naked

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply copy/paste wherever you wish.

This is simpler than deploying an API endpoint or loading a serialized model. The jury is still out on whether this is sane or not. Of course I'm not the first one to have done this, for instance see sklearn-porter.

Installation

pip install git+https://github.com/MaxHalford/naked

Examples

sklearn.linear_model.LinearRegression

First, we fit a model.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
lin_reg = LinearRegression().fit(X, y)
lin_reg.fit(X, y)

Then, we strip it.

import naked

print(naked.strip(lin_reg))

Which produces the following output.

def linear_regression(x):

    coef_ = [1.0000000000000002, 1.9999999999999991]
    intercept_ = 3.0000000000000018

    return intercept_ + sum(xi * wi for xi, wi in enumerate(coef_))

sklearn.pipeline.Pipeline

import naked
from sklearn import linear_model
from sklearn import feature_extraction
from sklearn import pipeline
from sklearn import preprocessing

model = pipeline.make_pipeline(
    feature_extraction.text.TfidfVectorizer(),
    preprocessing.Normalizer(),
    linear_model.LogisticRegression(solver='liblinear')
)

docs = ['Sad', 'Angry', 'Happy', 'Joyful']
is_positive = [False, False, True, True]

model.fit(docs, is_positive)

print(naked.strip(model))

This produces the following output.

def tfidf_vectorizer(x):

    lowercase = True
    norm = 'l2'
    vocabulary_ = {'sad': 3, 'angry': 0, 'happy': 1, 'joyful': 2}
    idf_ = [1.916290731874155, 1.916290731874155, 1.916290731874155, 1.916290731874155]

    import re

    if lowercase:
        x = x.lower()

    # Tokenize
    x = re.findall(r"(?u)\b\w\w+\b", x)
    x = [xi for xi in x if len(xi) > 1]

    # Count term frequencies
    from collections import Counter
    tf = Counter(x)
    total = sum(tf.values())

    # Compute the TF-IDF of each tokenized term
    tfidf = [0] * len(vocabulary_)
    for term, freq in tf.items():
        try:
            index = vocabulary_[term]
        except KeyError:
            continue
        tfidf[index] = freq * idf_[index] / total

    # Apply normalization
    if norm == 'l2':
        norm_val = sum(xi ** 2 for xi in tfidf) ** .5

    return [v / norm_val for v in tfidf]

def normalizer(x):

    norm = 'l2'

    if norm == 'l2':
        norm_val = sum(xi ** 2 for xi in x) ** .5
    elif norm == 'l1':
        norm_val = sum(abs(xi) for xi in x)
    elif norm == 'max':
        norm_val = max(abs(xi) for xi in x)

    return [xi / norm_val for xi in x]

def logistic_regression(x):

    coef_ = [[-0.40105811611957726, 0.40105811611957726, 0.40105811611957726, -0.40105811611957726]]
    intercept_ = [0.0]

    import math

    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]

    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]

    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]

def pipeline(x):
    x = tfidf_vectorizer(x)
    x = normalizer(x)
    x = logistic_regression(x)
    return x

FAQ

What models are supported?

>>> import naked
>>> print(naked.AVAILABLE)
sklearn
    LinearRegression
    LogisticRegression
    Normalizer
    StandardScaler
    TfidfVectorizer

Will this work for all library versions?

Not by design. A release of naked is intended to support a library above a particular version. If we notice that naked doesn't work for a newer version of a given library, then a new version of naked should be released to handle said library version. You may refer to the pyproject.toml file to view library support.

How can I trust this is correct?

This package is really easy to unit test. One simply has to compare the outputs of the model with its "naked" version and check that the outputs are identical. Check out the test_naked.py file if you're curious.

How should I handle feature names?

Let's take the example of a multi-class logistic regression trained on the wine dataset.

from sklearn import datasets
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

dataset = datasets.load_wine()
X = dataset.data
y = dataset.target
model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)
model.fit(X, y)

By default, the strip function produces a function that takes as input a list of feature values. Instead, let's say we want to evaluate the function on a dictionary of features, thus associating each feature value with a name.

x = dict(zip(dataset.feature_names, X[0]))
print(x)
{'alcohol': 14.23,
 'malic_acid': 1.71,
 'ash': 2.43,
 'alcalinity_of_ash': 15.6,
 'magnesium': 127.0,
 'total_phenols': 2.8,
 'flavanoids': 3.06,
 'nonflavanoid_phenols': 0.28,
 'proanthocyanins': 2.29,
 'color_intensity': 5.64,
 'hue': 1.04,
 'od280/od315_of_diluted_wines': 3.92,
 'proline': 1065.0}

Passing the feature names to the strip function will add a function that maps the features to a list.

naked.strip(model, input_names=dataset.feature_names)
def handle_input_names(x):
    names = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
    return [x[name] for name in names]

def standard_scaler(x):

    mean_ = [13.000617977528083, 2.336348314606741, 2.3665168539325854, 19.49494382022472, 99.74157303370787, 2.295112359550562, 2.0292696629213474, 0.36185393258426973, 1.5908988764044953, 5.058089882022473, 0.9574494382022468, 2.6116853932584254, 746.8932584269663]
    var_ = [0.6553597304633259, 1.241004080924126, 0.07484180027774268, 11.090030614821362, 202.84332786264366, 0.3894890323191514, 0.9921135115515715, 0.015401619113748266, 0.32575424820098453, 5.344255847629093, 0.05195144969069561, 0.5012544628203511, 98609.60096578706]
    with_mean = True
    with_std = True

    def scale(x, m, v):
        if with_mean:
            x -= m
        if with_std:
            x /= v ** .5
        return x

    return [scale(xi, m, v) for xi, m, v in zip(x, mean_, var_)]

def logistic_regression(x):

    coef_ = [[0.8101347947338147, 0.20382073148760085, 0.47221241678911957, -0.8447843882542064, 0.04952904623674445, 0.21372479616642068, 0.6478750705319883, -0.19982499112990385, 0.13833867563545404, 0.17160966151451867, 0.13090887117218597, 0.7259506896985365, 1.07895948707047], [-1.0103233753629153, -0.44045952703036084, -0.8480739967718842, 0.5835732316278703, -0.09770602368275362, 0.027527982220605866, 0.35399157401383297, 0.21278279386396404, 0.2633610495737497, -1.0412707677956505, 0.6825215991118386, 0.05287634940648419, -1.1407929345327175], [0.20018858062910203, 0.23663879554275832, 0.37586157998276365, 0.26121115662633365, 0.048176977446007865, -0.2412527783870254, -1.0018666445458222, -0.012957802734061021, -0.40169972520920566, 0.8696611062811332, -0.8134304702840255, -0.7788270391050198, 0.061833447462247046]]
    intercept_ = [0.41229358315867787, 0.7048164121833935, -1.1171099953420585]

    import math

    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]

    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]

    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]

def pipeline(x):
    x = handle_input_names(x)
    x = standard_scaler(x)
    x = logistic_regression(x)
    return x

What about output names?

You can also specify the output_names parameter to associate each output value with a name. Of course, this doesn't work for cases where a single value is produced, such as single-target regression.

naked.strip(model, input_names=dataset.feature_names, output_names=dataset.target_names)
def handle_input_names(x):
    names = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
    return [x[name] for name in names]

def standard_scaler(x):

    mean_ = [13.000617977528083, 2.336348314606741, 2.3665168539325854, 19.49494382022472, 99.74157303370787, 2.295112359550562, 2.0292696629213474, 0.36185393258426973, 1.5908988764044953, 5.058089882022473, 0.9574494382022468, 2.6116853932584254, 746.8932584269663]
    var_ = [0.6553597304633259, 1.241004080924126, 0.07484180027774268, 11.090030614821362, 202.84332786264366, 0.3894890323191514, 0.9921135115515715, 0.015401619113748266, 0.32575424820098453, 5.344255847629093, 0.05195144969069561, 0.5012544628203511, 98609.60096578706]
    with_mean = True
    with_std = True

    def scale(x, m, v):
        if with_mean:
            x -= m
        if with_std:
            x /= v ** .5
        return x

    return [scale(xi, m, v) for xi, m, v in zip(x, mean_, var_)]

def logistic_regression(x):

    coef_ = [[0.8101347947338147, 0.20382073148760085, 0.47221241678911957, -0.8447843882542064, 0.04952904623674445, 0.21372479616642068, 0.6478750705319883, -0.19982499112990385, 0.13833867563545404, 0.17160966151451867, 0.13090887117218597, 0.7259506896985365, 1.07895948707047], [-1.0103233753629153, -0.44045952703036084, -0.8480739967718842, 0.5835732316278703, -0.09770602368275362, 0.027527982220605866, 0.35399157401383297, 0.21278279386396404, 0.2633610495737497, -1.0412707677956505, 0.6825215991118386, 0.05287634940648419, -1.1407929345327175], [0.20018858062910203, 0.23663879554275832, 0.37586157998276365, 0.26121115662633365, 0.048176977446007865, -0.2412527783870254, -1.0018666445458222, -0.012957802734061021, -0.40169972520920566, 0.8696611062811332, -0.8134304702840255, -0.7788270391050198, 0.061833447462247046]]
    intercept_ = [0.41229358315867787, 0.7048164121833935, -1.1171099953420585]

    import math

    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]

    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]

    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]

def handle_output_names(x):
    names = ['class_0' 'class_1' 'class_2']
    return dict(zip(names, x))

def pipeline(x):
    x = handle_input_names(x)
    x = standard_scaler(x)
    x = logistic_regression(x)
    x = handle_output_names(x)
    return x

As you can see, by specifying input_names as well as output_names, we obtain a pipeline of functions which takes as input a dictionary and produces a dictionary.

Development workflow

git clone https://github.com/MaxHalford/naked
cd naked
poetry install
poetry shell
pytest

Things to do

  • Implement more models. For instance it should quite straightforward to support LightGBM.
  • Remove useless branching conditions. Parameters are currently handled via if statements. Ideally it would be nice to remove the if statements and only keep the code that will actually run.

License

MIT

You might also like...
The Pytorch code of
The Pytorch code of "Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification", CVPR 2022 (Oral).

DeepBDC for few-shot learning        Introduction In this repo, we provide the implementation of the following paper: "Joint Distribution Matters: Dee

A python-image-classification web application project, written in Python and served through the Flask Microframework. This Project implements the VGG16 covolutional neural network, through Keras and Tensorflow wrappers, to make predictions on uploaded images. Nb workflows - A workflow platform which allows you to run parameterized notebooks programmatically
Nb workflows - A workflow platform which allows you to run parameterized notebooks programmatically

NB Workflows Description If SQL is a lingua franca for querying data, Jupyter sh

Using python and scikit-learn to make stock predictions

MachineLearningStocks in python: a starter project and guide EDIT as of Feb 2021: MachineLearningStocks is no longer actively maintained MachineLearni

Neural Ensemble Search for Performant and Calibrated Predictions
Neural Ensemble Search for Performant and Calibrated Predictions

Neural Ensemble Search Introduction This repo contains the code accompanying the paper: Neural Ensemble Search for Performant and Calibrated Predictio

Implementation of the bachelor's thesis "Real-time stock predictions with deep learning and news scraping".

Real-time stock predictions with deep learning and news scraping This repository contains a partial implementation of my bachelor's thesis "Real-time

Systematic generalisation with group invariant predictions

Requirements are Python 3, TensorFlow v1.14, Numpy, Scipy, Scikit-Learn, Matplotlib, Pillow, Scikit-Image, h5py, tqdm. Experiments were run on V100 GPUs (16 and 32GB).

🔮 Execution time predictions for deep neural network training iterations across different GPUs.

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training Habitat is a tool that predicts a deep neural network's

Framework that uses artificial intelligence applied to mathematical models to make predictions
Framework that uses artificial intelligence applied to mathematical models to make predictions

LiconIA Framework that uses artificial intelligence applied to mathematical models to make predictions Interface Overview Table of contents [TOC] 1 Ar

Owner
Max Halford
Data wizard @alan-eu. PhD in machine learning applied to query optimization. Kaggle competitions Master. Online machine learning nut.
Max Halford
This repository implements and evaluates convolutional networks on the Möbius strip as toy model instantiations of Coordinate Independent Convolutional Networks.

Orientation independent Möbius CNNs This repository implements and evaluates convolutional networks on the Möbius strip as toy model instantiations of

Maurice Weiler 59 Dec 9, 2022
ReConsider is a re-ranking model that re-ranks the top-K (passage, answer-span) predictions of an Open-Domain QA Model like DPR (Karpukhin et al., 2020).

ReConsider ReConsider is a re-ranking model that re-ranks the top-K (passage, answer-span) predictions of an Open-Domain QA Model like DPR (Karpukhin

Facebook Research 47 Jul 26, 2022
Stroke-predictions-ml-model - Machine learning model to predict individuals chances of having a stroke

stroke-predictions-ml-model machine learning model to predict individuals chance

Alex Volchek 1 Jan 3, 2022
Monify: an Expense tracker Program implemented in a Graphical User Interface that allows users to keep track of their expenses

?? MONIFY (EXPENSE TRACKER PRO) ?? Description Monify is an Expense tracker Program implemented in a Graphical User Interface allows users to add inco

Moyosore Weke 1 Dec 14, 2021
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

null 49 Nov 23, 2022
Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

[TCSVT] Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization LPN [Paper] NEWs Prerequisites Python 3.6 GPU Memory >= 8G Numpy > 1.

null 46 Dec 14, 2022
Normalization Matters in Weakly Supervised Object Localization (ICCV 2021)

Normalization Matters in Weakly Supervised Object Localization (ICCV 2021) 99% of the code in this repository originates from this link. ICCV 2021 pap

Jeesoo Kim 10 Feb 1, 2022
Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Duong H. Le 18 Jun 13, 2022
Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

2D-TAN (Optimized) Introduction This is an optimized re-implementation repository for AAAI'2020 paper: Learning 2D Temporal Localization Networks for

Joya Chen 112 Dec 31, 2022