icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Overview

icepickle

It's a cooler way to store simple linear models.

The goal of icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models. Not only is this much safer, but it also allows for an interesting finetuning pattern that does not require a GPU.

Installation

You can install everything with pip:

python -m pip install icepickle

Usage

Let's say that you've gotten a linear model from scikit-learn trained on a dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

clf = LogisticRegression()
clf.fit(X, y)

Then you could use a pickle to save the model.

from joblib import dump, load

# You can save the classifier.
dump(clf, 'classifier.joblib')

# You can load it too.
clf_reloaded = load('classifier.joblib')

But this is unsafe. The scikit-learn documentations even warns about the security concerns and compatibility issues. The goal of this package is to offer a safe alternative to pickling for simple linear models. The coefficients will be saved in a .h5 file and can be loaded into a new regression model later.

from icepickle.linear_model import save_coefficients, load_coefficients

# You can save the classifier.
save_coefficients(clf, 'classifier.h5')

# You can create a new model, with new hyperparams.
clf_reloaded = LogisticRegression()

# Load the previously trained weights in.
load_coefficients(clf_reloaded, 'classifier.h5')

This is a lot safer and there's plenty of use-cases that could be handled this way.

There's a cool finetuning-trick we can do now too!

Finetuning

Assuming that you use a stateless featurizer in your pipeline, such as HashingVectorizer or language models from whatlies, you choose to pre-train your scikit-learn model beforehand and fine-tune it later using models that offer the .partial_fit()-api. If you're unfamiliar with this api, you might appreciate this course on calmcode.

This library also comes with utilities that makes it easier to finetune systems via the .partial_fit() API. In particular we offer partial pipeline components via the icepickle.pipeline submodule.

import pandas as pd
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.feature_extraction.text import HashingVectorizer

from icepickle.linear_model import save_coefficients, load_coefficients
from icepickle.pipeline import make_partial_pipeline

url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Train a pre-trained model.
pretrained = LogisticRegression()
pipe = make_partial_pipeline(HashingVectorizer(), pretrained)
pipe.fit(X, y)

# Save the coefficients, safely.
save_coefficients(pretrained, 'pretrained.h5')

# Create a new model using pre-trained weights.
finetuned = SGDClassifier()
load_coefficients(finetuned, 'pretrained.h5')
new_pipe = make_partial_pipeline(HashingVectorizer(), finetuned)

# This new model can be used for fine-tuning.
for i in range(10):
    # Inside this for-loop you could consider doing data-augmentation.
    new_pipe.partial_fit(X, y)
Supported Pipeline Parts

The following pipeline components are added.

from icepickle.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)

These tools allow you to declare pipelines that support .partial_fit. Note that components used in these pipelines all need to have .partial_fit() implemented.

Supported Scikit-Learn Models

We unit test against the following models in our save_coefficients and load_coefficients functions.

from sklearn.linear_model import (
    SGDClassifier,
    SGDRegressor,
    LinearRegression,
    LogisticRegression,
    PassiveAggressiveClassifier,
    PassiveAggressiveRegressor,
)
You might also like...
A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

Automated Machine Learning with scikit-learn

auto-sklearn auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. Find the documentation here

Relevance Vector Machine implementation using the scikit-learn API.

scikit-rvm scikit-rvm is a Python module implementing the Relevance Vector Machine (RVM) machine learning technique using the scikit-learn API. Quicks

Distributed scikit-learn meta-estimators in PySpark
Distributed scikit-learn meta-estimators in PySpark

sk-dist: Distributed scikit-learn meta-estimators in PySpark What is it? sk-dist is a Python package for machine learning built on top of scikit-learn

Scikit-Learn useful pre-defined Pipelines Hub
Scikit-Learn useful pre-defined Pipelines Hub

Scikit-Pipes Scikit-Learn useful pre-defined Pipelines Hub Usage: Install scikit-pipes It's advised to install sklearn-genetic using a virtual env, in

Predicting Baseball Metric Clusters: Clustering Application in Python Using scikit-learn
Predicting Baseball Metric Clusters: Clustering Application in Python Using scikit-learn

Clustering Clustering Application in Python Using scikit-learn This repository contains the prediction of baseball metric clusters using MLB Statcast

Painless Machine Learning for python based on scikit-learn

PlainML Painless Machine Learning Library for python based on scikit-learn. Install pip install plainml Example from plainml import KnnModel, load_ir

MIT-Machine Learning with Python–From Linear Models to Deep Learning

MIT-Machine Learning with Python–From Linear Models to Deep Learning | One of the 5 courses in MIT MicroMasters in Statistics & Data Science Welcome t

Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

Comments
  • [QUESTION] Finetuning on different model

    [QUESTION] Finetuning on different model

    Hello, I'm confused about this part of the code explaining fine-tuning on new model. https://github.com/koaning/icepickle/blob/3472aa4bf8071b62755737d3f446d1315d459474/README.md?plain=1#L102-L112 Looking at this code example, you're doing pretraining on LogisticRegression model and then load the coefficient on SGDClassifier, I don't understand the meaning of this process. Do you somehow adapt the coefficient to SGDClassifier, or do you create a LogisticRegression behind the SGDClassifier and just train the SGDClassifier with the new dataset using partial_fit?

    opened by banditelol 6
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn PySpark + Scikit-learn = Sparkit-learn GitHub: https://github.com/lensacom/sparkit-learn About Sparkit-learn aims to provide scikit-lear

Lensa 1.1k Jan 4, 2023
Scikit learn library models to account for data and concept drift.

liquid_scikit_learn Scikit learn library models to account for data and concept drift. This python library focuses on solving data drift and concept d

null 7 Nov 18, 2021
Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models

Highly interpretable, sklearn-compatible classifier based on decision rules This is a scikit-learn compatible wrapper for the Bayesian Rule List class

Tamas Madl 482 Nov 19, 2022
Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Prophet: Automatic Forecasting Procedure Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends ar

Facebook 15.4k Jan 7, 2023
To design and implement the Identification of Iris Flower species using machine learning using Python and the tool Scikit-Learn.

To design and implement the Identification of Iris Flower species using machine learning using Python and the tool Scikit-Learn.

Astitva Veer Garg 1 Jan 11, 2022
Iris species predictor app is used to classify iris species created using python's scikit-learn, fastapi, numpy and joblib packages.

Iris Species Predictor Iris species predictor app is used to classify iris species using their sepal length, sepal width, petal length and petal width

Siva Prakash 5 Apr 5, 2022
A collection of Scikit-Learn compatible time series transformers and tools.

tsfeast A collection of Scikit-Learn compatible time series transformers and tools. Installation Create a virtual environment and install: From PyPi p

Chris Santiago 0 Mar 30, 2022
Penguins species predictor app is used to classify penguins species created using python's scikit-learn, fastapi, numpy and joblib packages.

Penguins Classification App Penguins species predictor app is used to classify penguins species using their island, sex, bill length (mm), bill depth

Siva Prakash 3 Apr 5, 2022
Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets

Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets Datasets Used: Iris dataset,

Samrat Mitra 2 Nov 18, 2021
K-Means clusternig example with Python and Scikit-learn

Unsupervised-Machine-Learning Flat Clustering K-Means clusternig example with Python and Scikit-learn Flat clustering Clustering algorithms group a se

Emin 1 Dec 13, 2021