fastshap: A fast, approximate shap kernel
fastshap was designed to be:
- Fast: Calculating shap values can take an extremely long time. fastshap uses inner and outer batch assignments to keep the calculations inside vectorized operations as much as possible.
- Used on Tabular Data: Accepts numpy arrays or pandas DataFrames, and handles categorical variables natively. At the moment, only 1-dimensional outputs are accepted.
WARNING: This package specifically offers a kernel explainer, which can calculate approximate shap values of f(X) towards y for any function f. Much faster shap solutions are available specifically for gradient boosted trees.
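For example, a minimal sketch (the diabetes dataset and Ridge model here are arbitrary stand-ins, not part of fastshap) of explaining a prediction function from a different library:
# A sketch: any callable mapping a feature matrix to a 1-dimensional
# prediction array can be explained. Model choice is arbitrary.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
import fastshap
X, y = load_diabetes(as_frame=True, return_X_y=True)
f = Ridge().fit(X, y).predict
ke_sk = fastshap.KernelExplainer(f, X)
sv_sk = ke_sk.calculate_shap_values(X, verbose=False)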
Installation
This package can be installed using either pip or conda, through conda-forge:
# Using pip
$ pip install fastshap --no-cache-dir
You can also download the latest development version from this repository. If you want to install from GitHub with conda, you must first run conda install pip git.
$ pip install git+https://github.com/AnotherSamWilson/fastshap.git
Basic Usage
We will use the iris dataset for this example. Here, we load the data and train a simple lightgbm model on the dataset:
from sklearn.datasets import load_iris
import pandas as pd
import lightgbm as lgb
import numpy as np
# Define our dataset and target variable
data = pd.concat(load_iris(as_frame=True, return_X_y=True), axis=1)
data.rename({"target": "species"}, inplace=True, axis=1)
data["species"] = data["species"].astype("category")
target = data.pop("sepal length (cm)")
# Train our model
dtrain = lgb.Dataset(data=data, label=target)
lgbmodel = lgb.train(
    params={"seed": 1, "verbose": -1},
    train_set=dtrain,
    num_boost_round=10
)
# Define the function we wish to build shap values for.
model = lgbmodel.predict
preds = model(data)
We now have a model which takes a Pandas DataFrame and returns predictions. We can create an explainer that will use data as a background dataset to calculate the shap values of any dataset we wish:
import fastshap
ke = fastshap.KernelExplainer(model, data)
sv = ke.calculate_shap_values(data, verbose=False)
print(all(preds == sv.sum(1)))
## True
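The shap values come back with one column per feature plus one extra column; judging from the shapes and outputs shown in this README, that final column holds the expected value of the model over the background set, which is why each row sums to the corresponding prediction. A quick check under that assumption:
# A quick sanity check (assumes the last column stores the expected value
# of the model output over the background set).
print(sv.shape)                  # (150, 5) -> 4 feature columns + 1 expected value
print(sv[0, -1], preds.mean())   # should be (approximately) equal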
Stratifying the Background Set
We can select a subset of our data to act as a background set. By stratifying the background set on the results of the model output, we will usually get very similar results, while decreasing the calculation time drastically.
ke.stratify_background_set(5)
sv2 = ke.calculate_shap_values(
    data,
    background_fold_to_use=0,
    verbose=False
)
print(np.abs(sv2 - sv).mean(0))
## [1.74764532e-03 1.61829094e-02 1.99534408e-03 4.02640884e-16
## 1.71084747e-02]
What we did was break up our background set into 5 different sets, stratified by the model output. We then used the first of these sets as our background set, and compared the average absolute difference between these shap values and the shap values we obtained from using the entire dataset.
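To gauge how sensitive the results are to the particular fold chosen, a short sketch (not part of the example above) can loop over all 5 folds created earlier and compare each against the full-background values:
# A sketch: compare every stratified fold against the full-background values.
# Assumes the 5 folds created by ke.stratify_background_set(5) above.
for fold in range(5):
    sv_fold = ke.calculate_shap_values(
        data,
        background_fold_to_use=fold,
        verbose=False
    )
    print(fold, np.abs(sv_fold - sv).mean())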
Choosing Batch Sizes
If the entire process were vectorized, it would require an array of size (# Samples * # Coalitions * # Background samples, # Columns), where # Coalitions is the total number of coalitions that will be run. Even for small datasets, this becomes enormous. fastshap breaks this array up into chunks by splitting the process into a series of batches.
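As a rough back-of-the-envelope illustration of that growth (a sketch; the coalition count assumes the n_coalition_sizes=3 setting used later in this README):
# Rough size of the fully vectorized array for the iris example (numbers assumed).
from math import comb
n_samples = 150       # rows we want to explain
n_background = 150    # background rows
n_columns = 4         # features
n_coalitions = sum(comb(n_columns, k) for k in range(1, 4))   # sizes 1-3 -> 14
print(n_samples * n_coalitions * n_background * n_columns)    # 1260000 elements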
This is a list of the large arrays and their maximum size:
- Global
  - Mask Matrix (# Coalitions, # Columns) dtype = int8
- Outer Batch
  - Linear Targets (Total Coalition Combinations, Outer Batch Size) dtype = adaptive
- Inner Batch
  - Model Evaluation Features (Inner Batch Size * # background samples, # Columns) dtype = adaptive
The adaptive datatypes of the arrays above will be matched to the data type of the model output. Therefore, if your model returns float32, these arrays will be stored as float32. The final, returned shap values will also use the datatype returned by the model.
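If memory is tight and your use case tolerates lower precision, one option (a sketch, not something fastshap requires) is to have the prediction function cast its output to float32, so the adaptive arrays take half the space of lightgbm's default float64 output:
# A sketch: casting predictions to float32 makes the adaptive arrays float32 too.
def model_32(X):
    return lgbmodel.predict(X).astype(np.float32)
ke_32 = fastshap.KernelExplainer(model_32, data)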
These theoretical sizes can be calculated directly so that the user can determine appropriate batch sizes for their machine:
# Combines our background data back into 1 DataFrame
ke.stratify_background_set(1)
(
    mask_matrix_size,
    linear_target_size,
    inner_model_eval_set_size
) = ke.get_theoretical_array_expansion_sizes(
    outer_batch_size=150,
    inner_batch_size=150,
    n_coalition_sizes=3,
    background_fold_to_use=None,
)
print(
    np.prod(linear_target_size) + np.prod(inner_model_eval_set_size)
)
## 92100
For the iris dataset, even if we sent the entire set (150 rows) through as one batch, we only need to hold 92100 array elements at once. This is manageable on most machines. However, this number grows extremely quickly with the number of samples and columns. It is highly advised to determine a good batch scheme before running this process.
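One practical way to do that (a sketch; the element budget below is an arbitrary assumption) is to scan a few candidate batch sizes and keep the largest whose theoretical expansion fits your budget:
# A sketch: pick the largest batch size whose theoretical expansion fits a budget.
max_elements = 10_000_000   # arbitrary budget for illustration
for batch_size in (10, 50, 100, 150):
    _, linear, inner = ke.get_theoretical_array_expansion_sizes(
        outer_batch_size=batch_size,
        inner_batch_size=batch_size,
        n_coalition_sizes=3,
        background_fold_to_use=None,
    )
    total = np.prod(linear) + np.prod(inner)
    print(batch_size, total, total <= max_elements)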
Specifying a Custom Linear Model
Any linear model available from sklearn.linear_model can be used to calculate the shap values. If you wish for some sparsity in the shap values, you can use Lasso regression:
from sklearn.linear_model import Lasso
# Use our entire background set
ke.stratify_background_set(1)
sv_lasso = ke.calculate_shap_values(
    data,
    background_fold_to_use=0,
    linear_model=Lasso(alpha=0.1),
    verbose=False
)
print(sv_lasso[0,:])
## [-0. -0.33797832 -0. -0.14634971 5.84333333]
The default model used is sklearn.linear_model.LinearRegression.
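Any other estimator from sklearn.linear_model should also work the same way; for instance, a sketch using Ridge regression (the alpha value is arbitrary):
# A sketch with a different sklearn linear model (alpha chosen arbitrarily).
from sklearn.linear_model import Ridge
sv_ridge = ke.calculate_shap_values(
    data,
    background_fold_to_use=0,
    linear_model=Ridge(alpha=1.0),
    verbose=False
)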