dtreeviz : Decision Tree Visualization
Description
A python library for decision tree visualization and model interpretation. Currently supports scikit-learn, XGBoost, Spark MLlib, and LightGBM trees. With version 1.3, we now provide one- and two-dimensional feature space illustrations for classifiers (any model that can answer predict_proba()); see below.
Authors:
- Terence Parr, a professor in the University of San Francisco's data science program
- Tudor Lapusan
- Prince Grover
See How to visualize decision trees for deeper discussion of our decision tree visualization library and the visual design decisions we made.
Feedback
We welcome feedback from users on how they use dtreeviz, what features they'd like, etc., via email (to parrt) or via an issue.
Quick start
Jump right into the examples using this Colab notebook
Take a look in notebooks! Here we have a specific notebook for each supported ML library, and more.
Discussion
Decision trees are the fundamental building block of gradient boosting machines and Random Forests(tm), probably the two most popular machine learning models for structured data. Visualizing decision trees is a tremendous aid when learning how these models work and when interpreting models. Unfortunately, current visualization packages are rudimentary and not immediately helpful to the novice. For example, we couldn't find a library that visualizes how decision nodes split up the feature space. It is also uncommon for libraries to support visualizing a specific feature vector as it weaves down through a tree's decision nodes; we could only find one image showing this.
So, we've created a general package for decision tree visualization and model interpretation, which we'll be using heavily in an upcoming machine learning book (written with Jeremy Howard).
The visualizations are inspired by an educational animation by R2D3: A visual introduction to machine learning. With dtreeviz, you can visualize how the feature space is split up at decision nodes, how the training samples get distributed in leaf nodes, how the tree makes predictions for a specific observation, and more. These operations are critical for understanding how classification or regression decision trees work. If you're not familiar with decision trees, check out fast.ai's Introduction to Machine Learning for Coders MOOC.
Install
Install anaconda3 on your system, if not already done.
Verify that you do not have any conda-installed graphviz-related packages, because dtreeviz needs the pip versions; you can remove them from conda space with:
conda uninstall python-graphviz
conda uninstall graphviz
To install (Python >=3.6 only), do this (from Anaconda Prompt on Windows!):
pip install dtreeviz # install dtreeviz for sklearn
pip install dtreeviz[xgboost] # install XGBoost related dependency
pip install dtreeviz[pyspark] # install pyspark related dependency
pip install dtreeviz[lightgbm] # install LightGBM related dependency
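If you want several of the optional backends, the extras can be combined in one command (standard pip extras syntax):
pip install dtreeviz[xgboost,pyspark,lightgbm]  # all optional dependencies at once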
This should also pull in the graphviz Python library (>=0.9), which we use for platform-specific functionality.
Limitations. Only SVG files can be generated at this time, which reduces dependencies and dramatically simplifies the install process.
Please email Terence with any helpful notes on making dtreeviz work (better) on other platforms. Thanks!
For your specific platform, please see the following subsections.
Mac
Make sure you have the latest Xcode and command-line tools installed. If Xcode is already installed, you can run xcode-select --install from the command line to install the command-line tools. You also have to sign the Xcode license agreement, which you can do with sudo xcodebuild -license from the command line. The brew install shown next needs to build graphviz, so you need Xcode set up properly.
You need the graphviz binary for dot
. Make sure you have the latest version (verified on macOS 10.13 and 10.14):
brew reinstall graphviz
Just to be sure, remove dot
from any anaconda installation, for example:
rm ~/anaconda3/bin/dot
From the command line, this command
dot -Tsvg
should work, in the sense that it just stares at you without giving an error. You can hit control-C to escape back to the shell. Make sure that you are using the right dot
as installed by brew:
$ which dot
/usr/local/bin/dot
$ ls -l $(which dot)
lrwxr-xr-x 1 parrt wheel 33 May 26 11:04 /usr/local/bin/dot@ -> ../Cellar/graphviz/2.40.1/bin/dot
$
Limitations. Jupyter notebook has a bug where it does not show .svg files correctly, but Jupyter Lab has no problem.
Linux (Ubuntu 18.04)
To get the dot
binary do:
sudo apt install graphviz
Limitations. The view()
method works to pop up a new window, and images appear inline for Jupyter notebook but not Jupyter Lab (it gets an error parsing the SVG XML). The notebook images also have a font substitution from the Arial we use, so some text overlaps. Only .svg files can be generated on this platform.
Windows 10
(Make sure to pip install graphviz
, which is common to all platforms, and make sure to do this from Anaconda Prompt on Windows!)
Download graphviz-2.38.msi and update your Path
environment variable. Add C:\Program Files (x86)\Graphviz2.38\bin
to User path and C:\Program Files (x86)\Graphviz2.38\bin\dot.exe
to the System Path. It's Windows, so you might need a reboot after updating that environment variable. You should see this from the Anaconda Prompt:
(base) C:\Users\Terence Parr>where dot
C:\Program Files (x86)\Graphviz2.38\bin\dot.exe
(Do not use conda install -c conda-forge python-graphviz as you get an old version of the graphviz Python library.)
Verify from the Anaconda Prompt that this works (capital -V
not lowercase -v
):
dot -V
If it doesn't work, you have a Path
problem. I found the following test programs useful. The first one sees if Python can find dot
:
import os
import subprocess
proc = subprocess.Popen(['dot','-V'])
print( os.getenv('Path') )
The following version does the same thing, except it uses the graphviz Python library's backend support utilities, which is what we use in dtreeviz:
import graphviz.backend as be
cmd = ["dot", "-V"]
stdout, stderr = be.run(cmd, capture_output=True, check=True, quiet=False)
print( stderr )
If you are having issues with the run command, you can try copying the following files from https://github.com/xflr6/graphviz/tree/master/graphviz and placing them in the AppData\Local\Continuum\anaconda3\Lib\site-packages\graphviz folder. Clean out the __pycache__ directory too.
Jupyter Lab and Jupyter notebook both show the inline .svg images well.
Verify graphviz installation
Try making a text file t.dot with the content digraph T { A -> B } (paste that into a text editor, for example) and then running this from the command line:
dot -Tsvg -o t.svg t.dot
That should give a simple t.svg
file that opens properly. If you get errors from dot
, it will not work from the dtreeviz python code. If it can't find dot
then you didn't update your PATH
environment variable or there is some other install issue with graphviz
.
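As a shortcut on macOS or Linux, you can create and render the same test file straight from a POSIX shell:
echo 'digraph T { A -> B }' > t.dot
dot -Tsvg -o t.svg t.dot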
Limitations
Finally, don't use IE to view .svg files; use Edge, as they look much better. I suspect that IE is displaying them as rasterized rather than vector images. Only .svg files can be generated on this platform.
Usage
dtreeviz(): Main function to create a decision tree visualization. Given a decision tree regressor or classifier, it creates and returns a tree visualization using the graphviz (DOT) language.
Required libraries
Basic libraries and imports that will (might) be needed to generate the sample visualizations shown in the examples below:
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *
Regression decision tree
The default orientation of the tree is top-down, but you can change it to left-to-right using orientation="LR". view() pops up a window with the rendered graphviz object.
regr = tree.DecisionTreeRegressor(max_depth=2)
boston = load_boston()
regr.fit(boston.data, boston.target)

viz = dtreeviz(regr,
               boston.data,
               boston.target,
               target_name='price',
               feature_names=boston.feature_names)
viz.view()
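If you'd rather write the image to disk than pop up a window, the returned object can also be saved as an SVG file; a minimal sketch, assuming the save() method on the returned visualization object (the filename is illustrative):
viz.save("boston_tree.svg")  # writes the rendered tree as a .svg file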
Classification decision tree
An additional argument, class_names, giving a mapping from class value to class name, is required for classification trees.
classifier = tree.DecisionTreeClassifier(max_depth=2)  # limit depth of tree
iris = load_iris()
classifier.fit(iris.data, iris.target)

viz = dtreeviz(classifier,
               iris.data,
               iris.target,
               target_name='variety',
               feature_names=iris.feature_names,
               class_names=["setosa", "versicolor", "virginica"])  # need class_names for classifier
viz.view()
Prediction path
Highlights the decision nodes through which the feature values of a single observation, passed via argument X, descend. Gives the feature values of the observation and highlights the features the tree uses to traverse the path.
import numpy as np

regr = tree.DecisionTreeRegressor(max_depth=2)  # limit depth of tree
diabetes = load_diabetes()
regr.fit(diabetes.data, diabetes.target)
X = diabetes.data[np.random.randint(0, len(diabetes.data)), :]  # random sample from training

viz = dtreeviz(regr,
               diabetes.data,
               diabetes.target,
               target_name='value',
               orientation='LR',  # left-right orientation
               feature_names=diabetes.feature_names,
               X=X)  # need to give single observation for prediction
viz.view()
If you want to visualize just the prediction path, you need to set the parameter show_just_path=True:
dtreeviz(regr,
         diabetes.data,
         diabetes.target,
         target_name='value',
         orientation='TD',  # top-down orientation
         feature_names=diabetes.feature_names,
         X=X,  # need to give single observation for prediction
         show_just_path=True)
Explain prediction path
These visualizations are useful for explaining to somebody without machine learning skills why your model made a specific prediction.
With explanation_type=plain_english, it searches the prediction path and finds the feature value ranges:
X = dataset[features].iloc[10]
print(X)
Pclass 3.0
Age 4.0
Fare 16.7
Sex_label 0.0
Cabin_label 145.0
Embarked_label 2.0
print(explain_prediction_path(tree_classifier, X, feature_names=features, explanation_type="plain_english"))
2.5 <= Pclass
Age < 36.5
Fare < 23.35
Sex_label < 0.5
With explanation_type=sklearn_default (available only for scikit-learn), we can visualize the feature importances involved in the prediction path only. Feature importance is calculated based on mean decrease in impurity.
Check the Beware Default Random Forest Importances article for a comparison of feature importance based on mean decrease in impurity vs. permutation importance.
explain_prediction_path(tree_classifier, X, feature_names=features, explanation_type="sklearn_default")
Decision tree without scatterplot or histograms for decision nodes
For a simple tree without histograms or scatterplots at the decision nodes, use the argument fancy=False:
classifier = tree.DecisionTreeClassifier(max_depth=4)  # limit depth of tree
cancer = load_breast_cancer()
classifier.fit(cancer.data, cancer.target)

viz = dtreeviz(classifier,
               cancer.data,
               cancer.target,
               target_name='cancer',
               feature_names=cancer.feature_names,
               class_names=["malignant", "benign"],
               fancy=False)  # fancy=False removes histograms/scatterplots from decision nodes
viz.view()
For more examples and different implementations, please see the jupyter notebook full of examples.
Regression univariate feature-target space
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
df_cars = pd.read_csv("cars.csv")
X, y = df_cars[['WGT']], df_cars['MPG']
dt = DecisionTreeRegressor(max_depth=3, criterion="mae")
dt.fit(X, y)
fig = plt.figure()
ax = fig.gca()
rtreeviz_univar(dt, X, y, 'WGT', 'MPG', ax=ax)
plt.show()
Regression bivariate feature-target space
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *

df_cars = pd.read_csv("cars.csv")
X = df_cars[['WGT','ENG']]
y = df_cars['MPG']

dt = DecisionTreeRegressor(max_depth=3, criterion="mae")
dt.fit(X, y)

figsize = (6,5)
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(111, projection='3d')

t = rtreeviz_bivar_3D(dt,
                      X, y,
                      feature_names=['Vehicle Weight', 'Horse Power'],
                      target_name='MPG',
                      fontsize=14,
                      elev=20,
                      azim=25,
                      dist=8.2,
                      show={'splits','title'},
                      ax=ax)
plt.show()
Regression bivariate feature-target space heatmap
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *

df_cars = pd.read_csv("cars.csv")
X = df_cars[['WGT','ENG']]
y = df_cars['MPG']

dt = DecisionTreeRegressor(max_depth=3, criterion="mae")
dt.fit(X, y)

t = rtreeviz_bivar_heatmap(dt,
                           X, y,
                           feature_names=['Vehicle Weight', 'Horse Power'],
                           fontsize=14)
plt.show()
Classification univariate feature-target space
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.trees import *

know = pd.read_csv("knowledge.csv")
class_names = ['very_low', 'Low', 'Middle', 'High']
know['UNS'] = know['UNS'].map({n: i for i, n in enumerate(class_names)})

X = know[['PEG']]
y = know['UNS']

dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X, y)

ct = ctreeviz_univar(dt, X, y,
                     feature_names=['PEG'],
                     class_names=class_names,
                     target_name='Knowledge',
                     nbins=40, gtype='strip',
                     show={'splits','title'})
plt.tight_layout()
plt.show()
Classification bivariate feature-target space
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.trees import *

know = pd.read_csv("knowledge.csv")
print(know)
class_names = ['very_low', 'Low', 'Middle', 'High']
know['UNS'] = know['UNS'].map({n: i for i, n in enumerate(class_names)})

X = know[['PEG','LPR']]
y = know['UNS']

dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X, y)

ct = ctreeviz_bivar(dt, X, y,
                    feature_names=['PEG','LPR'],
                    class_names=class_names,
                    target_name='Knowledge')
plt.tight_layout()
plt.show()
Leaf node purity
Leaf purity affects prediction confidence.
For classification, leaf purity is calculated from the majority target class (gini, entropy); for regression, it is calculated from the variance of the target values.
Leaves with low variance among the target values (regression) or an overwhelming majority target class (classification) are much more reliable predictors. When a decision tree is deep, it can be difficult to get an overview of all the leaf purities, so we created a specialized visualization just for them.
display_type can take values 'plot' (default), 'hist' or 'text'
viz_leaf_criterion(tree_classifier, display_type = "plot")
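For a self-contained sketch of the workflow, here is one way to build a classifier on the iris data from the earlier examples and inspect its leaf purities (the depth is arbitrary):
from sklearn.datasets import load_iris
from sklearn import tree
from dtreeviz.trees import viz_leaf_criterion
import matplotlib.pyplot as plt

iris = load_iris()
tree_classifier = tree.DecisionTreeClassifier(max_depth=4)
tree_classifier.fit(iris.data, iris.target)

# show the split criterion value (e.g., gini) for each leaf
viz_leaf_criterion(tree_classifier, display_type="plot")
plt.show()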
Leaf node samples
It's also important to look at the number of samples in each leaf. For example, we can have a leaf with good purity but very few samples, which is a sign of overfitting. The ideal scenario is a leaf with good purity that is based on a significant number of samples.
display_type can take values 'plot' (default), 'hist' or 'text'
viz_leaf_samples(tree_classifier, dataset[features], display_type='plot')
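A matching self-contained sketch, again with the iris data (the depth is arbitrary):
from sklearn.datasets import load_iris
from sklearn import tree
from dtreeviz.trees import viz_leaf_samples
import matplotlib.pyplot as plt

iris = load_iris()
tree_classifier = tree.DecisionTreeClassifier(max_depth=4).fit(iris.data, iris.target)

# show the number of training samples that landed in each leaf
viz_leaf_samples(tree_classifier, iris.data, display_type='plot')
plt.show()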
Leaf node samples for classification
This is a specialized visualization for classification. It also helps to see the distribution of target class values among the leaf samples.
ctreeviz_leaf_samples(tree_classifier, dataset[features], dataset[target])
Leaf plots
Visualize leaf target distribution for regression decision trees.
viz_leaf_target(tree_regressor, dataset[features_reg], dataset[target_reg], features_reg, target_reg)
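Following the signature shown above, a minimal sketch with the diabetes data from the earlier examples (the target name 'progression' is just a label we chose):
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import viz_leaf_target
import matplotlib.pyplot as plt

diabetes = load_diabetes()
tree_regressor = DecisionTreeRegressor(max_depth=3).fit(diabetes.data, diabetes.target)

# plot the distribution of target values collected in each leaf
viz_leaf_target(tree_regressor, diabetes.data, diabetes.target,
                diabetes.feature_names, 'progression')
plt.show()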
Classification boundaries in feature space
With version 1.3, we introduced the method clfviz() that illustrates one- and two-dimensional feature space for classifiers, including colors that represent probabilities, decision boundaries, and misclassified entities. This method works with any model that implements predict_proba() (and we also support Keras), so any model from scikit-learn should work. If you let us know about incompatibilities, we can support more models. There are lots of options you can check out in the API documentation. See classifier-decision-boundaries.ipynb and classifier-boundary-animations.ipynb.
clfviz(rf, X, y, feature_names=['x1', 'x2'], markers=['o','X','s','D'], target_name='smiley')
clfviz(rf, x, y, feature_names=['f27'], target_name='cancer')
clfviz(rf, x, y,
       feature_names=['x2'],
       target_name='smiley',
       colors={'scatter_marker_alpha': .2})
Sometimes it's helpful to see animations that vary some of the hyperparameters. If you look in the notebook classifier-boundary-animations.ipynb, you will see code that generates animations such as the following (animated PNG files):
Visualization methods setup
Starting with dtreeviz version 1.0, we refactored the library around the concept of a ShadowDecTree. To add support for a new ML library to dtreeviz, we just need to add a new implementation of the ShadowDecTree API, like ShadowSKDTree, ShadowXGBDTree or ShadowSparkTree.
Initializing a ShadowSKDTree object:
sk_dtree = ShadowSKDTree(tree_classifier, dataset[features], dataset[target], features, target, [0, 1])
Once we have the object initialized, we can use it to create all the visualizations, like:
dtreeviz(sk_dtree)
viz_leaf_samples(sk_dtree)
viz_leaf_criterion(sk_dtree)
In this way, we substantially reduced the list of parameters required for each visualization, and it's also more efficient in terms of computing power.
You can check the notebooks section for more examples of using ShadowSKDTree, ShadowXGBDTree or ShadowSparkTree.
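Putting the pieces together, a minimal sketch on the iris data (the ShadowSKDTree import path below is our best guess at the package layout; check the notebooks if it differs):
from sklearn.datasets import load_iris
from sklearn import tree
from dtreeviz.trees import dtreeviz, viz_leaf_samples, viz_leaf_criterion
from dtreeviz.models.sklearn_decision_trees import ShadowSKDTree  # assumed import path

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# wrap model + data once; [0, 1, 2] are the class indices for the three iris classes
sk_dtree = ShadowSKDTree(clf, iris.data, iris.target,
                         iris.feature_names, 'variety', [0, 1, 2])

dtreeviz(sk_dtree)           # every visualization can now reuse the same shadow tree
viz_leaf_samples(sk_dtree)
viz_leaf_criterion(sk_dtree)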
Install dtreeviz locally
Make sure to follow the install guidelines above.
To push the dtreeviz
library to your local egg cache (force updates) during development, do this (from Anaconda Prompt on Windows):
python setup.py install -f
E.g., on Terence's box, it adds /Users/parrt/anaconda3/lib/python3.6/site-packages/dtreeviz-0.3-py3.6.egg.
Customize colors
Each function has an optional parameter colors which allows passing a dictionary of colors to be used in the plot. For an example of each parameter, have a look at this notebook.
Example
dtreeviz.trees.dtreeviz(regr,
                        boston.data,
                        boston.target,
                        target_name='price',
                        feature_names=boston.feature_names,
                        colors={'scatter_marker': '#00ff00'})
would paint the scatter dots in green ('#00ff00').
Parameters
The colors are defined in colors.py
, all options and default parameters are shown below.
COLORS = {'scatter_edge': GREY,
          'scatter_marker': BLUE,
          'split_line': GREY,
          'mean_line': '#f46d43',
          'axis_label': GREY,
          'title': GREY,
          'legend_title': GREY,
          'legend_edge': GREY,
          'edge': GREY,
          'color_map_min': '#c7e9b4',
          'color_map_max': '#081d58',
          'classes': color_blind_friendly_colors,
          'rect_edge': GREY,
          'text': GREY,
          'highlight': HIGHLIGHT_COLOR,
          'wedge': WEDGE_COLOR,
          'text_wedge': WEDGE_COLOR,
          'arrow': GREY,
          'node_label': GREY,
          'tick_label': GREY,
          'leaf_label': GREY,
          'pie': GREY,
          }
The color needs to be in a format matplotlib can interpret, e.g. an HTML hex string like '#eeefff'.
classes needs to be a list of lists of colors, indexed by the number of classes: the inner list at index n is used when the tree has n classes, and it must contain at least n colors.
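For instance, a hypothetical palette covering up to three classes could look like this; only the inner list whose index matches your class count is actually used:
my_class_colors = [None,                                # index 0: unused
                   None,                                # index 1: unused
                   ['#fefecd', '#a1dab4'],              # palette for 2 classes
                   ['#fefecd', '#a1dab4', '#41b6c4']]   # palette for 3 classes
Pass it as colors={'classes': my_class_colors} to any of the visualization functions.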
Useful Resources
- How to visualize decision trees
- How to explain gradient boosting
- The Mechanics of Machine Learning
- Animation by R2D3
- A visual introduction to machine learning
- fast.ai's Introduction to Machine Learning for Coders MOOC
- Stef van den Elzen's Interactive Construction, Analysis and Visualization of Decision Trees
- Some similar feature-space visualizations in Towards an effective cooperation of the user and the computer for classification, SIGKDD 2000
- Beautiful Decisions: Inside BigML’s Decision Trees
- "SunBurst" approach to tree visualization: An evaluation of space-filling information visualizations for depicting hierarchical structures
Authors
See also the list of contributors who participated in this project.
License
This project is licensed under the terms of the MIT license, see LICENSE.
Deploy
$ python setup.py sdist upload