This happens when training on batched data with `warm_start=True` and the data is unbalanced, i.e. an early chunk contains only one of the classes.
Error:
/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: **Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes)** is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
Traceback (most recent call last):
ml-pipeline/src/treeint_simple_example.py", line 22, in <module>
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 212, in predict
return _predict_forest(model, X, joint_contribution=joint_contribution)
File "/Users/x/anaconda3/lib/python3.7/site-packages/treeinterpreter/treeinterpreter.py", line 166, in _predict_forest
return (np.mean(predictions, axis=0), np.mean(biases, axis=0),
File "<__array_function__ internals>", line 6, in mean
File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3373, in mean
out=out, **kwargs)
File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 144, in _mean
arr = asanyarray(a)
File "/Users/x/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
**ValueError: could not broadcast input array from shape (2,1) into shape (2)**
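My reading of the traceback (the snippet below is my own illustration, not code from treeinterpreter): `_predict_forest` averages the per-tree prediction arrays with `np.mean(predictions, axis=0)`, and it looks like the tree fit while only one class was present yields arrays with a single probability column while later trees yield two, so NumPy cannot stack the ragged list:

```python
# Standalone illustration; the shapes are assumed from the error message above.
import numpy as np

per_tree_predictions = [
    np.zeros((2, 1)),  # tree fit while only class 0 existed: one column
    np.zeros((2, 2)),  # tree fit after class 1 appeared: two columns
]

try:
    # roughly what np.mean(predictions, axis=0) in _predict_forest boils down to
    np.mean(per_tree_predictions, axis=0)
except ValueError as err:
    print(err)  # the ragged list cannot be turned into a single ndarray
```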
Reproduction:
```python
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti
import pandas as pd

# Random forest that can train on chunks of data.
rf = RandomForestClassifier(warm_start=True, n_estimators=1)

# data of chunk1 (only class 0)
chunk1_data_vec = [0, 0]
chunk1_df = pd.DataFrame(data={'label': chunk1_data_vec, 'features1': chunk1_data_vec, 'features2': chunk1_data_vec})

# data of chunk2 (both classes)
chunk2_data_vec = [0, 0, 1, 1, 0, 0, 1, 1]
chunk2_df = pd.DataFrame(data={'label': chunk2_data_vec, 'features1': chunk2_data_vec, 'features2': chunk2_data_vec})

# fit first chunk of data that has a single label
rf.fit(X=chunk1_df.drop(['label'], axis='columns'), y=chunk1_df['label'])

# fit second chunk of data that has 2 labels
rf.n_estimators += 1
rf.fit(X=chunk2_df.drop(['label'], axis='columns'), y=chunk2_df['label'])

# test
test_data = chunk2_df.drop(['label'], axis='columns')

# regular predict works
rf.predict_proba(test_data)

# tree interpreter predict raises the error above
test_predict_prob, bias, contributions = ti.predict(rf, test_data.head(2))
```
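A quick diagnostic on top of the reproduction (my own addition, using the standard `estimators_` attribute of the fitted forest) to check that the two trees disagree on the number of probability columns, which would match the ragged-sequence warning above:

```python
# Print each fitted tree's class count and probability-output shape.
# If I understand warm_start correctly, the tree kept from the chunk1 fit
# should report a single class and the tree added for chunk2 should report two.
for est in rf.estimators_:
    print(est.n_classes_, est.predict_proba(test_data.head(2)).shape)
```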