CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms

Overview

CARLA - Counterfactual And Recourse Library

CARLA is a Python library for benchmarking counterfactual explanation and recourse models. It comes out of the box with commonly used datasets and various machine learning models, and it is designed with extensibility in mind: easily include your own counterfactual methods, new machine learning models, or other datasets.

Extensive documentation is available on Read the Docs; our paper is available on arXiv (2108.00783).

Available Datasets

  • Adult
  • COMPAS
  • Give Me Some Credit (GMC)

Implemented Counterfactual Methods

  • Actionable Recourse (AR): Paper
  • CCHVAE: Paper
  • Contrastive Explanations Method (CEM): Paper
  • Counterfactual Latent Uncertainty Explanations (CLUE): Paper
  • CRUDS: Paper
  • Diverse Counterfactual Explanations (DiCE): Paper
  • Feasible and Actionable Counterfactual Explanations (FACE): Paper
  • Growing Sphere (GS): Paper
  • Revise: Paper
  • Wachter: Paper

Provided Machine Learning Models

  • ANN: Artificial Neural Network with 2 hidden layers and ReLU activation function
  • LR: Linear Model with no hidden layer and no activation function

Which Recourse Methods work with which ML framework?

Which ML framework a counterfactual method currently works with depends on its underlying implementation. We plan to make all recourse methods available for all ML frameworks. The current state is shown below:

Recourse Method       Tensorflow   Pytorch
Actionable Recourse        X           X
CCHVAE                                 X
CEM                        X
CLUE                                   X
CRUDS                                  X
DiCE                       X           X
FACE                       X           X
Growing Spheres            X           X
Revise                                 X
Wachter                                X
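
For example, the backend of a catalog model can be chosen to match the recourse method's framework. This is only a sketch based on the backend argument used in the issue snippets further down this page:

from carla import DataCatalog, MLModelCatalog

dataset = DataCatalog("adult")

# CEM only has a TensorFlow implementation ...
model_tf = MLModelCatalog(dataset, "ann", backend="tensorflow")
# ... while CLUE, CCHVAE, CRUDS, Revise and Wachter need the PyTorch model
model_pt = MLModelCatalog(dataset, "ann", backend="pytorch")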

Installation

Requirements

  • python3.7
  • pip

Install via pip

pip install carla-recourse

Usage Example

from carla import DataCatalog, MLModelCatalog
from carla.recourse_methods import GrowingSpheres

# load a catalog dataset
data_name = "adult"
dataset = DataCatalog(data_name)

# load artificial neural network from catalog
model = MLModelCatalog(dataset, "ann")

# get factuals from the data to generate counterfactual examples
factuals = dataset.raw.iloc[:10]

# load a recourse model and pass black box model
gs = GrowingSpheres(model)

# generate counterfactual examples
counterfactuals = gs.get_counterfactuals(factuals)
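
The generated counterfactuals can then be evaluated with the Benchmark class; this sketch follows the calls used in the issue reports below (run_benchmark and compute_distances are the methods shown there):

from carla import Benchmark

# benchmark the recourse method on the same factuals
benchmark = Benchmark(model, gs, factuals)
results = benchmark.run_benchmark()         # full table of evaluation metrics
distances = benchmark.compute_distances()   # distance metrics only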

Contributing

Requirements

  • python3.7-venv (if not already shipped with python3.7)
  • Recommended: GNU Make

Installation

Using make:

make requirements

Using python directly or within activated virtual environment:

pip install -U pip setuptools wheel
pip install -e .

Testing

Using make:

make test

Using python directly or within activated virtual environment:

pip install -r requirements-dev.txt
python -m pytest test/*

Linting and Styling

We use pre-commit hooks within our build pipelines to enforce:

  • Python linting with flake8.
  • Python styling with black.

Install pre-commit with:

make install-dev

Using python directly or within activated virtual environment:

pip install -r requirements-dev.txt
pre-commit install

Licence

carla is under the MIT Licence. See the LICENCE for more details.

Citation

This project was recently accepted to NeurIPS 2021 (Datasets and Benchmarks Track). If you use this codebase, please cite:

@misc{pawelczyk2021carla,
      title={CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms},
      author={Martin Pawelczyk and Sascha Bielawski and Johannes van den Heuvel and Tobias Richter and Gjergji Kasneci},
      year={2021},
      eprint={2108.00783},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Comments
  • Discrepancy in counterfactual indexing for CLUE generator

    Hello!

    When generating counterfactuals with the CLUE generator, the resulting counterfactual dataframe is indexed with a RangeIndex instead of the original index of the factuals. This is a problem when the factuals' indices do not form a range index, e.g. after using .sample(n). This can be seen by running this code:

    # imports and the dataset construction were omitted in the original snippet;
    # they are filled in here following the examples elsewhere on this page
    from carla.data.catalog import OnlineCatalog
    from carla.models.catalog import MLModelCatalog
    from carla.models.negative_instances import predict_negative_instances
    from carla.recourse_methods import Clue, Wachter

    data_name = "compas"
    dataset = OnlineCatalog(data_name)
    model = MLModelCatalog(dataset, "ann", backend="pytorch")
    model.train(...)

    hyperparams = {...}
    cl = Clue(dataset, model, hyperparams)

    wa = Wachter(model, {...})

    factuals = predict_negative_instances(model, dataset._df).sample(10)

    cl_counterfactuals = cl.get_counterfactuals(factuals)
    wa_counterfactuals = wa.get_counterfactuals(factuals)

    display(factuals.index)
    display(cl_counterfactuals.index)
    display(wa_counterfactuals.index)
    

    This yields:

    Int64Index([4886, 4389, 2317, 797, 3154, 4685, 956, 3014, 99, 510], dtype='int64')
    RangeIndex(start=0, stop=10, step=1)
    Int64Index([4886, 4389, 2317, 797, 3154, 4685, 956, 3014, 99, 510], dtype='int64')
    

    Now, if we try to benchmark the CLUE counterfactual generator:

    benchmark = Benchmark(model, cl, factuals)
    benchmark.run_benchmark()
    

    we get:

    ValueError: Can only compare identically-labeled DataFrame objects
    

    This stems from the constraint_violation check in carla\evaluation\violations.py.

    The counterfactuals get re-indexed in counterfactuals.check_counterfactuals for CLUE, FACE, GrowingSpheres, and REVISE, since they all pass in a list of counterfactuals rather than a pandas dataframe, so at least those methods can run into this problem.

    One way of fixing this would be to pass an index argument to check_counterfactuals and add index=indices on line 32 to set the proper indices. Another would be to simply set cfs_df.index = factuals.index after the function is called to restore the indices.

    Since the counterfactuals currently get new indices and both dataframes are the same size, restoring the indices from the factuals shouldn't cause any further problems that I can spot, and it would only improve consistency across the recourse methods.
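
    A minimal pandas illustration of the second proposal (toy frames standing in for CARLA's output; not the library's actual code):

    import pandas as pd

    # the factuals carry a non-contiguous Int64Index, e.g. after .sample(n) ...
    factuals = pd.DataFrame({"x": [1.0, 2.0]}, index=[4886, 4389])
    # ... while the counterfactuals come back with a fresh RangeIndex
    cfs_df = pd.DataFrame({"x": [1.5, 2.5]})

    # proposed fix: restore the factual indices so row-wise comparisons line up
    cfs_df.index = factuals.index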

    If any of the proposed fixes are acceptable, or there are other ways to fix the issue, I would be happy to perform them and be assigned to this issue. :)

    opened by drobiu 12
  • Strange sparsity results

    Hi!

    I've noticed some potentially wrong sparsity (L0/Distance_1) results due to some very small numbers.

    When running CARLA's benchmarking code for the cem-vae method and the first 10 test observations:

    from carla.data.catalog import OnlineCatalog
    from carla.models.catalog import MLModelCatalog
    from carla.models.negative_instances import predict_negative_instances
    from carla import Benchmark
    
    import carla.recourse_methods.catalog as recourse_catalog
    
    import torch
    
    dataset = OnlineCatalog("adult")
    
    torch.manual_seed(0)
    n_test = 10
    ml_model = MLModelCatalog(
            dataset, 
            model_type="ann", 
            load_online=False, 
            backend="pytorch"
        )
    
    ml_model.train(
        learning_rate=0.002,
        epochs=20,
        batch_size=1024,
        hidden_size=[18, 9, 3],
        force_train=True, 
    )
    
    hyperparams = {
        "data_name": "adult",
        "batch_size": 1,
        "kappa": 0.1,
        "init_learning_rate": 0.01,
        "binary_search_steps": 9,
        "max_iterations": 100,
        "initial_const": 10,
        "beta": 0.9,
        "gamma": 1.0, # 0.0, #   1.0
        "mode": "PN",
        "num_classes": 2,
        "ae_params": {"hidden_layer": [20, 10, 7], "train_ae": True, "epochs": 5},
    }
    
    from tensorflow import Graph, Session
    
    graph = Graph()
    with graph.as_default():
        ann_sess = Session()
        with ann_sess.as_default():
            ml_model_sess = MLModelCatalog(dataset, "ann", "tensorflow")
    
            factuals_sess = predict_negative_instances(
                ml_model_sess, dataset.df
            )
            factuals_sess = factuals_sess.iloc[:n_test].reset_index(drop=True)
    
            cem = recourse_catalog.CEM(ann_sess, ml_model_sess, hyperparams)
            df_cfs = cem.get_counterfactuals(factuals_sess)
            benchmark = Benchmark(ml_model, cem, factuals_sess)
    
    distances = benchmark.compute_distances()
    
    distances.Distance_1[0] # equal to 5
    

    I get that the first sparsity/Distance_1 value is equal to 5. When printing out the factual and counterfactual for this test observation, I find that the two vectors are almost the same (the only difference is 'capital-gain').

    [screenshot omitted: the factual and counterfactual rows, identical except for 'capital-gain']

    The reason for this problem is that the distance_1 code looks something like this

    import numpy as np
    
    arr_f = ml_model.get_ordered_features(benchmark._factuals).to_numpy()
    arr_cf = ml_model.get_ordered_features(
        benchmark._counterfactuals
    ).to_numpy()
    
    delta = arr_f - arr_cf
    
    d1 = np.sum(delta != 0, axis=1, dtype=np.float).tolist()
    

    For the first observation, delta (the difference between the factual and the counterfactual) has really small (but not zero) numbers:

    [screenshot omitted: the delta vector, with very small but non-zero entries]

    Which leads to a wrong calculation of d1.

    Any suggestions on how to fix this delta/rounding problem?
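
    One possible direction (a sketch only; the tolerance is an arbitrary choice, not something CARLA prescribes) would be to count a feature as changed only when it moves by more than a small tolerance:

    import numpy as np

    # toy factual/counterfactual rows; the last two entries differ only by numerical noise
    arr_f = np.array([[0.30, 1.0, 0.0]])
    arr_cf = np.array([[0.55, 1.0 - 1e-9, 1e-12]])

    delta = arr_f - arr_cf
    # entries within atol of zero are treated as unchanged
    d1 = np.sum(~np.isclose(delta, 0.0, atol=1e-5), axis=1, dtype=float).tolist()  # -> [1.0]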

    Thanks!

    opened by aredelmeier 8
  • Issue running REVISE

    Hi! Thanks for all your work implementing a large range of counterfactual explanation methods!

    I'm having problems running the Revise method and was hoping someone could point me in the right direction. Here is my code (I'm using the CARLA package master branch pushed April 20th, 2022):

    
    from carla.data.catalog import OnlineCatalog
    from carla.models.catalog import MLModelCatalog
    from carla.models.negative_instances import predict_negative_instances
    import carla.recourse_methods.catalog as recourse_catalog
    
    data_name = "adult"
    dataset = OnlineCatalog(data_name)
    
    training_params = {"lr": 0.002, "epochs": 10, "batch_size": 1024, "hidden_size": [18, 9, 3]}
    
    ml_model = MLModelCatalog(
        dataset, 
        model_type="ann", 
        load_online=False, 
        backend="pytorch"
    )
    
    ml_model.train(
        learning_rate=training_params["lr"],
        epochs=training_params["epochs"],
        batch_size=training_params["batch_size"],
        hidden_size=training_params["hidden_size"]
    )
    
    factuals = predict_negative_instances(ml_model, dataset.df)
    test_factual = factuals.iloc[:5]
    
    hyperparams = {
        "data_name": "adult",
        "lambda": 0.5,
        "optimizer": "adam",
        "lr": 0.1,
        "max_iter": 1000,
        "target_class": [0, 1],
        "binary_cat_features": True,
        "vae_params": {
            "layers": [13, 512, 256, 8],
            "train": True,
            "lambda_reg": 1e-6,
            "epochs": 5,
            "lr": 1e-3,
            "batch_size": 32,
        },
    }
    
    # This runs:
    revise = recourse_catalog.Revise(ml_model, dataset, hyperparams)
    
    # This gives me an error:
    df_cfs = revise.get_counterfactuals(test_factual)
    

    The last line of code gives me an IndexError: index 13 is out of bounds for dimension 1 with size 13. Any idea why this may be?

    Thanks, Annabelle

    opened by aredelmeier 8
  • Wachter et al. Counterfactuals

    Hello everyone,

    I was trying to find the implementation of the simple counterfactuals proposed by Wachter et al. Could someone please point me to the file where it is implemented?

    Thank you very very much! Cheers, Nina

    opened by ninaspreitzer 6
  • CCHVAE's support for categorical features

    Hello,

    Many thanks for this great library.

    I have a question: Does the current implementation of CCHVAE support categorical features?

    If so, how should I set the number of nodes for the first (input) layer of the VAE, i.e. the first entry of vae_params['layers']? Should I set this value to the input length after one-hot encoding is applied?
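
    For reference, one way to read off the encoded input width (a sketch only; feature_input_order is assumed to list the model's one-hot encoded input columns, and the 512/256/8 sizes are copied from the vae_params example elsewhere on this page):

    from carla.data.catalog import OnlineCatalog
    from carla.models.catalog import MLModelCatalog

    dataset = OnlineCatalog("adult")
    ml_model = MLModelCatalog(dataset, "ann", backend="pytorch")

    # width of the model input after one-hot encoding; if the VAE consumes the
    # full encoded input, its first layer would have to match this number
    input_dim = len(ml_model.feature_input_order)
    vae_layers = [input_dim, 512, 256, 8]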

    opened by ah-ansari 5
  • Unusual results for REVISE/C-CHVAE/CRUD when fixing features

    Hi!

    I ran all the dependence-based methods using the CARLA master branch from April 22nd, 2022, for both the Adult and GMC datasets, using an ANN with [18, 9, 3] layers, 20 epochs, and a batch size of 1024 (100 test observations; AUC 0.90 and 0.86, respectively). For Adult, 'age' and 'sex' were fixed; for GMC, only 'age' was fixed. The hyperparameters for the dependence-based methods were specified as in https://github.com/carla-recourse/CARLA/blob/main/experimental_setup.yaml.

    I used the Benchmarking code to test the violation/L0/L1/L2 metrics and was surprised by the results (these are the mean of violation/L0/L1/L2 across 100 test observations):

    [results table screenshot omitted]

    From my understanding, the REVISE, C-CHVAE, and CRUD methods all take fixed features into account. However, these three methods had quite large (and certainly non-zero) violation rates.

    In addition, C-CHVAE seems to produce almost identical counterfactuals for different test observations. The table in https://github.com/carla-recourse/CARLA/blob/chore/update_documentation/docs/source/notebooks/how_to_use_carla.ipynb (I produced something similar) shows just how similar they are. [table screenshot omitted]

    My question is whether you have experienced the same phenomenon. Do you have results for REVISE/C-CHVAE/CRUD where the violation rate is close to zero, or where C-CHVAE does not produce identical counterfactuals?

    Any help would be appreciated!

    Annabelle

    opened by aredelmeier 5
  • Question on example for adding a recourse method

    Hi there!

    I have a few (3) questions about how to add a new recourse method.

    1. In your notebook example, there is:
    def get_counterfactuals(self, factuals: pd.DataFrame):
          # this property is responsible to generate and output
          # encoded and scaled counterfactual examples
          # as pandas DataFrames
          return counterfactual_examples
    

    I wonder what you mean exactly by encoded and scaled. Does that mean they should follow the same encoding as the factuals? Moreover, should there be exactly one counterfactual example per given factual? (I assume that factuals is a collection of points for which counterfactuals are needed; a minimal stub illustrating this contract is sketched after the questions below.)

    2. I see you test the counterfactuals according to 4 distance functions. Is there info on which distance function is used when get_counterfactuals is called? I can imagine that you'd want your recourse method to optimize for the distance that is ultimately used for evaluation.

    3. Is there a way for the recourse method to know the range of variability of a feature? E.g., the min and max of numerical features based on the training set, and the possible categories for categorical features. Otherwise, I can imagine the black-box model could be given invalid input while searching for counterfactuals (a value that is too high or too low, or a category that does not exist).

    Forgive me if this info is explained somewhere else and I missed it, in which case I'd kindly ask you to point me to it.
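
    For what it's worth, a minimal stub of the expected shape (purely illustrative; it subclasses nothing and simply echoes the factuals back in the model's encoding):

    import pandas as pd

    class MyRecourseMethod:
        """Skeleton only; in CARLA this would implement the recourse method API."""

        def __init__(self, mlmodel):
            self._mlmodel = mlmodel

        def get_counterfactuals(self, factuals: pd.DataFrame) -> pd.DataFrame:
            # expected contract: one row per factual, the same index, and the
            # same encoding/scaling as the wrapped model's input
            cfs = factuals.copy()
            return cfs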

    opened by marcovirgolin 5
  • Question about running the example code

    Thank you for all the work you put into the CARLA package. It's a great help when trying to compare different counterfactual methods! My question is about running the example code.

    The part of the posted example code is:

    from carla import DataCatalog, MLModelCatalog
    from carla.recourse_methods import GrowingSpheres

    # load a catalog dataset
    data_name = "adult"
    dataset = DataCatalog(data_name)

    When I run the line dataset = DataCatalog(data_name), I got the error below:

    TypeError: Can't instantiate abstract class DataCatalog with abstract methods categorical, continuous, immutables, target

    I am wondering whether you know what is happening there, or if there is any way to solve it?

    Appreciate your time in advance and really looking forward to your reply!

    Best, Wenting

    opened by wqi131206 5
  • predict_negative_instances normalization can result in double normalization if model use_pipeline is True

    predict_negative_instances calls predict_label, which normalizes the input and then calls model.predict. If model.use_pipeline == True, then model.predict normalizes the data again, resulting in double normalization.
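
    A toy illustration of why applying the same scaling twice is harmful (generic scikit-learn code, not CARLA's):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler().fit(np.array([[0.0], [10.0], [20.0]]))
    once = scaler.transform(np.array([[10.0]]))  # -> 0.5, the intended scaling
    twice = scaler.transform(once)               # -> 0.025, silently wrong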

    opened by JohanvandenHeuvel 5
  • MLModelCatalog predict method incompatible with pipeline

    The MLModelCatalog predict method has the following signature:

    def predict(
        self, x: Union[np.ndarray, pd.DataFrame, torch.Tensor, tf.Tensor]
    ) -> Union[np.ndarray, pd.DataFrame, torch.Tensor, tf.Tensor]:
    

    However, if the MLModelCatalog pipeline is enabled, then x is also the input for

    def perform_pipeline(self, df: pd.DataFrame) -> pd.DataFrame:
    

    That is, the predict function accepts input types that are incompatible with some of the possible model settings.

    opened by JohanvandenHeuvel 5
  • Chore/make gpu usable

    I'm not an expert with PyTorch and CUDA, but I did my best to add .to(device) where necessary. CCHVAE and REViSE were much easier than CRUD (you'll see only a couple of changes to vae.py, cchvae/model.py and revise/model.py). On the other hand, CRUD required a lot more work, and there may be some redundant .to(device) calls (e.g., in losses.py or csvae.py), but at least the code works.

    Since you don't have a GPU, it might be difficult to test this out, but perhaps you could check that the CPU version still works?

    opened by aredelmeier 4
  • Update benchmarking example

    I installed the package as recommended, using pip install carla-recourse. The installed version is carla-recourse==0.0.5, but I am not able to reproduce the benchmarking example. The error happens when importing the evaluation module. [screenshot of the import error omitted]

    opened by jscanass 0
  • Small errors with new REViSE, CCHVAE, and CRUD code

    Hi,

    Since the latest release of CARLA, I think there are some errors that have popped up.

    The biggest problem is that (I believe) the dimension of the input layer of the VAE has to be adjusted for REViSE, CCHVAE, and CRUD if the immutable_mask contains at least one TRUE.
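
    For illustration, the adjustment being described might look like this (a sketch only; immutable_mask is assumed to be a boolean array over the encoded features, and the 512/256/8 sizes are copied from the vae_params examples above):

    import numpy as np

    # if immutable features are removed before they reach the VAE, its input
    # layer has to match the number of mutable (non-masked) encoded features
    immutable_mask = np.array([True, False, False, False])  # toy mask
    n_mutable = int((~immutable_mask).sum())
    vae_layers = [n_mutable, 512, 256, 8]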

    In addition, when running the methods on a GPU, there is some code that has to be adjusted (so far, I've only found problems for these three methods but I haven't tested all of them).

    I'm not an expert on these methods, but can I do a pull request where I make changes to these methods such that they run again (on GPU)? Thanks,

    Annabelle

    opened by aredelmeier 5