Behavioral "black-box" testing for recommender systems

Overview

RecList

Documentation Status Contributors License Downloads

RecList

Overview

RecList is an open source library providing behavioral, "black-box" testing for recommender systems. Inspired by the pioneering work of Ribeiro et al. 2020 in NLP, we introduce a general plug-and-play procedure to scale up behavioral testing, with an easy-to-extend interface for custom use cases.

RecList ships with some popular datasets and ready-made behavioral tests: check the paper for more details on the relevant literature and the philosophical motivations behind the project.

If you are not familiar with the library, we suggest first taking our small tour to get acquainted with the main abstractions through ready-made models and public datasets.

Quick Links

  • Our paper, with in-depth analysis, detailed use cases and scholarly references.
  • A colab notebook (WIP), showing how to train a cart recommender model from scratch and use the library to test it.
  • Our blog post (forthcoming), with examples and practical tips.

Project updates

Nov. 2021: the library is currently in alpha (i.e. enough working code to finish the paper and tinker with it). We welcome early feedback, but please be advised that the package may change substantially in the upcoming months ("If you're not embarrassed by the first version, you've launched too late").

As the project is in active development, come back often for updates.

Summary

This doc is structured as follows:

  • Quick Start
  • A Guided Tour
  • Capabilities
  • Roadmap
  • Contributing
  • Acknowledgments
  • License and Citation

Quick Start

If you want to see RecList in action, clone the repository, create and activate a virtual env, and install the required packages from root. If you prefer to experiment in an interactive, no-installation-required fashion, try out our colab notebook.

Sample scripts are divided by use case: similar items, complementary items, or session-based recommendations. When you execute one, a suitable public dataset is downloaded and a baseline ML model is trained; finally, the script runs a pre-made suite of behavioral tests to show typical results.

git clone https://github.com/jacopotagliabue/reclist
cd reclist
python3 -m venv venv
source venv/bin/activate
pip install -e .
python examples/coveo_complementary_rec.py

Running your model on one of the supported datasets, leveraging the pre-made tests, is as easy as implementing a simple interface, RecModel.

Once you've successfully run the sample script, take the guided tour below to learn more about the abstractions and the out-of-the-box capabilities of RecList.

A Guided Tour

An instance of RecList represents a suite of tests for recommender systems: given a dataset (more appropriately, an instance of RecDataset) and a model (an instance of RecModel), it will run the specified tests on the target dataset, using the supplied model.

For example, the following code instantiates a pre-made suite of tests that contains sensible defaults for a cart recommendation use case:

rec_list = CoveoCartRecList(
    model=model,
    dataset=coveo_dataset
)
# invoke rec_list to run tests
rec_list(verbose=True)

Our library pre-packages standard RecSys KPIs and important behavioral tests, divided by use case, but it is built with extensibility in mind: you can re-use tests in new suites, or you can write new domain-specific suites and tests.

Any suite must inherit the RecList interface, and then declare its tests with Pythonic decorators: in this case, the test re-uses a standard function:

class MyRecList(RecList):

    @rec_test(test_type='stats')
    def basic_stats(self):
        """
        Basic statistics on training, test and prediction data
        """
        from reclist.metrics.standard_metrics import statistics
        return statistics(self._x_train,
            self._y_train,
            self._x_test,
            self._y_test,
            self._y_preds)

Any model can be tested, as long as its predictions are wrapped in a RecModel. This allows for pure "black-box" testing: a SaaS provider can be tested just by wrapping the proper API call in the method:

class MyCartModel(RecModel):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def predict(self, prediction_input: list, *args, **kwargs):
        """
        Implement the abstract method, accepting a list of lists, each list being
        the content of a cart: the predictions returned by the model are the top K
        items suggested to complete the cart.
        """

        return
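
As a concrete illustration, here is a minimal sketch of such a wrapper; the endpoint URL, payload, and response schema are purely hypothetical, so adapt the call to whatever provider or internal service you are testing:

import requests

class MySaaSCartModel(RecModel):

    def predict(self, prediction_input: list, *args, **kwargs):
        """
        For each cart (a list of items), call a (hypothetical) recommendation
        endpoint and collect the top K items suggested to complete the cart.
        """
        predictions = []
        for cart in prediction_input:
            # NOTE: URL, payload and response schema are made up for illustration
            response = requests.post(
                "https://api.my-provider.example/complete_cart",
                json={"cart": cart, "k": 10},
                timeout=10
            )
            predictions.append(response.json().get("items", []))
        return predictions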

While many standard KPIs are available in the package, the philosophy behind RecList is that metrics like Hit Rate provide only a partial picture of the expected behavior of recommenders in the wild: two models with very similar accuracy can have very different behavior on, say, the long-tail, or model A can be better than model B overall, but at the expense of providing disastrous performance on a set of inputs that are particularly important in production.

RecList recognizes that, outside of academic benchmarks, some mistakes are worse than others, and not all inputs are created equal: when possible, it tries to operationalize behavioral insights for debugging and error analysis through scalable code; it also provides extensible abstractions for when domain knowledge and custom logic are needed.
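
For instance, a custom test in the spirit of the suite shown above could report hit rate only on long-tail targets. The following is a rough sketch: the cutoff, the hypothetical test name, and the exact shape of the internal attributes are assumptions based on the examples in this doc, not the library's official API:

class MyLongTailRecList(RecList):

    @rec_test(test_type='long_tail_hit_rate')
    def long_tail_hit_rate(self):
        """
        Sketch: hit rate @ 10 restricted to test cases whose target item was
        seen at most 10 times in training, to surface behavior on rare items
        that an aggregate hit rate would average away.
        """
        from collections import Counter
        # count item occurrences in the training sessions
        counts = Counter(item for session in self._x_train for item in session)
        hits, total = 0, 0
        for target, preds in zip(self._y_test, self._y_preds):
            if not target or counts[target[0]] > 10:
                continue  # keep only long-tail targets
            total += 1
            hits += int(target[0] in preds[:10])
        return hits / total if total else None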

Once you run a suite of tests, results are dumped automatically and versioned in a local folder, structured as follows (name of the suite, name of the model, run timestamp):

.reclist/
  myList/
    myModel/
      1637357392/
      1637357404/

We provide a simple (and very WIP) UI to easily compare runs and models. After you run one of the example scripts twice, you can do:

cd app
python app.py

to start a local web app that lets you explore test results:

https://github.com/jacopotagliabue/reclist/blob/main/images/explorer.png

If you select more than one model, the app will automatically build comparison tables:

https://github.com/jacopotagliabue/reclist/blob/main/images/comparison.png

If you start using RecList as part of your standard testing - either for research or production purposes - you can use the JSON report for machine-to-machine communication with downstream systems (e.g. you may want to automatically fail the model pipeline if certain behavioral tests are not passed).
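
A minimal sketch of such a gate is shown below; it assumes the run folder contains a JSON file mapping test names to numeric results (the actual file name and report schema may differ, so adapt it to the report your version of RecList produces):

import json
import pathlib
import sys

# pick the latest run for a given suite / model (folder layout shown above)
runs = sorted(pathlib.Path(".reclist/myList/myModel").iterdir())
latest = runs[-1]

# NOTE: file name and report schema are assumptions for this sketch
report = json.loads((latest / "results.json").read_text())

# fail the pipeline if a behavioral metric falls below a chosen threshold
if report.get("hit_rate_at_10", 0.0) < 0.05:
    print("Behavioral gate failed for run {}".format(latest.name))
    sys.exit(1)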

Capabilities

RecList provides a dataset and model agnostic framework to scale up behavioral tests. As long as the proper abstractions are implemented, all the out-of-the-box components can be re-used. For example:

  • you can use a public dataset provided by RecList to train your new cart recommender model, and then use the RecTests we provide for that use case;
  • you can use one of our baseline models on your custom dataset, to establish a reference point for your project;
  • you can use a custom model on a private dataset and define a new suite of tests from scratch, mixing existing methods and domain-specific tests.

We list below what we currently support out-of-the-box, with a particular focus on datasets and tests: the models we provide are convenient baselines, but they are not meant to be SOTA research models.

Datasets

RecList features convenient wrappers around popular datasets, to help test models over known benchmarks in a standardized way.
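
For example, the Coveo dataset used throughout this doc can be loaded in a couple of lines and passed straight to a suite such as CoveoCartRecList (a sketch: the class lives in reclist/datasets.py, but constructor arguments may vary across versions):

from reclist.datasets import CoveoDataset

# instantiating the wrapper downloads and caches the public dataset,
# exposing the train/test splits expected by the pre-made RecLists
coveo_dataset = CoveoDataset()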

Behavioral Tests

Coming soon!

Roadmap

To do:

  • the app is just a stub: improve the report "contract" and extend the app's capabilities, possibly including it in the library itself;
  • continue adding default RecTests by use case, and test them on public datasets;
  • improve our test suites and refactor some abstractions;
  • add Colab tutorials, extensive documentation and a blog-like write-up to explain basic usage.

We maintain a small Trello board on the project which we plan on sharing with the community: more details coming soon!

Contributing

We will update this repo with some guidelines for contributions as soon as the codebase becomes more stable. Check back often for updates!

Acknowledgments

The main contributors are:

If you have questions or feedback, please reach out to: jacopo dot tagliabue at tooso dot ai.

License and Citation

All the code is released under an open MIT license. If you found RecList useful, or you are using it to benchmark/debug your model, please cite our pre-print (forthcoming):

@inproceedings{Chia2021BeyondNB,
  title={Beyond NDCG: behavioral testing of recommender systems with RecList},
  author={Patrick John Chia and Jacopo Tagliabue and Federico Bianchi and Chloe He and Brian Ko},
  year={2021}
}

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Comments
  • Reclist tutorial - Hit Rate by Brand graph

    Description

    I ran through the provided RecList Colab Notebook tutorial. I also read the blog on RecList.

    In the blog post, it shows the Hit Rate @ 10 by brand. Picture below.

    image

    I did not see how to generate the HR@10 by brand in the Colab Notebook example. Would you mind sharing the code on how to generate that please?

    Thank you

    opened by kmcmearty 1
  • How to pick use case clarification

    Description

    When I pick a use case from RecList and I want to build off of one of them, e.g. complementary items, does that mean I would instantiate the CoveoCartRecList class?

    From the Colab notebook tutorial:

    `In particular in this tutorial, we are targeting a "complementary items" use cases, such as for example a cart recommender:

    if a shopper added item X to the cart, what is she likely to add next? We train a simple, yet effective prod2vec baseline model, re-using for convenience a "training embedding" function already implemented by recsys. `

    For example, say I'm interested in the similar item or session based use cases from RecList, would I call MovieLensSimilarItemRecList or SpotifySessionRecList() to start my build from?

    I'm trying to get some clarification on the right way to use this library from the blog post and colab notebook when it says "Step #1 Pick a use case"

    Thank you

    opened by kmcmearty 1
  • May you share y_train data for CoveoDataset please

    Could you please share the data for CoveoDataset? As we can see, y_train is not provided: https://github.com/jacopotagliabue/reclist/blob/main/reclist/datasets.py

        self._x_train = data["x_train"]
        self._y_train = None
        self._x_test = data["x_test"]
        self._y_test = data["y_test"]
        self._catalog = data["catalog"]
    

    x_train can be downloaded (len(x_train) is 927357), but y_train is not.

    opened by Sandy4321 1
  • Update precision_at_k

    Description

    Line 123 in reclist/reclist/metrics/standard_metrics.py: is precision@k the percentage of the relevant documents that is viewed? Is that the definition followed here?

    def precision_at_k(y_preds, y_test, k=3):
        precision_ls = [len(set(_y).intersection(set(_p[:k]))) / len(_p) if _p else 1 for _p, _y in zip(y_preds, y_test)]
        return np.average(precision_ls)
    

    Shouldn't len(_p) be len(_p[:k]), or is precision@k defined in some other way? This is what I think precision@k should be if we follow the standard definition:

    
    def precision_at_k(y_preds, y_test, k=3):
        precision_ls = [len(set(_y).intersection(set(_p[:k]))) / len(_p[:k]) if _p else 1 for _p, _y in zip(y_preds, y_test)]
        return np.average(precision_ls)
    
    opened by nsbits 0
  • Datalab 4426: Add BBC Sounds datasets and models

    This PR adds classes and methods to load a BBC Sounds dataset from GCP; it also adds a dummy LightFM model, which loads pre-generated predictions, and functions to generate metrics by custom (age and gender) groups.

    opened by alepiscopo 0
  • Feature Requests: Smaller Datasets as example.

    • RecList version: 0.3.1
    • Python version: 3.11
    • Operating System: Linux/Windows

    Description

    Example datasets are really large. They tend to make the system run out of memory. Can we have smaller example datasets to run the tests?

    enhancement 
    opened by unna97 0
  • Issue with Reclist package pip install.

    • RecList version:
    • Python version: 3.10
    • Operating System: Windows

    Description

    Failed to install reclist on the local machine (Windows) and on GitHub Codespaces (Linux); the installation runs into an issue with matplotlib and gensim.

    What I Did

    Upgrade the versions in requirements.txt

    !pip install reclist
    
    opened by unna97 1
  • Update popularity bias at k

    Description

    Line 118 in reclist/reclist/metrics/standard_metrics.py: the denominator should be len(p[:k]).

    def popularity_bias_at_k(y_preds, x_train, k=3):
        # estimate popularity from training data
        pop_map = collections.defaultdict(lambda : 0)
        num_interactions = 0
        for session in x_train:
            for event in session:
                pop_map[event] += 1
                num_interactions += 1
        # normalize popularity
        pop_map = {k:v/num_interactions for k,v in pop_map.items()}
        all_popularity = []
        for p in y_preds:
            average_pop = sum(pop_map.get(_, 0.0) for _ in p[:k]) / len(p) if len(p) > 0 else 0
            all_popularity.append(average_pop)
        return sum(all_popularity) / len(y_preds)
    

    Shouldn't len(p) be len(p[:k])? We would then only be looking at the top-k slice. This is what I think popularity_bias@k should be:

    
    def popularity_bias_at_k(y_preds, x_train, k=3):
        # estimate popularity from training data
        pop_map = collections.defaultdict(lambda : 0)
        num_interactions = 0
        for session in x_train:
            for event in session:
                pop_map[event] += 1
                num_interactions += 1
        # normalize popularity
        pop_map = {k:v/num_interactions for k,v in pop_map.items()}
        all_popularity = []
        for p in y_preds:
            average_pop = sum(pop_map.get(_, 0.0) for _ in p[:k]) / len(p[:k]) if len(p) > 0 else 0
            all_popularity.append(average_pop)
        return sum(all_popularity) / len(y_preds)
    
    opened by nsbits 0
Owner
Jacopo Tagliabue
I failed the Turing Test once, but that was many friends ago.
Attack classification models with transferability, black-box attack; unrestricted adversarial attacks on imagenet

Attack classification models with transferability, black-box attack; unrestricted adversarial attacks on ImageNet, CVPR2021 Security AI Challengers Program Round 6: ImageNet Unrestricted Adversarial Attacks, 4th place in the finals (team name: Advers)

null 51 Dec 1, 2022
transfer attack; adversarial examples; black-box attack; unrestricted Adversarial Attacks on ImageNet; CVPR2021 Tianchi black-box competition

transfer_adv CVPR-2021 AIC-VI: unrestricted Adversarial Attacks on ImageNet, CVPR2021 Security AI Challengers Program Round 6, Track 2: ImageNet Unrestricted Adversarial Attacks. Introduction: deep neural networks have achieved state-of-the-art performance on various visual recognition problems.

null 25 Dec 8, 2022
Code for "Diversity can be Transferred: Output Diversification for White- and Black-box Attacks"

Output Diversified Sampling (ODS) This is the github repository for the NeurIPS 2020 paper "Diversity can be Transferred: Output Diversification for W

null 50 Dec 11, 2022
[CVPR 2021] Pytorch implementation of Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs

Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs In this work, we propose a framework HijackGAN, which enables non-linear latent space travers

Hui-Po Wang 46 Sep 5, 2022
Explainer for black box models that predict molecule properties

Explaining why that molecule exmol is a package to explain black-box predictions of molecules. The package uses model agnostic explanations to help us

White Laboratory 172 Dec 19, 2022
A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

Yunxia Zhao 3 Dec 29, 2022
Code for "Retrieving Black-box Optimal Images from External Databases" (WSDM 2022)

Retrieving Black-box Optimal Images from External Databases (WSDM 2022) We propose how a user retreives an optimal image from external databases of we

joisino 5 Apr 13, 2022
This repository contains the code and models necessary to replicate the results of paper: How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective

Black-Box-Defense This repository contains the code and models necessary to replicate the results of our recent paper: How to Robustify Black-Box ML M

OPTML Group 2 Oct 5, 2022
A Comparative Framework for Multimodal Recommender Systems

Cornac Cornac is a comparative framework for multimodal recommender systems. It focuses on making it convenient to work with models leveraging auxilia

Preferred.AI 671 Jan 3, 2023
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie_recs Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Coll

ShopRunner 97 Jan 3, 2023
Open-sourcing the Slates Dataset for recommender systems research

FINN.no Recommender Systems Slate Dataset This repository accompany the paper "Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sa

FINN.no 48 Nov 28, 2022
NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

NVIDIA Merlin NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs. It enables data scientists, machine

null 419 Jan 3, 2023
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Collie do

ShopRunner 96 Dec 29, 2022
An efficient PyTorch implementation of the evaluation metrics in recommender systems.

recsys_metrics An efficient PyTorch implementation of the evaluation metrics in recommender systems. Overview • Installation • How to use • Benchmark

Xingdong Zuo 12 Dec 2, 2022
Crab is a flexible, fast recommender engine for Python that integrates classic information filtering recommendation algorithms in the world of scientific Python packages (numpy, scipy, matplotlib).

Crab - A Recommendation Engine library for Python Crab is a flexible, fast recommender engine for Python that integrates classic information filtering r

python-recsys 1.2k Dec 21, 2022
A python library for implementing a recommender system

python-recsys A python library for implementing a recommender system. Installation Dependencies python-recsys is build on top of Divisi2, with csc-pys

Oscar Celma 1.5k Dec 17, 2022
StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking

StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking Datasets You can download datasets that have been pre-pr

null 25 May 29, 2022
A TikTok-like recommender system for GitHub repositories based on Gorse

GitRec GitRec is the missing recommender system for GitHub repositories based on Gorse. Architecture The trending crawler crawls trending repositories

null 337 Jan 4, 2023