Behavioral "black-box" testing for recommender systems

Overview

RecList

Documentation Status Contributors License Downloads

RecList

Overview

RecList is an open source library providing behavioral, "black-box" testing for recommender systems. Inspired by the pioneering work of Ribeiro et al. 2020 in NLP, we introduce a general plug-and-play procedure to scale up behavioral testing, with an easy-to-extend interface for custom use cases.

RecList ships with some popular datasets and ready-made behavioral tests: check the paper for more details on the relevant literature and the philosophical motivations behind the project.

If you are not familiar with the library, we suggest first taking our small tour to get acquainted with the main abstractions through ready-made models and public datasets.

Quick Links

  • Our paper, with in-depth analysis, detailed use cases and scholarly references.
  • A colab notebook (WIP), showing how to train a cart recommender model from scratch and use the library to test it.
  • Our blog post (forthcoming), with examples and practical tips.

Project updates

Nov. 2021: the library is currently in alpha (i.e. enough working code to finish the paper and tinker with it). We welcome early feedback, but please be advised that the package may change substantially in the upcoming months ("If you're not embarrassed by the first version, you've launched too late").

As the project is in active development, come back often for updates.

Summary

This doc is structured as follows:

  • Quick Start
  • A Guided Tour
  • Capabilities
  • Roadmap
  • Contributing
  • Acknowledgments
  • License and Citation

Quick Start

If you want to see RecList in action, clone the repository, create and activate a virtual env, and install the required packages from root. If you prefer to experiment in an interactive, no-installation-required fashion, try out our colab notebook.

Sample scripts are divided by use case: similar items, complementary items, or session-based recommendations. When you execute one, a suitable public dataset is downloaded and a baseline ML model is trained; finally, the script runs a pre-made suite of behavioral tests to show typical results.

git clone https://github.com/jacopotagliabue/reclist
cd reclist
python3 -m venv venv
source venv/bin/activate
pip install -e .
python examples/coveo_complementary_rec.py

Running your model on one of the supported datasets, leveraging the pre-made tests, is as easy as implementing a simple interface, RecModel.

Once you've successfully run the sample script, take the guided tour below to learn more about the abstractions and the out-of-the-box capabilities of RecList.

A Guided Tour

An instance of RecList represents a suite of tests for recommender systems: given a dataset (more appropriately, an instance of RecDataset) and a model (an instance of RecModel), it will run the specified tests on the target dataset, using the supplied model.

For example, the following code instantiates a pre-made suite of tests that contains sensible defaults for a cart recommendation use case:

rec_list = CoveoCartRecList(
    model=model,
    dataset=coveo_dataset
)
# invoke rec_list to run tests
rec_list(verbose=True)

Our library pre-packages standard RecSys KPIs and important behavioral tests, divided by use case, but it is built with extensibility in mind: you can re-use tests in new suites, or you can write new domain-specific suites and tests.

Any suite must inherit the RecList interface, and then declare its tests with Pythonic decorators: in this case, the test re-uses a standard function:

class MyRecList(RecList):

    @rec_test(test_type='stats')
    def basic_stats(self):
        """
        Basic statistics on training, test and prediction data
        """
        from reclist.metrics.standard_metrics import statistics
        return statistics(self._x_train,
            self._y_train,
            self._x_test,
            self._y_test,
            self._y_preds)

Any model can be tested, as long as its predictions are wrapped in a RecModel. This allows for pure "black-box" testing: a SaaS provider can be tested just by wrapping the proper API call in the method:

class MyCartModel(RecModel):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def predict(self, prediction_input: list, *args, **kwargs):
        """
        Implement the abstract method, accepting a list of lists, each list being
        the content of a cart: the predictions returned by the model are the top K
        items suggested to complete the cart.
        """

        return
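
As a concrete illustration, here is a minimal sketch of such a wrapper; the endpoint URL, payload, and response schema are purely hypothetical, so adapt the call to whatever provider or internal service you are testing:

import requests

class MySaaSCartModel(RecModel):

    def predict(self, prediction_input: list, *args, **kwargs):
        """
        For each cart (a list of items), call a (hypothetical) recommendation
        endpoint and collect the top K items suggested to complete the cart.
        """
        predictions = []
        for cart in prediction_input:
            # NOTE: URL, payload and response schema are made up for illustration
            response = requests.post(
                "https://api.my-provider.example/complete_cart",
                json={"cart": cart, "k": 10},
                timeout=10
            )
            predictions.append(response.json().get("items", []))
        return predictions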

While many standard KPIs are available in the package, the philosophy behind RecList is that metrics like Hit Rate provide only a partial picture of the expected behavior of recommenders in the wild: two models with very similar accuracy can have very different behavior on, say, the long-tail, or model A can be better than model B overall, but at the expense of providing disastrous performance on a set of inputs that are particularly important in production.

RecList recognizes that, outside of academic benchmarks, some mistakes are worse than others, and not all inputs are created equal: when possible, it tries to operationalize behavioral insights for debugging and error analysis through scalable code; it also provides extensible abstractions for when domain knowledge and custom logic are needed.
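
For instance, a custom test in the spirit of the suite shown above could report hit rate only on long-tail targets. The following is a rough sketch: the cutoff, the hypothetical test name, and the exact shape of the internal attributes are assumptions based on the examples in this doc, not the library's official API:

class MyLongTailRecList(RecList):

    @rec_test(test_type='long_tail_hit_rate')
    def long_tail_hit_rate(self):
        """
        Sketch: hit rate @ 10 restricted to test cases whose target item was
        seen at most 10 times in training, to surface behavior on rare items
        that an aggregate hit rate would average away.
        """
        from collections import Counter
        # count item occurrences in the training sessions
        counts = Counter(item for session in self._x_train for item in session)
        hits, total = 0, 0
        for target, preds in zip(self._y_test, self._y_preds):
            if not target or counts[target[0]] > 10:
                continue  # keep only long-tail targets
            total += 1
            hits += int(target[0] in preds[:10])
        return hits / total if total else None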

Once you run a suite of tests, results are dumped automatically and versioned in a local folder, structured as follows (name of the suite, name of the model, run timestamp):

.reclist/
  myList/
    myModel/
      1637357392/
      1637357404/

We provide a simple (and very WIP) UI to easily compare runs and models. After you run one of the example scripts twice, you can do:

cd app
python app.py

to start a local web app that lets you explore test results:

https://github.com/jacopotagliabue/reclist/blob/main/images/explorer.png

If you select more than one model, the app will automatically build comparison tables:

https://github.com/jacopotagliabue/reclist/blob/main/images/comparison.png

If you start using RecList as part of your standard testing - either for research or production purposes - you can use the JSON report for machine-to-machine communication with downstream systems (e.g. you may want to automatically fail the model pipeline if certain behavioral tests are not passed).
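
A minimal sketch of such a gate is shown below; it assumes the run folder contains a JSON file mapping test names to numeric results (the actual file name and report schema may differ, so adapt it to the report your version of RecList produces):

import json
import pathlib
import sys

# pick the latest run for a given suite / model (folder layout shown above)
runs = sorted(pathlib.Path(".reclist/myList/myModel").iterdir())
latest = runs[-1]

# NOTE: file name and report schema are assumptions for this sketch
report = json.loads((latest / "results.json").read_text())

# fail the pipeline if a behavioral metric falls below a chosen threshold
if report.get("hit_rate_at_10", 0.0) < 0.05:
    print("Behavioral gate failed for run {}".format(latest.name))
    sys.exit(1)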

Capabilities

RecList provides a dataset and model agnostic framework to scale up behavioral tests. As long as the proper abstractions are implemented, all the out-of-the-box components can be re-used. For example:

  • you can use a public dataset provided by RecList to train your new cart recommender model, and then use the RecTests we provide for that use case;
  • you can use one of our baseline models on your custom dataset, to establish a reference point for your project;
  • you can use a custom model on a private dataset and define a new suite of tests from scratch, mixing existing methods and domain-specific tests.

We list below what we currently support out-of-the-box, with a particular focus on datasets and tests: the models we provide are convenient baselines, but they are not meant to be SOTA research models.

Datasets

RecList features convenient wrappers around popular datasets, to help test models over known benchmarks in a standardized way.
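
For example, the Coveo dataset used throughout this doc can be loaded in a couple of lines and passed straight to a suite such as CoveoCartRecList (a sketch: the class lives in reclist/datasets.py, but constructor arguments may vary across versions):

from reclist.datasets import CoveoDataset

# instantiating the wrapper downloads and caches the public dataset,
# exposing the train/test splits expected by the pre-made RecLists
coveo_dataset = CoveoDataset()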

Behavioral Tests

Coming soon!

Roadmap

To do:

  • the app is just a stub: improve the report "contract" and extend the app's capabilities, possibly including it in the library itself;
  • continue adding default RecTests by use case, and test them on public datasets;
  • improve our test suites and refactor some abstractions;
  • add Colab tutorials, extensive documentation and a blog-like write-up to explain basic usage.

We maintain a small Trello board on the project which we plan on sharing with the community: more details coming soon!

Contributing

We will update this repo with some guidelines for contributions as soon as the codebase becomes more stable. Check back often for updates!

Acknowledgments

The main contributors are:

If you have questions or feedback, please reach out to: jacopo dot tagliabue at tooso dot ai.

License and Citation

All the code is released under an open MIT license. If you found RecList useful, or you are using it to benchmark/debug your model, please cite our pre-print (forthcoming):

@inproceedings{Chia2021BeyondNB,
  title={Beyond NDCG: behavioral testing of recommender systems with RecList},
  author={Patrick John Chia and Jacopo Tagliabue and Federico Bianchi and Chloe He and Brian Ko},
  year={2021}
}

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Comments
  • Reclist tutorial - Hit Rate by Brand graph

    Description

    I ran through the provided RecList Colab Notebook tutorial. I also read the blog on RecList.

    In the blog post, it shows the Hit Rate @ 10 by brand. Picture below.

    image

    I did not see how to generate the HR@10 by brand in the Colab Notebook example. Would you mind sharing the code on how to generate that please?

    Thank you

    opened by kmcmearty 1
  • How to pick use case clarification

    Description

    When I pick a use case from RecList and I want to build off of one of them, e.g. complementary items, does that mean I would instantiate the CoveoCartRecList class?

    From the Colab notebook tutorial:

    `In particular in this tutorial, we are targeting a "complementary items" use cases, such as for example a cart recommender:

    if a shopper added item X to the cart, what is she likely to add next? We train a simple, yet effective prod2vec baseline model, re-using for convenience a "training embedding" function already implemented by recsys. `

    For example, say I'm interested in the similar item or session based use cases from RecList, would I call MovieLensSimilarItemRecList or SpotifySessionRecList() to start my build from?

    I'm trying to get some clarification on the right way to use this library from the blog post and colab notebook when it says "Step #1 Pick a use case"

    Thank you

    opened by kmcmearty 1
  • May you share y_train data for CoveoDataset please

    Could you please share the data for CoveoDataset? As we can see, y_train is not provided: https://github.com/jacopotagliabue/reclist/blob/main/reclist/datasets.py

        self._x_train = data["x_train"]
        self._y_train = None
        self._x_test = data["x_test"]
        self._y_test = data["y_test"]
        self._catalog = data["catalog"]
    

    x_train can be downloaded (len(x_train) is 927357), but y_train is not.

    opened by Sandy4321 1
  • Update precision_at_k

    Description

    Line 123 in reclist/reclist/metrics/standard_metrics.py: is precision@k the percentage of the relevant documents that is viewed? Is that the definition followed here?

    def precision_at_k(y_preds, y_test, k=3):
        precision_ls = [len(set(_y).intersection(set(_p[:k]))) / len(_p) if _p else 1 for _p, _y in zip(y_preds, y_test)]
        return np.average(precision_ls)
    

    Shouldn't len(_p) be len(_p[:k]), or is precision@k defined in some other way? This is what I think precision@k should be if we follow the standard definition:

    
    def precision_at_k(y_preds, y_test, k=3):
        precision_ls = [len(set(_y).intersection(set(_p[:k]))) / len(_p[:k]) if _p else 1 for _p, _y in zip(y_preds, y_test)]
        return np.average(precision_ls)
    
    opened by nsbits 0
  • Datalab 4426: Add BBC Sounds datasets and models

    This PR adds classes and methods to load a BBC Sounds dataset from GCP; it also adds a dummy LightFM model, which loads pre-generated predictions, and functions to generate metrics by custom (age and gender) groups.

    opened by alepiscopo 0
  • Feature Requests: Smaller Datasets as example.

    • RecList version: 0.3.1
    • Python version: 3.11
    • Operating System: Linux/Windows

    Description

    Example datasets are really large. They tend to make the system run out of memory. Can we have smaller example datasets to run the tests?

    enhancement 
    opened by unna97 0
  • Issue with Reclist package pip install.

    • RecList version:
    • Python version: 3.10
    • Operating System: Windows

    Description

    Failed to install reclist on the local machine (Windows) and on GitHub Codespaces (Linux); the installation runs into an issue with matplotlib and gensim.

    What I Did

    Upgrade the versions in requirements.txt

    !pip install reclist
    
    opened by unna97 1
  • Update popularity bias at k

    Description

    Line 118 in reclist/reclist/metrics/standard_metrics.py: the denominator should be len(p[:k]).

    def popularity_bias_at_k(y_preds, x_train, k=3):
        # estimate popularity from training data
        pop_map = collections.defaultdict(lambda : 0)
        num_interactions = 0
        for session in x_train:
            for event in session:
                pop_map[event] += 1
                num_interactions += 1
        # normalize popularity
        pop_map = {k:v/num_interactions for k,v in pop_map.items()}
        all_popularity = []
        for p in y_preds:
            average_pop = sum(pop_map.get(_, 0.0) for _ in p[:k]) / len(p) if len(p) > 0 else 0
            all_popularity.append(average_pop)
        return sum(all_popularity) / len(y_preds)
    

    Shouldn't len(p) be len(p[:k])? We would then only be looking at the top-k slice. This is what I think popularity_bias@k should be:

    
    def popularity_bias_at_k(y_preds, x_train, k=3):
        # estimate popularity from training data
        pop_map = collections.defaultdict(lambda : 0)
        num_interactions = 0
        for session in x_train:
            for event in session:
                pop_map[event] += 1
                num_interactions += 1
        # normalize popularity
        pop_map = {k:v/num_interactions for k,v in pop_map.items()}
        all_popularity = []
        for p in y_preds:
            average_pop = sum(pop_map.get(_, 0.0) for _ in p[:k]) / len(p[:k]) if len(p) > 0 else 0
            all_popularity.append(average_pop)
        return sum(all_popularity) / len(y_preds)
    
    opened by nsbits 0
Owner
Jacopo Tagliabue
I failed the Turing Test once, but that was many friends ago.
Attack classification models with transferability, black-box attack; unrestricted adversarial attacks on imagenet

Attack classification models with transferability, black-box attack; unrestricted adversarial attacks on ImageNet, CVPR2021 Security AI Challengers Program Round 6: ImageNet Unrestricted Adversarial Attacks, 4th place in the finals (team name: Advers)

null 51 Dec 1, 2022
transfer attack; adversarial examples; black-box attack; unrestricted Adversarial Attacks on ImageNet; CVPR2021 Tianchi black-box competition

transfer_adv CVPR-2021 AIC-VI: unrestricted Adversarial Attacks on ImageNet, CVPR2021 Security AI Challengers Program Round 6, Track 2: ImageNet Unrestricted Adversarial Attacks. Introduction: deep neural networks have achieved state-of-the-art performance on various visual recognition problems.

null 25 Dec 8, 2022
Code for "Diversity can be Transferred: Output Diversification for White- and Black-box Attacks"

Output Diversified Sampling (ODS) This is the github repository for the NeurIPS 2020 paper "Diversity can be Transferred: Output Diversification for W

null 50 Dec 11, 2022
[CVPR 2021] Pytorch implementation of Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs

Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs In this work, we propose a framework HijackGAN, which enables non-linear latent space travers

Hui-Po Wang 46 Sep 5, 2022
Explainer for black box models that predict molecule properties

Explaining why that molecule exmol is a package to explain black-box predictions of molecules. The package uses model agnostic explanations to help us

White Laboratory 172 Dec 19, 2022
A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

Yunxia Zhao 3 Dec 29, 2022
Code for "Retrieving Black-box Optimal Images from External Databases" (WSDM 2022)

Retrieving Black-box Optimal Images from External Databases (WSDM 2022) We propose how a user retreives an optimal image from external databases of we

joisino 5 Apr 13, 2022
This repository contains the code and models necessary to replicate the results of paper: How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective

Black-Box-Defense This repository contains the code and models necessary to replicate the results of our recent paper: How to Robustify Black-Box ML M

OPTML Group 2 Oct 5, 2022
A Comparative Framework for Multimodal Recommender Systems

Cornac Cornac is a comparative framework for multimodal recommender systems. It focuses on making it convenient to work with models leveraging auxilia

Preferred.AI 671 Jan 3, 2023
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie_recs Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Coll

ShopRunner 97 Jan 3, 2023
Open-sourcing the Slates Dataset for recommender systems research

FINN.no Recommender Systems Slate Dataset This repository accompany the paper "Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sa

FINN.no 48 Nov 28, 2022
NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

NVIDIA Merlin NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs. It enables data scientists, machine

null 419 Jan 3, 2023
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Collie do

ShopRunner 96 Dec 29, 2022
An efficient PyTorch implementation of the evaluation metrics in recommender systems.

recsys_metrics An efficient PyTorch implementation of the evaluation metrics in recommender systems. Overview • Installation • How to use • Benchmark

Xingdong Zuo 12 Dec 2, 2022
Crab is a flexible, fast recommender engine for Python that integrates classic information filtering recommendation algorithms in the world of scientific Python packages (numpy, scipy, matplotlib).

Crab - A Recommendation Engine library for Python Crab is a flexible, fast recommender engine for Python that integrates classic information filtering r

python-recsys 1.2k Dec 21, 2022
A python library for implementing a recommender system

python-recsys A python library for implementing a recommender system. Installation Dependencies python-recsys is build on top of Divisi2, with csc-pys

Oscar Celma 1.5k Dec 17, 2022
StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking

StackRec: Efficient Training of Very Deep Sequential Recommender Models by Iterative Stacking Datasets You can download datasets that have been pre-pr

null 25 May 29, 2022
A TikTok-like recommender system for GitHub repositories based on Gorse

GitRec GitRec is the missing recommender system for GitHub repositories based on Gorse. Architecture The trending crawler crawls trending repositories

null 337 Jan 4, 2023