Best Practices on Recommendation Systems

Microsoft

Last update: Jan 3, 2023

Overview

Recommenders

What's New (February 4, 2021)

We have a new release: Recommenders 2021.2!

It comes with lots of bug fixes, optimizations and three new algorithms: GeoIMC, Standard VAE and Multinomial VAE. We also added tools to facilitate the use of the Microsoft News dataset (MIND). In addition, we published our KDD2020 tutorial, where we built a recommender of COVID papers using Microsoft Academic Graph.

We also changed the default branch from master to main. Now when you download the repo, you will get the main branch.

See past announcements in NEWS.md.

Introduction

This repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:

Prepare Data: Preparing and loading data for each recommender algorithm
Model: Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (ALS) or eXtreme Deep Factorization Machines (xDeepFM).
Evaluate: Evaluating algorithms with offline metrics
Model Select and Optimize: Tuning and optimizing hyperparameters for recommender models
Operationalize: Operationalizing models in a production environment on Azure

Several utilities are provided in reco_utils to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are included for self-study and customization in your own applications. See the reco_utils documentation.
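As a rough illustration of the train/test splitting task mentioned above, the following is a minimal sketch of a per-user (stratified) split in plain Python. It is not the reco_utils implementation; the function name stratified_split, the row layout and the default column name are assumptions made only for this example.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratio=0.75, user_key="userID", seed=42):
    """Split rows so that each user keeps roughly `ratio` of their rows in train.

    Hypothetical helper for illustration only; not part of reco_utils.
    """
    rng = random.Random(seed)

    # Group the interactions by user so the ratio is applied per user.
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_key]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * ratio)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Two users with eight interactions each.
rows = [{"userID": u, "itemID": i, "rating": 1.0} for u in (1, 2) for i in range(8)]
train, test = stratified_split(rows, ratio=0.75)
```

Splitting per user rather than over the whole table means every user with enough interactions appears in both sets, which matters for algorithms that cannot score users unseen during training.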

For a more detailed overview of the repository, please see the documents on the wiki page.

Getting Started

Please see the setup guide for more details on setting up your machine locally, on a data science virtual machine (DSVM) or on Azure Databricks.

To set up on your local machine:

Install Anaconda with Python >= 3.6. Miniconda is a quick way to get started.
Clone the repository

git clone https://github.com/Microsoft/Recommenders

Run the conda file generation script, then create a conda environment from the resulting file. (This creates a basic Python environment; see SETUP.md for PySpark and GPU environment setup.)

cd Recommenders
python tools/generate_conda_file.py
conda env create -f reco_base.yaml

Activate the conda environment and register it with Jupyter:

conda activate reco_base
python -m ipykernel install --user --name reco_base --display-name "Python (reco)"

Start the Jupyter notebook server

jupyter notebook

Run the SAR Python CPU MovieLens notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".

NOTE - The Alternating Least Squares (ALS) notebooks require a PySpark environment to run. Please follow the steps in the setup guide to run these notebooks in a PySpark environment. For the deep learning algorithms, it is recommended to use a GPU machine.

Algorithms

The table below lists the recommender algorithms currently available in the repository. Notebooks are linked under the Environment column when different implementations are available.

Algorithm	Environment	Type	Description
Alternating Least Squares (ALS)	PySpark	Collaborative Filtering	Matrix factorization algorithm for explicit or implicit feedback in large datasets, optimized by Spark MLLib for scalability and distributed computing capability
Attentive Asynchronous Singular Value Decomposition (A2SVD)^*	Python CPU / Python GPU	Collaborative Filtering	Sequential-based algorithm that aims to capture both long and short-term user preferences using attention mechanism
Cornac/Bayesian Personalized Ranking (BPR)	Python CPU	Collaborative Filtering	Matrix factorization algorithm for predicting item ranking with implicit feedback
Convolutional Sequence Embedding Recommendation (Caser)	Python CPU / Python GPU	Collaborative Filtering	Algorithm based on convolutions that aims to capture both the user’s general preferences and sequential patterns
Deep Knowledge-Aware Network (DKN)^*	Python CPU / Python GPU	Content-Based Filtering	Deep learning algorithm incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations
Extreme Deep Factorization Machine (xDeepFM)^*	Python CPU / Python GPU	Hybrid	Deep learning based algorithm for implicit and explicit feedback with user/item features
FastAI Embedding Dot Bias (FAST)	Python CPU / Python GPU	Collaborative Filtering	General purpose algorithm with embeddings and biases for users and items
LightFM/Hybrid Matrix Factorization	Python CPU	Hybrid	Hybrid matrix factorization algorithm for both implicit and explicit feedback
LightGBM/Gradient Boosting Tree^*	Python CPU / PySpark	Content-Based Filtering	Gradient Boosting Tree algorithm for fast training and low memory usage in content-based problems
LightGCN	Python CPU / Python GPU	Collaborative Filtering	Deep learning algorithm which simplifies the design of GCN for predicting implicit feedback
GeoIMC	Python CPU	Hybrid	Matrix completion algorithm that takes into account user and item features, using Riemannian conjugate gradients optimization and following a geometric approach.
GRU4Rec	Python CPU / Python GPU	Collaborative Filtering	Sequential-based algorithm that aims to capture both long and short-term user preferences using recurrent neural networks
Multinomial VAE	Python CPU / Python GPU	Collaborative Filtering	Generative Model for predicting user/item interactions
Neural Recommendation with Long- and Short-term User Representations (LSTUR)^*	Python CPU / Python GPU	Content-Based Filtering	Neural recommendation algorithm with long- and short-term user interest modeling
Neural Recommendation with Attentive Multi-View Learning (NAML)^*	Python CPU / Python GPU	Content-Based Filtering	Neural recommendation algorithm with attentive multi-view learning
Neural Collaborative Filtering (NCF)	Python CPU / Python GPU	Collaborative Filtering	Deep learning algorithm with enhanced performance for implicit feedback
Neural Recommendation with Personalized Attention (NPA)^*	Python CPU / Python GPU	Content-Based Filtering	Neural recommendation algorithm with personalized attention network
Neural Recommendation with Multi-Head Self-Attention (NRMS)^*	Python CPU / Python GPU	Content-Based Filtering	Neural recommendation algorithm with multi-head self-attention
Next Item Recommendation (NextItNet)	Python CPU / Python GPU	Collaborative Filtering	Algorithm based on dilated convolutions and residual network that aims to capture sequential patterns
Restricted Boltzmann Machines (RBM)	Python CPU / Python GPU	Collaborative Filtering	Neural network based algorithm for learning the underlying probability distribution for explicit or implicit feedback
Riemannian Low-rank Matrix Completion (RLRMC)^*	Python CPU	Collaborative Filtering	Matrix factorization algorithm using Riemannian conjugate gradients optimization with small memory consumption.
Simple Algorithm for Recommendation (SAR)^*	Python CPU	Collaborative Filtering	Similarity-based algorithm for implicit feedback dataset
Short-term and Long-term preference Integrated Recommender (SLi-Rec)^*	Python CPU / Python GPU	Collaborative Filtering	Sequential-based algorithm that aims to capture both long and short-term user preferences using attention mechanism, a time-aware controller and a content-aware controller
Standard VAE	Python CPU / Python GPU	Collaborative Filtering	Generative Model for predicting user/item interactions
Surprise/Singular Value Decomposition (SVD)	Python CPU	Collaborative Filtering	Matrix factorization algorithm for predicting explicit rating feedback in datasets that are not very large
Term Frequency - Inverse Document Frequency (TF-IDF)	Python CPU	Content-Based Filtering	Simple similarity-based algorithm for content-based recommendations with text datasets
Vowpal Wabbit (VW)^*	Python CPU (online training)	Content-Based Filtering	Fast online learning algorithms, great for scenarios where user features / context are constantly changing
Wide and Deep	Python CPU / Python GPU	Hybrid	Deep learning algorithm that can memorize feature interactions and generalize user features
xLearn/Factorization Machine (FM) & Field-Aware FM (FFM)	Python CPU	Content-Based Filtering	Quick and memory efficient algorithm to predict labels with user/item features

NOTE: ^* indicates algorithms invented/contributed by Microsoft.

Independent or incubating algorithms and utilities are candidates for the contrib folder. This will house contributions which may not easily fit into the core repository or need time to refactor or mature the code and add necessary tests.

Algorithm	Environment	Type	Description
SARplus ^*	PySpark	Collaborative Filtering	Optimized implementation of SAR for Spark

Preliminary Comparison

We provide a benchmark notebook to illustrate how different algorithms could be evaluated and compared. In this notebook, the MovieLens dataset is split into training/test sets at a 75/25 ratio using a stratified split. A recommendation model is trained using each of the collaborative filtering algorithms below. We utilize empirical parameter values reported in literature here. For ranking metrics we use k=10 (top 10 recommended items). We run the comparison on a Standard NC6s_v2 Azure DSVM (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode. In this table we show the results on Movielens 100k, running the algorithms for 15 epochs.

Algo	MAP	nDCG@k	Precision@k	Recall@k	RMSE	MAE	R²	Explained Variance
ALS	0.004732	0.044239	0.048462	0.017796	0.965038	0.753001	0.255647	0.251648
BPR	0.105365	0.389948	0.349841	0.181807	N/A	N/A	N/A	N/A
FastAI	0.025503	0.147866	0.130329	0.053824	0.943084	0.744337	0.285308	0.287671
LightGCN	0.088526	0.419846	0.379626	0.144336	N/A	N/A	N/A	N/A
NCF	0.107720	0.396118	0.347296	0.180775	N/A	N/A	N/A	N/A
SAR	0.110591	0.382461	0.330753	0.176385	1.253805	1.048484	-0.569363	0.030474
SVD	0.012873	0.095930	0.091198	0.032783	0.938681	0.742690	0.291967	0.291971
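To make the ranking columns of the table more concrete, here is a minimal sketch of how Precision@k, Recall@k and nDCG@k can be computed for a single user with binary relevance. The function names and the toy data are assumptions for illustration; the benchmark notebook uses the repository's own evaluation utilities, whose exact definitions may differ, and reports values averaged over users.

```python
import math


def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k=10):
    """Discounted cumulative gain of the top-k list, normalized by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: four recommendations, three relevant items, k=4.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
precision = precision_at_k(recommended, relevant, k=4)
recall = recall_at_k(recommended, relevant, k=4)
ndcg = ndcg_at_k(recommended, relevant, k=4)
```

In the toy example two of the four recommended items are relevant, so precision is 2/4 and recall is 2/3; nDCG additionally rewards placing the hits near the top of the list.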

Contributing

This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

Build Status

These are the nightly builds, which run the smoke and integration tests. main is our principal branch and staging is our development branch. We use pytest for testing the Python utilities in reco_utils and papermill for the notebooks. For more information about the testing pipelines, please see the test documentation.

DSVM Build Status

The following tests run on a Windows and Linux DSVM daily. These machines run 24/7.

Build Type	Branch	Branch
Linux CPU	main	staging
Linux GPU	main	staging
Linux Spark	main	staging

Related projects

Microsoft AI GitHub: Find other Best Practice projects, and Azure AI design patterns in our central repository.
NLP best practices: Best practices and examples on NLP.
Computer vision best practices: Best practices and examples on computer vision.
Forecasting best practices: Best practices and examples on time series forecasting.

Reference papers

A. Argyriou, M. González-Fierro, and L. Zhang, "Microsoft Recommenders: Best Practices for Production-Ready Recommendation Systems", WWW 2020: International World Wide Web Conference Taipei, 2020. Available online: https://dl.acm.org/doi/abs/10.1145/3366424.3382692
L. Zhang, T. Wu, X. Xie, A. Argyriou, M. González-Fierro and J. Lian, "Building Production-Ready Recommendation System at Scale", ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2019 (KDD 2019), 2019.
S. Graham, J.K. Min, T. Wu, "Microsoft recommenders: tools to accelerate developing recommender systems", RecSys '19: Proceedings of the 13th ACM Conference on Recommender Systems, 2019. Available online: https://dl.acm.org/doi/10.1145/3298689.3346967

Comments

Wikidata
Description

The final objective is to use Wikidata as a new Knowledge Graph for recommendation algorithms, and to extract entity descriptions so that new datasets (like MovieLens) can be used with DKN. This is the first step in that direction. I have implemented:

New utils functions to do specific queries in Wikidata:

Query list of related entities from a string representing the name of an entity. The goal is to be able to create a Knowledge Graph from the linked entities in Wikidata

Query the description of an entity from a string representing the name of that entity
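For readers unfamiliar with Wikidata, the sketch below shows how requests of this kind might be formed against the public Wikidata search API and SPARQL endpoint. It only builds the URLs and the query text and sends nothing over the network. The helper names and the query shape are assumptions for illustration and are not the functions added in this PR.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def entity_search_url(name, language="en"):
    """Build a search URL that looks up entities matching a name (hypothetical helper)."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def related_entities_query(entity_id, limit=10):
    """Build a SPARQL query listing entities linked from `entity_id` (hypothetical helper)."""
    return (
        "SELECT ?property ?related ?relatedLabel WHERE { "
        f"wd:{entity_id} ?property ?related . "
        "?related rdfs:label ?relatedLabel . "
        'FILTER(LANG(?relatedLabel) = "en") '
        f"}} LIMIT {limit}"
    )


def sparql_url(query):
    """Wrap a SPARQL query in a GET URL for the query service."""
    return f"{WIKIDATA_SPARQL}?{urlencode({'query': query, 'format': 'json'})}"


search_url = entity_search_url("The Godfather")
# Q42 is used here purely as a placeholder entity ID.
query = related_entities_query("Q42", limit=10)
query_url = sparql_url(query)
```

The linked entities returned by such a query could then be added as edges of a graph, with the searched entity as the source node.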

To test the new functions I have added a notebook. The first section consists of creating a Knowledge Graph from the linked entities in Wikidata and visualising the resulting KG. The second part tests enriching the name of an entity with its description and list of related entities; the goal is to use this enrichment for new datasets (like MovieLens) with DKN.

Related Issues

#525

Checklist:

[x] My code follows the code style of this project, as detailed in our contribution guidelines.

[x] I have added tests. -> I have added tests in the notebook, should I add more?

[x] I have updated the documentation accordingly.
opened by almudenasanz 29

[BUG] Spark smoke test error with Criteo

Description

After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests. We have been running the same code for a long time without this error, so it looks like a performance degradation.

tests/smoke/examples/test_notebooks_pyspark.py .RRRRRF

=================================== FAILURES ===================================
_____________________ test_mmlspark_lightgbm_criteo_smoke ______________________

notebooks = {'als_deep_dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'a..._dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3'

    @pytest.mark.flaky(reruns=5, reruns_delay=2)
    @pytest.mark.smoke
    @pytest.mark.spark
    @pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
    def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["mmlspark_lightgbm_criteo"]
        pm.execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
        )

        results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")[
            "data"
        ]
>       assert results["auc"] == pytest.approx(0.68895, rel=TOL, abs=ABS_TOL)
E       assert 0.6292474883613918 == 0.68895 ± 5.0e-02
E        +  where 0.68895 ± 5.0e-02 = <function approx at 0x7f46b6e30840>(0.68895, rel=0.05, abs=0.05)
E        +    where <function approx at 0x7f46b6e30840> = pytest.approx
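The arithmetic behind this failure can be reproduced without pytest. When both rel and abs are given, pytest.approx accepts a value if either tolerance is met, which is equivalent to using the larger of the two. The helper below is a plain-Python sketch of that rule with a hypothetical name, not pytest's own code.

```python
def within_tolerance(actual, expected, rel=0.05, abs_tol=0.05):
    """Return (passes, tolerance) using the larger of the relative and absolute tolerance."""
    tolerance = max(rel * abs(expected), abs_tol)
    return abs(actual - expected) <= tolerance, tolerance


# Values taken from the failing assertion above.
passes, tolerance = within_tolerance(0.6292474883613918, 0.68895)
difference = abs(0.6292474883613918 - 0.68895)
```

Here the relative tolerance is 0.05 * 0.68895, about 0.034, so the absolute tolerance of 0.05 is the one that applies. The observed AUC differs from the expected value by about 0.0597, which is outside that band, so the assertion fails.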

In which platform does it happen?

How do we replicate the issue?

see details: https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56132&view=logs&j=80b1c078-4399-5286-f869-6bc90f734ab9&t=5e8b8b4f-32ea-5957-d349-aae815b05487

Expected behavior (i.e. solution)

Other Comments

This error is so weird. Did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms

bug

opened by miguelgfierro 27

Docker Support
Description

Initial PR adding Docker support for the PySpark environment. This PR is for discussion with the team, to brainstorm and optimize Docker support for the repo.

NOTE

I did not use conda yaml file in the repo to build conda env because the base image from https://github.com/jupyter/docker-stacks handles Jupyter kernel separately and has already installed many packages that exist in our yaml file.

To keep the image lightweight, the conda/pip packages that appear in both the base image and our yaml file are removed.

TODO

Create pre-built image and publish in Docker hub

Test running the Docker container - I see this as a good example. Maybe we want to adopt it?

Finish the rest of the three Docker images, i.e., CPU and GPU

HOW-TO

A sample image has been created and published on my own Docker Hub account (we can and should create one for the team later on). In a Linux terminal or Windows PowerShell (assuming Docker is already installed on the machine), run:

docker pull yueguoguo/reco_pyspark:latest
docker run --rm -p 8888:8888 yueguoguo/reco_pyspark

Open browser and go to localhost:8888 with the token generated in the above run of the image.

UPDATE 2019-04-08

The pre-built image refers to a branch in the repo that contains only the notebooks executable in that environment. For example, in the PySpark environment, the deep learning notebooks, which are supposed to run in a GPU environment, are removed, because the Python packages they need are not installed in order to keep the image lightweight.

2019-06-21

"One to bind all": the same Dockerfile can be used to build images for different environments, i.e., CPU, GPU, and Spark. This is done with the specific build args cpu, gpu, and pyspark, respectively.

SETUP.md is updated accordingly

Master branch of the repo will be cloned

Related Issues

Discussed in #687

Checklist:

[x] My code follows the code style of this project, as detailed in our contribution guidelines.

[ ] I have added tests.

[x] I have updated the documentation accordingly.
opened by yueguoguo 27
Fix SAR normalization and add accuracy evaluation metrics
Description

The normalization method in the SAR algorithm does not seem to be correct - it is currently implemented as a division of the computed scores by the item similarity matrix for each user we have ratings (unary affinity). If we actually use this normalization technique when evaluating SAR, we get extremely bad relevance and ranking metrics. Furthermore, this method gets rid of outliers and skews the relevance and ordinality of generated recommendations. This shouldn't be the case. Normalizing the scores to the original rating scale should yield identical metrics.

The above fix will allow us to correctly evaluate the accuracy measures like RMSE, MAE, and log loss. This PR also adds that in the sar_movielens.ipynb notebook.
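The claim that rescaling scores to the rating scale should leave ranking metrics unchanged can be illustrated with a small sketch. Any strictly increasing transformation preserves the order of the scores and therefore the top-k list. The min-max rescaling below is only one such transformation, chosen for illustration; it is an assumption and not necessarily the normalization implemented in this PR.

```python
def rescale_scores(scores, rating_min=1.0, rating_max=5.0):
    """Linearly map raw scores onto [rating_min, rating_max] (illustrative only)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: map everything to the middle of the scale.
        midpoint = (rating_min + rating_max) / 2
        return {item: midpoint for item in scores}
    scale = (rating_max - rating_min) / (hi - lo)
    return {item: rating_min + (score - lo) * scale for item, score in scores.items()}


def top_k(scores, k):
    """Return the k items with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:k]]


raw = {"a": 12.0, "b": 3.0, "c": 7.5, "d": 0.0}
rescaled = rescale_scores(raw)
same_ranking = top_k(raw, 3) == top_k(rescaled, 3)
```

Because the mapping is monotonic, MAP, nDCG, Precision@K and Recall@K computed on the rescaled scores match those of the raw scores, while RMSE and MAE become comparable to the original rating scale.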

With this PR we get the same rank/relevance metrics as the non-normalized version and the following accuracy:

Model: Top K: 10 MAP: 0.110591 NDCG: 0.382461 Precision@K: 0.330753 Recall@K: 0.176385 RMSE: 3.697559 MAE: 3.513341 R2: -12.648769 Exp var: -0.442580 Logloss: 3.268522

To illustrate the problem with the current normalization, here are the current metrics with the (incorrect) normalization technique.

Model: Top K: 10 MAP: 0.000045 NDCG: 0.000736 Precision@K: 0.000742 Recall@K: 0.000118

Preview notebook link

We always need to normalize the scores so that RMSE and MAE are computed in the correct scale.

Related Issues

Closes https://github.com/microsoft/recommenders/issues/903

Checklist:

[x] I have followed the contribution guidelines and code style for this project.

[x] I have added tests covering my contributions.

[x] I have updated the documentation accordingly.

[x] This PR is being made to staging and not master.
opened by viktorku 26
[FEATURE] Create ADO pipeline for generating pypi package
Description

Andreas:

[x] Remove the sys path include

[x] Change the current pipeline so it installs the package locally, instead of using the path

[x] Update setup.py

[x] Fix issue with spark tests

[x] Fix the issue with GPU version to match TF 1.15

[x] Update documentation to reflect the new installation process

[x] Review docs evaluation and datasets

[x] test if the wheel package works on Databricks

Miguel:

[x] Add a new yaml file that when there is a new tag, builds a package with bdist, installs it, executes all tests (unit, smoke and integration)

[x] Create the github release draft

[x] Upload artifacts to the github release draft: wheel and compressed code

[x] Publish the wheel to the package limbo only when we are executing a release

[x] Add fix to the spark tests in the pipeline

[x] BUG: remove the wildcard in the installation and use the full name programmatically see comment

[x] Check why we are installing all the deps in CPU, instead of only the CPU ones see run

[x] Make sure we are using the package when running the tests, see comment here

[x] Fix issue with xlearn, it was using gpu deps that were not needed

[x] See if we can simplify the code when forcing exit on error by removing exit -1 on each line and add other instructions. See comment

[x] if smoke tests fail, we don't continue with the integration tests (nice to have feature). Working now

[x] Review docs common, reco, tuning:

[x] common

[x] recommender

[x] tuning

[x] Do a dry run with all the tests (it will take >8h) run worked on 3/6/2021

[x] Fix deeprec unit tests

[x] Analyze flaky tests and see if backoff lib can help (nice to have feature)

[x] Create a tag and check that all tests pass (it will take >8h)

[x] Check why automatic commit gives an error and manual trigger does not. See comment here

[x] test if the wheel package works on Synapse -> if we upload the wheel to Synapse pool, it installs the core deps

[x] Check if we can install extra deps (like spark or GPU) if we do pip install in the pool runtime

[ ] Reduce the time of the GPU smoke tests see details

[ ] Automatically add the tag name to the draft release (nice to have feature)

Yan

[x] Update the documentation to make sure it reflects the latest code changes, see issue https://github.com/microsoft/recommenders/issues/942

[x] Update docs/README.md

[x] automatically build the documentation on 3 environments: latest (main branch), staging and stable (latest tag) using https://readthedocs.org/projects/microsoft-recommenders/

Expected behavior with the suggested feature

Other Comments
enhancement
opened by miguelgfierro 24
Add Cornac BPR deep dive notebook
Description

Add Cornac Bayesian Personalized Ranking (BPR) deep dive notebook

Related Issues

#931

Checklist:

[x] I have followed the contribution guidelines and code style for this project.

[x] I have added tests covering my contributions.

[ ] I have updated the documentation accordingly.
opened by tqtg 19
[ASK] In the NCF deep dive and ncf_movielens notebooks, I used my own dataset instead of MovieLens. It has userID, itemID and ratings (I used counts as the rating, like implicit data). The notebook throws the following error. Could someone help me out with this problem?
Description

Other Comments

Data set looks like this

   rating  userID  itemID
0      12    3468    3644
1       3    3816    3959
2       1    2758    2650
3       1    5056    1593
4      30    3029     192

When I run this cell in the notebook I got the following error

data = NCFDataset(train=train, test=test, seed=SEED)

Error:

TypeError                                 Traceback (most recent call last)
      1 SEED = 10
----> 2 data = NCFDataset(train=train, test=test, seed=SEED)

~/Recommenders/reco_utils/recommender/ncf/dataset.py in __init__(self, train, test, n_neg, n_neg_test, col_user, col_item, col_rating, binary, seed)
     59         # initialize negative sampling for training and test data
     60         self._init_train_data()
---> 61         self._init_test_data()
     62         # set random seed
     63         random.seed(seed)

~/Recommenders/reco_utils/recommender/ncf/dataset.py in _init_test_data(self)
    183         test_interact_status = pd.merge(test_interact_status, self.interact_status, on=self.col_user, how="left")
    184
--> 185         test_interact_status[ self.col_item + "_negative"] = test_interact_status.apply(lambda row: row[self.col_item + "_negative"] - row[self.col_item + "_interacted_test"], axis=1)
    186         test_ratings = pd.merge(self.test, test_interact_status[[self.col_user, self.col_item + "_negative"]], on=self.col_user, how="left")
    187

~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488
   6489     def applymap(self, func):

~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150
--> 151         return self.apply_standard()
    152
    153     def apply_empty_result(self):

~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/apply.py in apply_standard(self)
    255
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258
    259         # wrap results

~/.local/lib/python3.6/site-packages/pandas-0.24.2-py3.6-macosx-10.7-x86_64.egg/pandas/core/apply.py in apply_series_generator(self)
    284         try:
    285             for i, v in enumerate(series_gen):
--> 286                 results[i] = self.f(v)
    287                 keys.append(v.name)
    288         except Exception as e:

~/Recommenders/reco_utils/recommender/ncf/dataset.py in <lambda>(row)
    183         test_interact_status = pd.merge(test_interact_status, self.interact_status, on=self.col_user, how="left")
    184
--> 185         test_interact_status[ self.col_item + "_negative"] = test_interact_status.apply(lambda row: row[self.col_item + "_negative"] - row[self.col_item + "_interacted_test"], axis=1)
    186         test_ratings = pd.merge(self.test, test_interact_status[[self.col_user, self.col_item + "_negative"]], on=self.col_user, how="left")
    187

TypeError: ("unsupported operand type(s) for -: 'float' and 'set'", 'occurred at index 854')
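Judging from the traceback alone, one plausible cause is that some users in the test set have no rows in the training set. The left merge on the user column would then leave a missing value (a float NaN) where a set of negative items is expected, and subtracting a set from it fails. The sketch below, with a hypothetical helper name and toy rows, shows how such users could be detected and reproduces the failing operation in miniature.

```python
def users_missing_from_train(train_rows, test_rows, user_key="userID"):
    """Return the users that appear in test but never in train (hypothetical helper)."""
    train_users = {row[user_key] for row in train_rows}
    test_users = {row[user_key] for row in test_rows}
    return sorted(test_users - train_users)


# Toy data: user 5056 only appears in the test rows.
train = [{"userID": 3468, "itemID": 3644}, {"userID": 3816, "itemID": 3959}]
test = [{"userID": 3468, "itemID": 192}, {"userID": 5056, "itemID": 1593}]
missing = users_missing_from_train(train, test)

# The failing operation in miniature: a NaN placeholder (a float) minus a set.
try:
    float("nan") - {1593}
except TypeError as err:
    message = str(err)
```

If this is indeed the cause, removing those users from the test set, or splitting the data so that every test user also has training interactions, should avoid the error.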

...
help wanted
opened by karthikraja95 19
[DISCUSSION] General folder structure for reco, cv, and forecasting repos

I'm making this discussion public in case any of our users or customers want to provide feedback.

Context

We are building repos around computer vision and time series forecasting. We would like to homogenise the structure between them and the recommenders repo. The CV repo is still starting and the forecast repo has been running for some time internally and it is focused on benchmarks.

The idea is to have a common structure (and user experience) between the 3 repos. Trying to have the best of each: nice examples and utilities from recommenders, nice benchmarks from forecasting repo and support for CV, as well as the other solutions in reco and forecast.

Question

What will be the optimal structure that will help our users and us to build better solutions in recommendations, CV and forecasting?

Please provide detailed answers, for example:

e1) I would take the recommenders structure (notebooks, reco_utils, tests) and rename the folders to X, Y, Z...
e2) I would take the recommenders structure (notebooks, reco_utils, tests) and add a folder for benchmarks...
e3) ...
needs discussion style improvement

opened by miguelgfierro 18
Staging to main (SARplus, SASrec, NCF, RBM etc.)
Description

Merge recent changes into main.

Related Issues

Checklist:

[x] I have followed the contribution guidelines and code style for this project.

[x] I have added tests covering my contributions.

[x] I have updated the documentation accordingly.

[ ] This PR is being made to staging branch and not to main branch.
opened by anargyri 16
Unable to create an appropriately versioned cluster per instructions in reference architecture and als_movie_o16n
What is affected by this bug?

Creating an appropriate cluster

Running a notebook on that cluster

Unit tests.

In which platform does it happen?

Azure Databricks.

How do we replicate the issue?

Create a databricks workspace

Navigate to Clusters

Click [+Create Cluster]

In the Databricks Runtime Version, there is no longer an option for DB 4.1, Spark 2.3.0. It was deprecated on 2019-01-17. See deprecation schedule here.

Expected behavior (i.e. solution)

Workarounds:

It is still possible to create a cluster by cloning a cluster of the recommended version.

It is still possible to create a cluster through the API. Happy to do PR with appropriate json for creating with databricks CLI or directly through the REST API.

Other Comments

Have we tested whether the cosmosdb connector jar works with more current versions of ADB and spark?
documentation
opened by jreynolds01 16
[FEATURE] Set up test machine linux
Description

from https://github.com/microsoft/recommenders/tree/master/tests

Make sure all tests pass:

Unit:

[x] pytest tests/unit -m "not notebooks and not spark and not gpu" --durations 0

[x] pytest tests/unit -m "notebooks and not spark and not gpu"

[x] pytest tests/unit -m "not notebooks and not spark and gpu"

[x] pytest tests/unit -m "notebooks and not spark and gpu"

[x] pytest tests/unit -m "not notebooks and spark and not gpu"

[x] pytest tests/unit -m "notebooks and spark and not gpu"

Smoke:

[x] pytest tests/smoke -m "smoke and not spark and not gpu" --durations 0

[x] pytest tests/smoke -m "smoke and not spark and gpu" --durations 0

[x] pytest tests/smoke -m "smoke and spark and not gpu" --durations 0

Integration:

[x] pytest tests/integration -m "integration and not spark and not gpu" --durations 0

[x] pytest tests/integration -m "integration and not spark and gpu" --durations 0

[x] pytest tests/integration -m "integration and spark and not gpu" --durations 0

Expected behavior with the suggested feature

Other Comments
enhancement
opened by miguelgfierro 15
[ASK]
I built a machine learning recommendation model with Wide and Deep, based on 00_quick_start/wide_deep_movielens.ipynb. When I save the model I get three files [saved_model.pb, variables-data-00000-00001, variables.index]. I can then load this model with

self.model = tf.saved_model.load(path_to_saved_model_and_variables, tags="serve")

And I can make prediction with

self.model.signatures["predict"]

But is it also possible to train this saved model with new data?
help wanted
opened by JeroenMBooij 0
[ASK] SAR with timedecay_formula= False won't work

Description

How should I run SAR if I don't want to implement the time decay formula in SAR?

My model is constructed as:

model = SAR(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type=similarity_type,
    timedecay_formula=False
)

and when fitting the model, it shows:

    237         if df[select_columns].duplicated().any():
--> 238             raise ValueError("There should not be duplicates in the dataframe")
    239
    240         # generate continuous indices if this hasn't been done

ValueError: There should not be duplicates in the dataframe
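The check shown in the traceback rejects dataframes with duplicated rows over the selected columns. Assuming those columns are the user and item columns, one way around the error is to collapse repeated user/item pairs before fitting. The sketch below keeps the most recent interaction per pair; the helper name and the row layout are assumptions for illustration, and summing or averaging the ratings would be an equally reasonable choice.

```python
def keep_latest_interaction(rows, user_key="userID", item_key="itemID", time_key="timestamp"):
    """Keep one row per (user, item) pair: the one with the largest timestamp."""
    latest = {}
    for row in rows:
        key = (row[user_key], row[item_key])
        if key not in latest or row[time_key] > latest[key][time_key]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"userID": 1, "itemID": 10, "rating": 3.0, "timestamp": 100},
    {"userID": 1, "itemID": 10, "rating": 5.0, "timestamp": 200},  # duplicate pair
    {"userID": 2, "itemID": 10, "rating": 4.0, "timestamp": 150},
]
deduplicated = keep_latest_interaction(rows)
```

With a pandas DataFrame, a similar result might be obtained by sorting on the timestamp column and calling drop_duplicates on the user and item columns with keep="last".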

Please advise, thanks!

Other Comments
help wanted

opened by jamie613 1
AzureML tests: Durations, disable warnings and exit -1
Description

Add parameters to pytest: durations and disable warnings. It also adds an exit -1 if there is a failure in the tests.

Related Issues

Fix https://github.com/microsoft/recommenders/issues/1857 and https://github.com/microsoft/recommenders/issues/1852

References

Checklist:

[ ] I have followed the contribution guidelines and code style for this project.

[ ] I have added tests covering my contributions.

[ ] I have updated the documentation accordingly.

[ ] This PR is being made to staging branch and not to main branch.
opened by miguelgfierro 3

[ASK] Error in NCFDataset creation

Description

Hello all, I'm trying to use the NCF_deep_dive notebook with my own data, which has the following structure:

|  | usr_id | code_id | amt_trx | bestelldatum |
| -- | -- | -- | -- | -- |
| 0 | 0 | 35 | 1 | 2022-03-01 |
| 1 | 0 | 2 | 1 | 2022-03-01 |
| 2 | 0 | 18 | 1 | 2022-03-01 |
| 3 | 0 | 9 | 1 | 2022-03-01 |
| 4 | 0 | 0 | 1 | 2022-03-01 |

When I try to create the dataset I get the following error:

data = NCFDataset(train_file=train_file, test_file=leave_one_out_test_file, seed=SEED, overwrite_test_file_full=True, col_user='usr_id', col_item='code_id', col_rating='amt_trx', binary=False)

---------------------------------------------------------------------------
MissingUserException                      Traceback (most recent call last)
Cell In [39], line 1
----> 1 data = NCFDataset(train_file=train_file,
      2                     test_file=leave_one_out_test_file,
      3                     seed=SEED,
      4                     overwrite_test_file_full=True,
      5                     col_user='usr_id',
      6                     col_item='code_id',
      7                     col_rating='amt_trx',
      8                     binary=False)

File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/dataset.py:376, in Dataset.__init__(self, train_file, test_file, test_file_full, overwrite_test_file_full, n_neg, n_neg_test, col_user, col_item, col_rating, binary, seed, sample_with_replacement, print_warnings)
    374         self.test_file_full = os.path.splitext(self.test_file)[0] + "_full.csv"
    375     if self.overwrite_test_file_full or not os.path.isfile(self.test_file_full):
--> 376         self._create_test_file()
    377     self.test_full_datafile = DataFile(
    378         filename=self.test_file_full,
    379         col_user=self.col_user,
   (...)
    383         binary=self.binary,
    384     )
    385 # set random seed

File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/dataset.py:417, in Dataset._create_test_file(self)
    415 if user in train_datafile.users:
    416     user_test_data = test_datafile.load_data(user)
--> 417     user_train_data = train_datafile.load_data(user)
    418     # for leave-one-out evaluation, exclude items seen in both training and test sets
    419     # when sampling negatives
    420     user_positive_item_pool = set(
    421         user_test_data[self.col_item].unique()
    422     ).union(user_train_data[self.col_item].unique())

File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/dataset.py:194, in DataFile.load_data(self, key, by_user)
    192 while (self.line_num == 0) or (self.row[key_col] != key):
    193     if self.end_of_file:
--> 194         raise MissingUserException("User {} not in file {}".format(key, self.filename))
    195     next(self)
    196 # collect user/test batch data

MissingUserException: User 58422 not in file ./train_new.csv

I made some checks:

print(train.usr_id.nunique()) --> output: 81062
print(test.usr_id.nunique()) --> output: 81062
print(leave.usr_id.nunique()) --> output: 81062

I also checked by hand, and user 58422 is in all the files. The types are also the same: I'm using int64 for usr_id, code_id and amt_trx, like the MovieLens dataset.

I can't understand the error. Could you help me, please?
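The `load_data` frames in the traceback advance through the file with `next(self)` until the requested key is found and raise once the end of the file is reached. That pattern suggests the reader moves forward only, so one plausible cause (an assumption, not confirmed in this thread) is that the CSV files are not sorted by user. A minimal sketch for writing a file sorted by user before constructing the dataset follows; the helper names are hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_sorted_by_user(df, path, col_user="usr_id"):
    """Write `df` to CSV with rows in ascending user order."""
    # mergesort is stable, so each user's rows keep their original relative order.
    df.sort_values(col_user, kind="mergesort").to_csv(path, index=False)


def is_sorted_by_user(path, col_user="usr_id"):
    """Return True if the user column in the CSV never decreases."""
    return bool(pd.read_csv(path)[col_user].is_monotonic_increasing)


# Toy data in the same column layout as the issue; values are illustrative only.
train = pd.DataFrame(
    {"usr_id": [2, 0, 1, 0], "code_id": [18, 35, 9, 2], "amt_trx": [1, 1, 1, 1]}
)
train_path = os.path.join(tempfile.mkdtemp(), "train_sorted.csv")
write_sorted_by_user(train, train_path)
```

Running `is_sorted_by_user` on the existing train and test files is a quick way to check whether this assumption applies before rewriting anything.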

Update

If I remove the parameter overwrite_test_file_full, it creates the dataset, but then I can't make predictions because the dataset object didn't create the user2id mapping:

data = NCFDataset(train_file=train_file,
                    test_file=leave_one_out_test_file,
                    seed=SEED,
                    col_user='usr_id',
                    col_item='code_id',
                    col_rating='amt_trx',
                    print_warnings=True)

model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[16,8,4],
    n_epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=1e-3,
    verbose=99,
    seed=SEED
)

predictions = [[row.usr_id, row.code_id, model.predict(row.usr_id, row.code_id)]
               for (_, row) in test.iterrows()]


predictions = pd.DataFrame(predictions, columns=['usr_id', 'code_id', 'prediction'])
predictions.head()

AttributeError                            Traceback (most recent call last)
Cell In [38], line 1
----> 1 predictions = [[row.usr_id, row.code_id, model.predict(row.usr_id, row.code_id)]
      2                for (_, row) in test.iterrows()]
      5 predictions = pd.DataFrame(predictions, columns=['usr_id', 'code_id', 'prediction'])
      6 predictions.head()

Cell In [38], line 1, in <listcomp>(.0)
----> 1 predictions = [[row.usr_id, row.code_id, model.predict(row.usr_id, row.code_id)]
      2                for (_, row) in test.iterrows()]
      5 predictions = pd.DataFrame(predictions, columns=['usr_id', 'code_id', 'prediction'])
      6 predictions.head()

File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/ncf_singlenode.py:434, in NCF.predict(self, user_input, item_input, is_list)
    431     return list(output.reshape(-1))
    433 else:
--> 434     output = self._predict(np.array([user_input]), np.array([item_input]))
    435     return float(output.reshape(-1)[0])

File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/ncf_singlenode.py:440, in NCF._predict(self, user_input, item_input)
    437 def _predict(self, user_input, item_input):
    438 
    439     # index converting
--> 440     user_input = np.array([self.user2id[x] for x in user_input])
    441     item_input = np.array([self.item2id[x] for x in item_input])
    443     # get feed dict

File /anaconda/envs/recsys/lib/python3.8/site-packages/recommenders/models/ncf/ncf_singlenode.py:440, in <listcomp>(.0)
    437 def _predict(self, user_input, item_input):
    438 
    439     # index converting
--> 440     user_input = np.array([self.user2id[x] for x in user_input])
    441     item_input = np.array([self.item2id[x] for x in item_input])
    443     # get feed dict

AttributeError: 'NCF' object has no attribute 'user2id'

help wanted

opened by mrcmoresi 0

[FEATURE] Add duration flag to AzureML tests

Description

Add --durations 0 to the tests

Expected behavior with the suggested feature

Other Comments

related to https://github.com/microsoft/recommenders/issues/1843
enhancement

opened by miguelgfierro 0

Releases(1.1.1)

1.1.1(Jul 20, 2022)
New algorithms or improvements

Reduce iterations of W&D to reduce the integration tests time in https://github.com/microsoft/recommenders/pull/1698

Implementation of most frequent recommendation in https://github.com/microsoft/recommenders/pull/1666

Implement time_now for sarplus in #1719 #1721

Add a fast failure in SAR+ if the similarity metric is not within the options in https://github.com/microsoft/recommenders/pull/1743

SAR item similarity dtype correction in https://github.com/microsoft/recommenders/pull/1751

Simplify SAR test data loading functions in https://github.com/microsoft/recommenders/pull/1752

Reformat SAR+ SQL queries in https://github.com/microsoft/recommenders/pull/1772

Add new item similarity metrics for SAR in https://github.com/microsoft/recommenders/pull/1754

New utilities or improvements

Rewrite get_top_k_items() to improve runtime in https://github.com/microsoft/recommenders/pull/1748

Optimized Spark recall_at_k time performance in https://github.com/microsoft/recommenders/pull/1796

New notebooks or improvements

Fix missing import in FastAI notebook https://github.com/microsoft/recommenders/pull/1708

Review NCF notebook in #1703 #1712

Review LightFM notebook and add test in https://github.com/microsoft/recommenders/pull/1706

Review BPR notebook in https://github.com/microsoft/recommenders/pull/1704

Review LightGCN notebook in https://github.com/microsoft/recommenders/pull/1714

Review DKN notebook in https://github.com/microsoft/recommenders/pull/1722

Review SAR notebook #1738 #1768

Other features

Enable distributed tests with AzureML #1696 #1717 #1729 #1733 #1739 #1747 #1732 #1755 #1763 #1771 #1773 #1775 #1787 #1788 #1794

Added tests for Python 3.8 and 3.9 in https://github.com/microsoft/recommenders/pull/1756

Image of contributors in https://github.com/microsoft/recommenders/pull/1692

Update README.md in #1709 #1711 #1767

Error in codeowners file in https://github.com/microsoft/recommenders/pull/1699

Add test to check if CuDNN is enabled in https://github.com/microsoft/recommenders/pull/1715

Update docker image reference to internal registry in https://github.com/microsoft/recommenders/pull/1727

Fixed a link error in data_transform.ipynb in https://github.com/microsoft/recommenders/pull/1736

Added tests for ranking function get_top_k_items() in https://github.com/microsoft/recommenders/pull/1757

Fix memory error in CPU nightly workflow in https://github.com/microsoft/recommenders/pull/1759

Update test infrastructure explanation #1776 #1777

Added time performance tests in https://github.com/microsoft/recommenders/pull/1765

Add path filter to avoid triggering unit tests when we change a markdown in https://github.com/microsoft/recommenders/pull/1791

Full Changelog: https://github.com/microsoft/recommenders/compare/1.1.0...1.1.1
Source code(tar.gz)
Source code(zip)
recommenders-1.1.1-py3-none-any.whl(331.06 KB)
recommenders-1.1.1.tar.gz(256.83 KB)
1.1.0(Apr 1, 2022)
New algorithms or improvements

SASRec and SSEPT in Tensorflow 2.x in https://github.com/microsoft/recommenders/pull/1530 #1621 #1678

RBM Code Cleanup, model save and other additions in #1599 #1618 #1622

Overwrite older test file in NCF deep dive to avoid bug in https://github.com/microsoft/recommenders/pull/1674

SAR+ improvement and bug fixes #1636 #1644 #1680 #1671

NCF improvement and bug fixes in #1612

Remove drop_duplicates() from SAR method fix #1464 in https://github.com/microsoft/recommenders/pull/1588

SAR literal fix in https://github.com/microsoft/recommenders/pull/1663

New utilities or improvements

Update lightfm_utils.py in https://github.com/microsoft/recommenders/pull/1624

Change formats of user_ids and item_ids arg. in LightFM in https://github.com/microsoft/recommenders/pull/1651

Fix randomness issue in spark_stratified_split() in https://github.com/microsoft/recommenders/pull/1654

Clarification for jaccard and lift similarity measures in https://github.com/microsoft/recommenders/pull/1668

Use numpy divide in explained variance in https://github.com/microsoft/recommenders/pull/1691

Change MovieLens URL from HTTP to HTTPS in https://github.com/microsoft/recommenders/pull/1677

Remove casting of user and item IDs in Spark evaluation in https://github.com/microsoft/recommenders/pull/1686

Persist intermediate data to avoid non-determinism caused by Spark lazy random evaluation in https://github.com/microsoft/recommenders/pull/1676 #1652

New notebooks or improvements

Fix notebook build failure on Spark 3.2 in https://github.com/microsoft/recommenders/pull/1608

Remove early stopping round from LightGBM example notebook in https://github.com/microsoft/recommenders/pull/1620

Other features

Enable Python 3.8 and 3.9 in https://github.com/microsoft/recommenders/pull/1626 #1617

Upgrade Python from 3.6 to 3.7 in ADO tests pipeline in https://github.com/microsoft/recommenders/pull/1627

Increase time out for GPU nightly tests in https://github.com/microsoft/recommenders/pull/1623

Lower LightGBM test AUC base value in https://github.com/microsoft/recommenders/pull/1619

Change timeouts for tests #1625 #1661 #1684

Scenario gaming in https://github.com/microsoft/recommenders/pull/1637

Limiting tests: reducing the time of the news recommendation GPU notebooks in https://github.com/microsoft/recommenders/pull/1656

Remove pydocumentdb in install_requires in https://github.com/microsoft/recommenders/pull/1629

Change and improve dependencies #1630 #1653

Fix Spark tuning test in https://github.com/microsoft/recommenders/pull/1635

Typos in markdown files and other files #1639 #1589 #1646 #1647 #1688

Update Dockerfile in https://github.com/microsoft/recommenders/pull/1645

Improve documentation #1648 #1669 #1682 #1690 #1672

Codecov Fix in https://github.com/microsoft/recommenders/pull/1665

Set Spark env variables in nightly test in https://github.com/microsoft/recommenders/pull/1655 #1659

Full Changelog: https://github.com/microsoft/recommenders/compare/1.0.0...1.1.0
Source code(tar.gz)
Source code(zip)
recommenders-1.1.0-py3-none-manylinux1_x86_64.whl(327.72 KB)
recommenders-1.1.0.tar.gz(247.42 KB)
1.0.0(Jan 13, 2022)
Backwards incompatible changes

TensorFlow upgrade to 2.6.1 / 2.7 #1574 , #1565 , #1540

New algorithms or improvements

Improve algos visibility #1542

LightGBM test improvement #1531

Fix Surprise and Python 3.7 #1540

TF-IDF runtime enhancement changes #1571

Add Spark 3.x support for SARplus #1566

New utilities or improvements

Upgrade to Spark v3 #1555 , #1549 , #1543

Move scikit-surprise and pymanopt from setup.py #1602

Issue with pymanopt #1606

New notebooks or improvements

Fix bugs in RBM notebooks #1581

Remove explicit mapping of ratings to integers from RBM notebooks #1585

Other features

Fix nightly workflows #1576 , #1548

Stabilize more flaky tests #1558

Miscellaneous Pipeline Fixes #1545

Optimize Notebook Unit Tests #1538

Development status change to production/stable #1579

Update dependencies #1569, #1570

Fix Databricks installation script #1531

Adding codespace deployment #1521

Improve GitHub tests #1518, #1578, #1590, #1592

Flake8 Fixes #1552 , #1550

Improvement in documentation #1591, #1598, #1594, #1603

Update release pipeline #1596

Source code(tar.gz)
Source code(zip)
recommenders-1.0.0-py3-none-manylinux1_x86_64.whl(311.20 KB)
recommenders-1.0.0.tar.gz(238.60 KB)
0.7.0(Sep 23, 2021)
Backwards incompatible changes

Renaming of folders #1485, #1478

Change of the PyPI package name to recommenders #1477

New algorithms or improvements

Missing import in VAE #1508

New utilities or improvements

retrying import #1487

Addition of diversity, novelty, coverage and serendipity metrics #1536, #1535, #1522, #1505, #1491, #1470, #1465

New notebooks or improvements

New notebook showcasing diversity, novelty, coverage, and serendipity metrics in Spark #1488, #1470, #1465

Other features

Enablement of LightGBM version 3 #1527

Enablement of all Python 3.7 micro versions #1474

Installation in virtualenv and venv #1520, #1476

Installation from PyPI in docker container #1509

Read the Docs builds #1529, #1528

Documentation improvements #1515, #1469, #1462

CI pipelines on GitHub workflows (WIP) #1517, #1503, #1499, #1494, #1490

Source code(tar.gz)
Source code(zip)
recommenders-0.7.0-py3-none-manylinux1_x86_64.whl(307.00 KB)
recommenders-0.7.0.tar.gz(234.73 KB)
0.6.0(Jun 18, 2021)
New utilities or improvements

Fix URL in unit tests #1447

Improve documentation #1446 #1440 #1436 #1428 #1426 #1425 #1415

Add retry to maybe_download function #1427

New notebooks or improvements

Notebook for diversity metrics #1416

Update evaluation notebook with new diversity metrics #1416

Fix xlearn notebook #1427

Other features

Generate package for PyPi #1445 #1442 #1441 #1429

Improve installation process #1455 #1431

Fix tests #1452 #1427

Generate pipeline for release #1427

Source code(tar.gz)
Source code(zip)
recommenders-0.6.0-py3-none-manylinux1_x86_64.whl(228.49 KB)
recommenders-0.6.0.tar.gz(175.56 KB)
0.5.0(Apr 30, 2021)
Repo structure

Default branch renamed from master to main #1284 #1278

New dataset and competition support

Microsoft News dataset (MIND) and Microsoft News Recommendation Challenge #1247 #1236

New algorithms or improvements

Optimize GPU usage of news recommendation algorithms #1235

Optimize surprise utilities #1224

GeoIMC algorithm #1204

Standard VAE algorithm #1194

Multinomial VAE algorithm #1194

New utilities or improvements

Operationalization example for sequential models #1254

Fix bug with fastai #1288

Fix bug in affinity matrix #1243

Fix conflict with MMLSpark version #1230

Fix negative feedback sampler #1200

New notebooks or improvements

Update AzureML Designer notebooks #1286 #1253

KDD2020 tutorial: paper recommendation with Microsoft Academic Graph #1208

Update o16n notebook for real time scoring #1176

Reduce verbosity on tensorflow notebooks #1276

Other features

Upgrade papermill and scrapbook for testing #1271 #1270 #1282 #1289

Fix tests #1244 #1242 #1226 #1218

Fix issue with spark installation #1186

Update python version #1202

Notice for java dependency #1209

Reactivate CICD pipelines #1284

Source code(tar.gz)
Source code(zip)
0.4.0(Apr 30, 2021)
New algorithms or improvements

DKN fix https://github.com/microsoft/recommenders/pull/1165

GeoIMC https://github.com/microsoft/recommenders/pull/1142

LSTUR #1137 #1080

NAML #1137 #1080

NPA #1137 #1080

NRMS #1137 #1080

LightGCN #1130 #1123

NextItNet #1130 #1126

Fix SAR #1128 #1023 #1018 #991

LightFM #1096

TFIDF recommender #1088

A2SVD #1010

GRU4Rec #1010

Caser #1010

SLi-Rec #1010

SARplus #955

BPR with cornac library #950 #944 #937

New utilities or improvements

MIND dataset https://github.com/microsoft/recommenders/pull/1153

Fix Text iterator https://github.com/microsoft/recommenders/pull/1133

Fix NNI utils #1131

Azure Designer dependencies #1115 #1101 #1095 #1077 #1060

Fix tests #1057 #1004 #954 #935 #932

New notebooks or improvements

DKN notebook with MIND dataset https://github.com/microsoft/recommenders/pull/1165 https://github.com/microsoft/recommenders/pull/1137

GeoIMC notebook https://github.com/microsoft/recommenders/pull/1142

LSTUR notebook #1137 #1080

NAML notebook #1137 #1080

NPA notebook #1137 #1080

NRMS notebook #1137 #1080

LightGCN notebook #1130 #1123

NextItNet notebook #1130 #1126

Implementation of Recommenders into Azure Designer #1115 #1101 #1095 #1060 #1036

NCF hyperparameter tuning notebook #1102 #1092

LightFM notebook #1096

TFIDF recommender notebook #1088

Add timer class into notebooks #1063

Fix xlearn notebook #1006 #974

o16n notebook fix #1003 #969

A2SVD notebook #1010

GRU4Rec notebook #1010

Caser notebook #1010

SLi-Rec notebook #1010

BPR with cornac notebook #950 #944 #937

Other features

Fix installation on Databricks https://github.com/microsoft/recommenders/pull/1161 #965

Fix docker https://github.com/microsoft/recommenders/pull/1146 #1120 #1070 #1058 #1034

Fix Azure blob version #1119

Pin TensorFlow #1098

Code structure refactor #1086

Business scenarios and glossary #1086

ADO artifact #1069

Avoid pandas>1 #1052

CICD #1002 #998 #994 #980

Source code(tar.gz)
Source code(zip)
0.3.1(Apr 30, 2021)
New algorithms or improvements

Improved SAR performance #914 #922

Utils for wikidata knowledge graph #881 #902

New utilities or improvements

Fixed bug in python evaluator #863

Updated nni version and utils #856

Updated sum check #874

Changed url download util to use requests #813

New notebooks or improvements

Optimized spark notebooks #864

New notebook on knowledge graph generation with wikidata #881 #902

Wide-deep hyperdrive notebook AzureML API update #847

Other features

Added Docker support (Dockerfile) for all three (CPU/GPU/Spark) environments

Added setup.py for pip installation #851

Added sphinx documentation #859

Published documentation on readthedocs #912

Fixed spark testing issues #850

Added tests with AzureML compute target #848 #846 #839 #823

Development of Xamarin app for movies recommendation using Recommenders engine https://github.com/microsoft/recommenders_engine_example_layout

Source code(tar.gz)
Source code(zip)
0.3.0(Apr 30, 2021)
New platform support

Windows support with tests #797 #726

New algorithms or improvements

LightGBM #633 #735

RLRMC #729

Changed seed for GPU algos for reproducibility #785 #748

Added benchmark #715

Fixed bugs in SAR #697 #619

New utilities or improvements

Python evaluation improvement by memoization #713

Improved tests #706

New algos for hyperparameter tuning with NNI #687

Criteo dataloader #642

Wrapper VW #592

Added more data formats #605

New metrics #580

New notebooks or improvements

SAR remote execution through AzureML #728

SAR remote execution of notebook through AzureML #681

LightGBM with small criteo on CPU #633

LightGBM o16n on Databricks with MMLSpark #735 #714 #682 #680

Hyperparameter tuning with NNI on Surprise SVD #687

Hyperparameter tuning with Hyperdrive #546

Other features

Fixed bugs in utilities, tests and notebooks

New unit, smoke and integration tests for the new algos

Source code(tar.gz)
Source code(zip)
0.2.0(Apr 30, 2021)
New Algorithms or improvements

Vowpal Wabbit (VW) https://github.com/Microsoft/Recommenders/pull/452

xDeepFM https://github.com/Microsoft/Recommenders/pull/453

DKN https://github.com/Microsoft/Recommenders/pull/453

NCF https://github.com/Microsoft/Recommenders/pull/392

RBM https://github.com/Microsoft/Recommenders/pull/390

FastAI Embedding dot Bias https://github.com/Microsoft/Recommenders/pull/411

Optimization of SAR

New utilities or improvements

Improved the performance of python splitters https://github.com/Microsoft/Recommenders/pull/517

Added GPU utilities

Added utilities for hyperparameter tuning

New Notebooks or improvements

Improved o16n notebook with ALS, Movielens and Databricks https://github.com/Microsoft/Recommenders/pull/475

Added a deep dive notebook on VW https://github.com/Microsoft/Recommenders/pull/452

Improved notebook for hyperparameter tuning on Spark https://github.com/Microsoft/Recommenders/pull/444

New notebook on FastAI Embedding dot Bias algo https://github.com/Microsoft/Recommenders/pull/411

New notebook of deep dive on NCF https://github.com/Microsoft/Recommenders/pull/392

New quick start notebook of RBM https://github.com/Microsoft/Recommenders/pull/390

New deep dive notebook of RBM https://github.com/Microsoft/Recommenders/pull/390

New quickstart notebook of xDeepFM with synthetic data

New quickstart notebook of DKN with synthetic data

New notebook on data transformation https://github.com/Microsoft/Recommenders/pull/384

Other features

Fixed bugs in utilities, tests and notebooks

Added an installation script for Databricks https://github.com/Microsoft/Recommenders/pull/457

Changed installer from a bash to a python script https://github.com/Microsoft/Recommenders/pull/512

Added a parameter to control pyspark version in the installer https://github.com/Microsoft/Recommenders/pull/461

Optimized tests to be quicker https://github.com/Microsoft/Recommenders/pull/486

New unit, smoke and integration tests for the new algos

Added GPU test pipeline https://github.com/Microsoft/Recommenders/pull/408

Improved Github metrics tracker https://github.com/Microsoft/Recommenders/pull/400

Source code(tar.gz)
Source code(zip)
0.1.1(Dec 12, 2018)
New Algorithms or improvements

Improved SAR single node for top k recommendations. Users can now choose whether the recommended top k items are sorted.
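To illustrate why optional sorting of the top k items can matter, here is a generic NumPy sketch. It is not the SAR implementation, and the function and parameter names are hypothetical. Selecting the k largest scores with `argpartition` avoids a full sort, and ordering those k items is a separate step that can be skipped.

```python
import numpy as np


def top_k_indices(scores, k, sort_top_k=True):
    """Return indices of the k highest scores, optionally ordered best-first."""
    scores = np.asarray(scores)
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest scores in the last k slots, in no particular order.
    top = np.argpartition(scores, -k)[-k:]
    if sort_top_k:
        # Order only the selected k items by descending score.
        top = top[np.argsort(-scores[top])]
    return top


scores = [0.1, 0.9, 0.5, 0.7]
sorted_top = top_k_indices(scores, 2)
unsorted_top = top_k_indices(scores, 2, sort_top_k=False)
```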

New utilities or improvements

Added data related utility functions like movielens data download in Python and PySpark.

Added a new data split method (timestamp-based split).
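As a rough picture of what a timestamp-based split does, the pandas sketch below sends the earliest interactions to the training set and the most recent ones to the test set. It is a simplified global split written for illustration, not the library's splitter, and the function name and default column name are assumptions.

```python
import pandas as pd


def timestamp_split(df, ratio=0.75, col_timestamp="timestamp"):
    """Split chronologically: the earliest `ratio` of rows go to train, the rest to test."""
    ordered = df.sort_values(col_timestamp, kind="mergesort").reset_index(drop=True)
    cutoff = int(len(ordered) * ratio)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


interactions = pd.DataFrame(
    {
        "userID": [1, 1, 2, 2],
        "itemID": [10, 11, 10, 12],
        "timestamp": [40, 10, 30, 20],
    }
)
train_part, test_part = timestamp_split(interactions, ratio=0.75)
```

A per-user variant would apply the same cutoff within each user's history, so that every user appears in both sets.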

New Notebooks or improvements

Added an O16N notebook for Spark ALS movie recommender on Azure production services such as Databricks, Cosmos DB, and Kubernetes Services.

Added SAR deep dive notebook with single-node implementation demonstrated.

Added Surprise SVD deep dive notebook.

Added Surprise SVD integration test.

Added Surprise SVD ranking metrics evaluation.

Made quick-start notebooks consistent in terms of running settings, i.e., experiment protocols (e.g., data split, evaluation metrics, etc.) and algorithm parameters (e.g., hyper parameters, remove seen items, etc.).

Added a comparison notebook for easily benchmarking different algorithms.

Other features

Updated SETUP with Azure Databricks.

Added SETUP troubleshooting for Azure DSVM and Databricks.

Updated READMEs under each notebook directory to provide comprehensive guidelines.

Added smoke/integration tests on large movielens dataset (10mil and 20mil).

Updated the Spark settings of the CI/CD machine to eliminate unexpected build failures such as the "no space left" issue.

Source code(tar.gz)
Source code(zip)
0.1.0(Nov 12, 2018)
New Algorithms or improvements

Development of the SAR algorithm in three implementations:

SAR single node

SAR PySpark

SAR+

New utilities or improvements

Dataset splitters in Python and PySpark.

Rating and ranking metrics in Python and PySpark.

New Notebooks or improvements

ALS quickstart with Movielens

SAR single node quickstart with Movielens

SAR PySpark quickstart with Movielens

SAR+ quickstart with Movielens

Data splitter

ALS deep dive

SAR deep dive

Evaluation

Other features

Benchmark of the current algorithms.

Unit, smoke and integration tests for Python and PySpark environments.

Source code(tar.gz)
Source code(zip)
