A python fast implementation of the famous SVD algorithm popularized by Simon Funk during Netflix Prize

Geoffrey Bolmier

Last update: Dec 19, 2022

Related tags

Overview

⚡ funk-svd

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Neflix Prize contest.

Numba is used to speed up our algorithm, enabling us to run over 10 times faster than Surprise's Cython implementation (cf. benchmark notebook).

Movielens 20M	RMSE	MAE	Time
Surprise	0.88	0.68	10 min 40 sec
Funk-svd	0.88	0.68	42 sec

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75

Funk SVD for recommendation in a nutshell

We have a huge sparse matrix:

$R = \begin{pmatrix} {\color{Red} ?} & 2 & \cdots & {\color{Red} ?} & {\color{Red} ?} \\ {\color{Red} ?} & {\color{Red} ?} & \cdots & {\color{Red} ?} & 4.5 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 3 & {\color{Red} ?} & \cdots & {\color{Red} ?} & {\color{Red} ?} \\ {\color{Red} ?} & {\color{Red} ?} & \cdots & 5 & {\color{Red} ?} \end{pmatrix}$

storing known ratings for a set of users and items:

The idea is to estimate unknown ratings by factorizing the rating matrix into two smaller matrices representing user and item characteristics:

$P = \begin{pmatrix} 0.37 & \cdots & 0.69 \\ \vdots & \ddots & \vdots \\ \vdots & \ddots & \vdots \\ \vdots & \ddots & \vdots \\ 1.08 & \cdots & 0.24 \end{pmatrix} , Q = \begin{pmatrix} 0.09 & \cdots & \cdots & \cdots & 0.46 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0.51 & \cdots & \cdots & \cdots & 0.72 \end{pmatrix}$

We call these two matrices users and items latent factors. Then, by applying the dot product between both matrices we can reconstruct our rating matrix. The trick is that the empty values will now contain estimated ratings.

In order to get more accurate results, the global average rating as well as the user and item biases are used in addition:

$\bar{r} = \frac{1}{N} \sum_{i=1}^{N} K_{i}$

where K stands for known ratings.

$bu = \begin{pmatrix} 0.35 & \cdots & 0.07 \end{pmatrix}$

$bi = \begin{pmatrix} 0.16 & \cdots & 0.40 \end{pmatrix}$

Then, we can estimate any rating by applying:

$\hat{r}_{u, i} = \bar{r} + bu_{u} + bi_{i} + \sum_{f=1}^{F} P_{u, f} * Q_{i, f}$

The learning step consists in performing the SGD algorithm where for each known rating the biases and latent factors are updated as follows:

$err = r - \hat{r}$

$bu_{u} = bu_{u} + \alpha * (err - \lambda * bu_{u})$

$bi_{i} = bi_{i} + \alpha * (err - \lambda * bi_{i})$

$P_{u, f} = P_{u, f} + \alpha * (err * Q_{i, f} - \lambda * P_{u, f})$

$Q_{i, f} = Q_{i, f} + \alpha * (err * P_{u, f} - \lambda * Q_{i, f})$

where alpha is the learning rate and lambda is the regularization term.

References

License

MIT license, see here.

Comments

Documentation: Difference between this and "regular" SVD
As Wikipedia had suggested, Funk et. al. is not using the regular form of SVD with three variable, but rather a modified form of NMF with two variables but faster convergence and speed.

With reference to the Funk:

https://www.wikiwand.com/en/Matrix_factorization_(recommender_systems)#/Funk_MF

With reference to SVD:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html

Would it be good to explain how the optimization differs from NMF?
opened by BrandonKMLee 3
val metrics storing,
ChangeLog:

validation metrics are computed as soon as a validation set is provided (regardless of the early_stopping argument)

store validation metrics during epochs iterations in SVD.metrics_ (I love to monitor algo convergence :). Retrieve a Pandas DataFrame with SVD.get_val_metrics()

add min_delta argument in fit()

some picky syntax cleaning
opened by sicotfre 3
Documentation: Dataset shape requirements
For SVD to be usable in other settings, it would be useful to describe the acceptable datatype for SVD.

accepted inputs: From reading this as part of the MovieLens dataset we can safely assume that it follows the user-item-ratings format, while the testing tool uses the user-item format. These format specifications should be well documented for alternative usage (due to dataframe columns having the possible issue of misordering).

expanding dataset: Since the ID hash table is done within the SVD object, every time the data expands, the SVD object would also have to be rebuilt from scratch instead of expanding _preprocess_data fields accordingly.

re-fitting: If there are data that is given for updating, the SVD latent factor matrices, and bias matrices should be updated from the new data. However, whether the updating procedures should include times 1 to X-1 or not in time step X, is not documented.

caching to external object: Since sometimes the SVD has to go offline, caching the object is necessary. It would be useful to know if it is better to store the data in CSVs (for readability) or Pickle/DataTables (for legibility).

Explanation of feature: It should be noted that, unlike other base repos that utilize dense or sparse matrices, this repo is more akin to machine learning, and that it does not "update by the table" but rather "update by the entry". Also an explanation of validation/test dataset is needed (especially in a practical non-evaluation setting). https://www.wikiwand.com/en/Training,_validation,_and_test_sets
opened by BrandonKMLee 1
TypeError: 'Series' objects are mutable, thus they cannot be hashed

When run benchmark.ipynb, got the following error:

TypeError Traceback (most recent call last) in 8 pool = mp.Pool(processes=n_process) 9 ---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)] 11 12 results = [

in (.0) 8 pool = mp.Pool(processes=n_process) 9 ---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)] 11 12 results = [

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs) 3467 kwargs["level"] = kwargs.pop("level", 0) + 1 3468 kwargs["target"] = None -> 3469 res = self.eval(expr, **kwargs) 3470 3471 try:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs) 3597 kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers) 3598 -> 3599 return _eval(expr, inplace=inplace, **kwargs) 3600 3601 def select_dtypes(self, include=None, exclude=None) -> DataFrame:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace) 345 eng = ENGINES[engine] 346 eng_inst = eng(parsed_expr) --> 347 ret = eng_inst.evaluate() 348 349 if parsed_expr.assigner is None:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in evaluate(self) 71 72 # make sure no names in resolvers and locals/globals clash ---> 73 res = self._evaluate() 74 return reconstruct_object( 75 self.result_type, res, self.aligned_axes, self.expr.terms.return_type

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _evaluate(self) 111 env = self.expr.env 112 scope = env.full_scope --> 113 _check_ne_builtin_clash(self.expr) 114 return ne.evaluate(s, local_dict=scope) 115

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _check_ne_builtin_clash(expr) 27 Terms can contain 28 """ ---> 29 names = expr.names 30 overlap = names & _ne_builtins 31

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/expr.py in names(self) 823 """ 824 if is_term(self.terms): --> 825 return frozenset([self.terms.name]) 826 return frozenset(term.name for term in com.flatten(self.terms)) 827

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in hash(self) 1783 1784 def hash(self) -> int: -> 1785 raise TypeError( 1786 f"{repr(type(self).name)} objects are mutable, " 1787 f"thus they cannot be hashed"

TypeError: 'Series' objects are mutable, thus they cannot be hashed

opened by htcml 1
(Feature) Recommend top N items given a user

Querying every item on the table would be inefficient. There could be a way to make this process simpler by multiplying a user's vector with the whole item matrix. Guessing: np.dot(self.pu_[u_ix], self.qi_) + self.bi_+ self.bu_[u_ix]

opened by BrandonKMLee 8
Including method SVD.get_utility_matrix

A function to get the utility matrix with users as rows and items as columns was added to the SVD class. This matrix may be useful for other matrix factorization methods. A test for the function was developed in the run_experiment.py script.

opened by guedes-joaofelipe 4
verbose mode

Hi. It would be nice in svd.fit() to have a verbose mode, where train- and validation error would be printed in every iteration. Currently only validation error is printed and even that only in the case, when early_stop==True.

opened by YitzhakSp 1

Owner

Geoffrey Bolmier

GitHub

Python implementation of the rulefit algorithm

RuleFit Implementation of a rule based prediction algorithm based on the rulefit algorithm from Friedman and Popescu (PDF) The algorithm can be used f

326 Jan 2, 2023

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment to test the algorithm

59 Dec 9, 2022

Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

GENDIS GENetic DIscovery of Shapelets In the time series classification domain, shapelets are small subseries that are discriminative for a certain cl

90 Oct 28, 2022

Implementation of K-Nearest Neighbors Algorithm Using PySpark

KNN With Spark Implementation of KNN using PySpark. The KNN was used on two separate datasets (https://archive.ics.uci.edu/ml/datasets/iris and https:

4 Dec 30, 2022

High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

174 Dec 7, 2022

Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Naive-Bayes Spam Classificator Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm. Main goal is to code a

1 Jun 27, 2022

Send rockets to Mars with artificial intelligence(Genetic algorithm) in python.

Send Rockets To Mars With AI Send rockets to Mars with artificial intelligence(Genetic algorithm) in python. Tools Python 3 EasyDraw How to Play Insta

3 Jul 15, 2022

Decision Tree Regression algorithm implemented on Python from scratch.

Decision_Tree_Regression I implemented the decision tree regression algorithm on Python. Unlike regular linear regression, this algorithm is used when

1 Dec 22, 2021

This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

Crypto-Currency-Predictor This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you

6 Dec 4, 2022

BASTA: The BAyesian STellar Algorithm

BASTA: BAyesian STellar Algorithm Current stable version: v1.0 Important note: BASTA is developed for Python 3.8, but Python 3.7 should work as well.

16 Nov 15, 2022

using Machine Learning Algorithm to classification AppleStore application

AppleStore-classification-with-Machine-learning-Algo- using Machine Learning Algorithm to classification AppleStore application. the first step : 1: p

2 May 2, 2022

The project's goal is to show a real world application of image segmentation using k means algorithm

2 Jan 22, 2022

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

6.9k Jan 5, 2023

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Parallelized symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

924 Jan 3, 2023

ThunderSVM: A Fast SVM Library on GPUs and CPUs

What's new We have recently released ThunderGBM, a fast GBDT and Random Forest library on GPUs. add scikit-learn interface, see here Overview The miss

1.4k Dec 22, 2022

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

14.5k Jan 7, 2023

A python fast implementation of the famous SVD algorithm popularized by Simon Funk during Netflix Prize

Related tags

Overview

⚡ funk-svd

Installation

Contributing

Quick example

Funk SVD for recommendation in a nutshell

References

License

Comments

Documentation: Difference between this and "regular" SVD

val metrics storing,

Documentation: Dataset shape requirements

TypeError: 'Series' objects are mutable, thus they cannot be hashed

(Feature) Recommend top N items given a user

Including method SVD.get_utility_matrix

verbose mode

Owner

Geoffrey Bolmier

Python implementation of the rulefit algorithm

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

Implementation of K-Nearest Neighbors Algorithm Using PySpark

High performance implementation of Extreme Learning Machines (fast randomized neural networks).

Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Send rockets to Mars with artificial intelligence(Genetic algorithm) in python.

Decision Tree Regression algorithm implemented on Python from scratch.

This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

BASTA: The BAyesian STellar Algorithm

using Machine Learning Algorithm to classification AppleStore application

The project's goal is to show a real world application of image segmentation using k means algorithm

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

ThunderSVM: A Fast SVM Library on GPUs and CPUs

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

ThunderGBM: Fast GBDTs and Random Forests on GPUs

Greykite: A flexible, intuitive and fast forecasting library

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

A python fast implementation of the famous SVD algorithm popularized by Simon Funk during Netflix Prize

Related tags

Overview

⚡ funk-svd

Installation

Contributing

Quick example

Funk SVD for recommendation in a nutshell

References

License

Comments

Documentation: Difference between this and "regular" SVD

val metrics storing,

Documentation: Dataset shape requirements

TypeError: 'Series' objects are mutable, thus they cannot be hashed

(Feature) Recommend top N items given a user

Including method SVD.get_utility_matrix

verbose mode

Owner

Geoffrey Bolmier

Python implementation of the rulefit algorithm

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

Implementation of K-Nearest Neighbors Algorithm Using PySpark

High performance implementation of Extreme Learning Machines (fast randomized neural networks).

Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Send rockets to Mars with artificial intelligence(Genetic algorithm) in python.

Decision Tree Regression algorithm implemented on Python from scratch.

This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

BASTA: The BAyesian STellar Algorithm

using Machine Learning Algorithm to classification AppleStore application

The project's goal is to show a real world application of image segmentation using k means algorithm

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

ThunderSVM: A Fast SVM Library on GPUs and CPUs

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

ThunderGBM: Fast GBDTs and Random Forests on GPUs

﻿Greykite: A flexible, intuitive and fast forecasting library

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Greykite: A flexible, intuitive and fast forecasting library