A fast Python implementation of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize

Overview

funk-svd

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize contest.

Numba is used to speed up our algorithm, making it run over 10 times faster than Surprise's Cython implementation (cf. benchmark notebook).
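
For intuition, Numba JIT-compiles plain Python loops over NumPy arrays to machine code with a single decorator. A generic illustration (an editor's sketch, unrelated to the library's actual internals):

>>> import numpy as np
>>> from numba import njit

>>> @njit  # JIT-compiles the loop to machine code on first call
... def dot_loop(a, b):
...     total = 0.0
...     for k in range(a.shape[0]):
...         total += a[k] * b[k]
...     return total

>>> a, b = np.random.rand(10**6), np.random.rand(10**6)
>>> dot_loop(a, b)  # subsequent calls run at compiled speed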

Movielens 20M    RMSE    MAE     Time
Surprise         0.88    0.68    10 min 40 sec
funk-svd         0.88    0.68    42 sec

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75

Funk SVD for recommendation in a nutshell

We have a huge sparse matrix $R$ storing the known ratings $r_{u,i}$ for a set of users and items; most of its entries are missing.

The idea is to estimate unknown ratings by factorizing the rating matrix into two smaller matrices representing user and item characteristics:

$R \approx P \cdot Q^T$, with $P \in \mathbb{R}^{U \times k}$ (user factors) and $Q \in \mathbb{R}^{I \times k}$ (item factors)

We call these two matrices the user and item latent factors. Then, by applying the dot product between both matrices we can reconstruct our rating matrix. The trick is that the previously empty cells now contain estimated ratings.
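
As a toy illustration of this reconstruction (an editor's example, not library code):

>>> import numpy as np

>>> rng = np.random.default_rng(0)
>>> P = rng.normal(scale=0.1, size=(4, 2))  # 4 users, 2 latent factors
>>> Q = rng.normal(scale=0.1, size=(5, 2))  # 5 items, 2 latent factors

>>> R_hat = P @ Q.T  # estimated rating for every (user, item) cell
>>> R_hat.shape
(4, 5)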

In order to get more accurate results, the global average rating as well as the user and item biases are used in addition:

$\mu = \frac{1}{|K|} \sum_{(u,i) \in K} r_{u,i}$

where $K$ stands for the set of known ratings, and $b_u$ and $b_i$ denote the user and item biases.

Then, we can estimate any rating by applying:

$\hat{r}_{u,i} = \mu + b_u + b_i + p_u \cdot q_i$

The learning step consists in performing SGD, where for each known rating $r_{u,i}$ the prediction error $e_{u,i} = r_{u,i} - \hat{r}_{u,i}$ is computed and the biases and latent factors are updated as follows:

$b_u \leftarrow b_u + \alpha \cdot (e_{u,i} - \lambda \cdot b_u)$

$b_i \leftarrow b_i + \alpha \cdot (e_{u,i} - \lambda \cdot b_i)$

$p_u \leftarrow p_u + \alpha \cdot (e_{u,i} \cdot q_i - \lambda \cdot p_u)$

$q_i \leftarrow q_i + \alpha \cdot (e_{u,i} \cdot p_u - \lambda \cdot q_i)$

where $\alpha$ is the learning rate and $\lambda$ is the regularization term.
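
A minimal sketch of one SGD epoch implementing these update rules in plain NumPy (an editor's illustration of the math above, not the library's actual Numba-compiled code):

>>> import numpy as np

>>> def run_epoch(ratings, pu, qi, bu, bi, mu, lr, reg):
...     """One SGD pass over (user index, item index, rating) rows."""
...     for u, i, r in ratings:
...         u, i = int(u), int(i)
...         err = r - (mu + bu[u] + bi[i] + pu[u] @ qi[i])  # e_ui
...         bu[u] += lr * (err - reg * bu[u])
...         bi[i] += lr * (err - reg * bi[i])
...         pu_u = pu[u].copy()  # keep the old p_u for q_i's update
...         pu[u] += lr * (err * qi[i] - reg * pu_u)
...         qi[i] += lr * (err * pu_u - reg * qi[i])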

References

• Simon Funk, Netflix Update: Try This at Home (2006): https://sifter.org/~simon/journal/20061211.html
• Y. Koren, R. Bell, and C. Volinsky, Matrix Factorization Techniques for Recommender Systems, IEEE Computer (2009)

License

MIT license, see the LICENSE file.

Comments
  • Documentation: Difference between this and "regular" SVD

    As Wikipedia suggests, Funk et al. are not using the regular form of SVD with three matrices, but rather a modified form of NMF with two matrices, which converges faster.

    With reference to the Funk:

    • https://www.wikiwand.com/en/Matrix_factorization_(recommender_systems)#/Funk_MF

    With reference to SVD:

    • https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    • https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html

    Would it be good to explain how the optimization differs from NMF?
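
    To make the distinction concrete, here is a small editor's sketch (assuming only the standard NumPy API): classical SVD returns three matrices and requires a fully observed input, while Funk MF learns two matrices from the known entries alone, with no orthogonality constraints and no singular values.

    import numpy as np

    R = np.random.rand(6, 4)  # classical SVD needs a dense, fully observed matrix

    # "Regular" SVD: three factors U, S, Vt
    U, S, Vt = np.linalg.svd(R, full_matrices=False)

    # Funk MF: two factors P, Q, learned by SGD over known entries only
    k = 2
    P = np.random.normal(scale=0.1, size=(R.shape[0], k))
    Q = np.random.normal(scale=0.1, size=(R.shape[1], k))
    R_hat = P @ Q.T  # estimates for every cell, known or not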

    opened by BrandonKMLee 3
  • val metrics storing

    ChangeLog:

    • validation metrics are computed as soon as a validation set is provided (regardless of the early_stopping argument)
    • validation metrics are stored across epochs in SVD.metrics_ (I love to monitor algo convergence :). Retrieve them as a Pandas DataFrame with SVD.get_val_metrics()
    • add a min_delta argument to fit()
    • some picky syntax cleaning
    opened by sicotfre 3
  • Documentation: Dataset shape requirements

    For SVD to be usable in other settings, it would be useful to document the accepted input format for SVD.

    1. Accepted inputs: from its use with the MovieLens dataset we can safely assume the user-item-rating format for fitting, while prediction uses the user-item format. These format specifications should be well documented for alternative usage (DataFrame columns can easily be misordered); see the sketch after this list.
    2. Expanding dataset: since the ID hash table is built inside the SVD object, every time the data expands the SVD object has to be rebuilt from scratch instead of expanding the _preprocess_data fields accordingly.
    3. Re-fitting: if new data is provided for updating, the SVD latent factor matrices and bias vectors should be updated from it. However, whether the updating procedure at time step X should also include times 1 to X-1 is not documented.
    4. Caching to an external object: since the SVD sometimes has to go offline, caching the object is necessary. It would be useful to know whether it is better to store the data in CSVs (for readability) or Pickle/DataTables (for legibility).
    5. Explanation of features: it should be noted that, unlike other repos built on dense or sparse matrices, this repo is more akin to machine learning in that it does not "update by the table" but rather "update by the entry". An explanation of validation/test datasets is also needed (especially in a practical, non-evaluation setting): https://www.wikiwand.com/en/Training,_validation,_and_test_sets
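
    A sketch of the expected input shape, based on the README's quick example (the u_id/i_id/rating column names follow fetch_ml_ratings and are an assumption for other datasets):

    import pandas as pd
    from funk_svd import SVD

    # Long format: one row per (user, item, rating) triple
    df = pd.DataFrame({
        'u_id': [1, 1, 2, 3],
        'i_id': [10, 20, 10, 30],
        'rating': [4.0, 3.5, 5.0, 2.0],
    })

    svd = SVD(n_epochs=10, n_factors=5)
    svd.fit(X=df)

    # Prediction only needs the user-item pairs
    pred = svd.predict(df[['u_id', 'i_id']])
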
    opened by BrandonKMLee 1
  • TypeError: 'Series' objects are mutable, thus they cannot be hashed

    When running benchmark.ipynb, I got the following error:


    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>
          8 pool = mp.Pool(processes=n_process)
          9
    ---> 10 df_splitted = [df.query('u_id.isin(@users_subset)') for users_subset in np.array_split(users, n_process)]
         11
         12 results = [

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in query(self, expr, inplace, **kwargs)
       3467         kwargs["level"] = kwargs.pop("level", 0) + 1
       3468         kwargs["target"] = None
    -> 3469         res = self.eval(expr, **kwargs)
       3470
       3471         try:

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in eval(self, expr, inplace, **kwargs)
       3597         kwargs["resolvers"] = kwargs.get("resolvers", ()) + tuple(resolvers)
       3598
    -> 3599         return _eval(expr, inplace=inplace, **kwargs)
       3600
       3601     def select_dtypes(self, include=None, exclude=None) -> DataFrame:

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/eval.py in eval(expr, parser, engine, truediv, local_dict, global_dict, resolvers, level, target, inplace)
        345     eng = ENGINES[engine]
        346     eng_inst = eng(parsed_expr)
    --> 347     ret = eng_inst.evaluate()
        348
        349     if parsed_expr.assigner is None:

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in evaluate(self)
         71
         72         # make sure no names in resolvers and locals/globals clash
    ---> 73         res = self._evaluate()
         74         return reconstruct_object(
         75             self.result_type, res, self.aligned_axes, self.expr.terms.return_type

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _evaluate(self)
        111         env = self.expr.env
        112         scope = env.full_scope
    --> 113         _check_ne_builtin_clash(self.expr)
        114         return ne.evaluate(s, local_dict=scope)
        115

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/engines.py in _check_ne_builtin_clash(expr)
         27     Terms can contain
         28     """
    ---> 29     names = expr.names
         30     overlap = names & _ne_builtins
         31

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/computation/expr.py in names(self)
        823         """
        824         if is_term(self.terms):
    --> 825             return frozenset([self.terms.name])
        826         return frozenset(term.name for term in com.flatten(self.terms))
        827

    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __hash__(self)
       1783
       1784     def __hash__(self) -> int:
    -> 1785         raise TypeError(
       1786             f"{repr(type(self).__name__)} objects are mutable, "
       1787             f"thus they cannot be hashed"

    TypeError: 'Series' objects are mutable, thus they cannot be hashed

    opened by htcml 1
  • (Feature) Recommend top N items given a user

    Querying every item through the prediction API would be inefficient. The process could be made simpler by multiplying a user's vector with the whole item factor matrix. Guessing: np.dot(self.qi_, self.pu_[u_ix]) + self.bi_ + self.bu_[u_ix] (see the sketch below).
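
    A hedged sketch of such a helper (editor's illustration; pu_, qi_, bu_, bi_ are the fitted attributes, while the global-mean attribute name is an assumption):

    import numpy as np

    def recommend_top_n(svd, u_ix, n=10):
        # Score every item for one already-indexed user in a single pass
        scores = (svd.global_mean_            # assumed attribute name for the global mean
                  + svd.bu_[u_ix]             # user bias (scalar)
                  + svd.bi_                   # item biases, shape (n_items,)
                  + svd.qi_ @ svd.pu_[u_ix])  # dot product with every item factor
        return np.argsort(scores)[::-1][:n]   # indices of the n best-scoring items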

    opened by BrandonKMLee 8
  • Including method SVD.get_utility_matrix

    A function returning the utility matrix, with users as rows and items as columns, was added to the SVD class. This matrix may be useful for other matrix factorization methods. A test for the function was added to the run_experiment.py script.
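
    Conceptually, such a method broadcasts the biases over the reconstructed factor product; a minimal sketch (not necessarily this PR's implementation, and the global-mean attribute name is an assumption):

    import numpy as np

    def utility_matrix(svd):
        # Dense matrix of estimated ratings: users as rows, items as columns
        return (svd.global_mean_              # assumed attribute name
                + svd.bu_[:, np.newaxis]      # user biases as a column
                + svd.bi_[np.newaxis, :]      # item biases as a row
                + svd.pu_ @ svd.qi_.T)        # latent-factor reconstruction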

    opened by guedes-joaofelipe 4
  • verbose mode

    Hi. It would be nice for svd.fit() to have a verbose mode where the training and validation errors are printed at every iteration. Currently only the validation error is printed, and even then only when early_stopping=True.

    opened by YitzhakSp 1
Owner
Geoffrey Bolmier
Python implementation of the rulefit algorithm

RuleFit Implementation of a rule based prediction algorithm based on the rulefit algorithm from Friedman and Popescu (PDF) The algorithm can be used f

Christoph Molnar 326 Jan 2, 2023
This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment to test the algorithm

Martin Huber 59 Dec 9, 2022
Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

GENDIS GENetic DIscovery of Shapelets In the time series classification domain, shapelets are small subseries that are discriminative for a certain cl

IDLab Services 90 Oct 28, 2022
Implementation of K-Nearest Neighbors Algorithm Using PySpark

KNN With Spark Implementation of KNN using PySpark. The KNN was used on two separate datasets (https://archive.ics.uci.edu/ml/datasets/iris and https:

Zachary Petroff 4 Dec 30, 2022
High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

Anton Akusok 174 Dec 7, 2022
Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Naive-Bayes Spam Classificator Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm. Main goal is to code a

Viktoria Maksymiuk 1 Jun 27, 2022
Send rockets to Mars with artificial intelligence(Genetic algorithm) in python.

Send Rockets To Mars With AI Send rockets to Mars with artificial intelligence(Genetic algorithm) in python. Tools Python 3 EasyDraw How to Play Insta

Mohammad Dori 3 Jul 15, 2022
Decision Tree Regression algorithm implemented on Python from scratch.

Decision_Tree_Regression I implemented the decision tree regression algorithm on Python. Unlike regular linear regression, this algorithm is used when

null 1 Dec 22, 2021
This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

Crypto-Currency-Predictor This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you

Hazim Arafa 6 Dec 4, 2022
BASTA: The BAyesian STellar Algorithm

BASTA: BAyesian STellar Algorithm Current stable version: v1.0 Important note: BASTA is developed for Python 3.8, but Python 3.7 should work as well.

BASTA team 16 Nov 15, 2022
using Machine Learning Algorithm to classification AppleStore application

AppleStore-classification-with-Machine-learning-Algo- using Machine Learning Algorithm to classification AppleStore application. the first step : 1: p

Mohammed Hussien 2 May 2, 2022
The project's goal is to show a real world application of image segmentation using k means algorithm

null 2 Jan 22, 2022
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 5, 2023
Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Parallelized symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

Miles Cranmer 924 Jan 3, 2023
ThunderSVM: A Fast SVM Library on GPUs and CPUs

What's new We have recently released ThunderGBM, a fast GBDT and Random Forest library on GPUs. add scikit-learn interface, see here Overview The miss

Xtra Computing Group 1.4k Dec 22, 2022
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 7, 2023
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022
Greykite: A flexible, intuitive and fast forecasting library

The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite.

LinkedIn 1.7k Jan 4, 2023
Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Robustness Gym 115 Dec 12, 2022