🌊 River is a Python library for online machine learning.

Overview



River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River's ambition is to be the go-to library for doing machine learning on streaming data.

⚡️ Quickstart

As a quick example, we'll train a logistic regression to classify the website phishing dataset. Here's a look at the first observation in the dataset.

>>> from pprint import pprint
>>> from river import datasets

>>> dataset = datasets.Phishing()

>>> for x, y in dataset:
...     pprint(x)
...     print(y)
...     break
{'age_of_domain': 1,
 'anchor_from_other_domain': 0.0,
 'empty_server_form_handler': 0.0,
 'https': 0.0,
 'ip_in_url': 1,
 'is_popular': 0.5,
 'long_url': 1.0,
 'popup_window': 0.0,
 'request_from_other_domain': 0.0}
True

Now let's run the model on the dataset in a streaming fashion. We sequentially interleave predictions and model updates. Meanwhile, we update a performance metric to see how well the model is doing.

>>> from river import compose
>>> from river import linear_model
>>> from river import metrics
>>> from river import preprocessing

>>> model = compose.Pipeline(
...     preprocessing.StandardScaler(),
...     linear_model.LogisticRegression()
... )

>>> metric = metrics.Accuracy()

>>> for x, y in dataset:
...     y_pred = model.predict_one(x)      # make a prediction
...     metric = metric.update(y, y_pred)  # update the metric
...     model = model.learn_one(x, y)      # make the model learn

>>> metric
Accuracy: 89.20%
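
Interleaving predictions, metric updates, and model updates is such a common pattern that River ships a helper which runs this loop for you. Here is a minimal sketch using evaluate.progressive_val_score (a fresh model is needed, since the one above has already seen the whole dataset):

>>> from river import evaluate

>>> model = compose.Pipeline(
...     preprocessing.StandardScaler(),
...     linear_model.LogisticRegression()
... )

>>> evaluate.progressive_val_score(dataset, model, metrics.Accuracy())
Accuracy: 89.20%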

🛠 Installation

River is intended to work with Python 3.6 or above. Installation can be done with pip:

pip install river

There are wheels available for Linux, macOS, and Windows, which means that you most likely won't have to build River from source.

You can install the latest development version from GitHub like so:

pip install git+https://github.com/online-ml/river --upgrade

Or, through SSH:

pip install git+ssh://[email protected]/online-ml/river.git --upgrade

🧠 Philosophy

Machine learning is often done in a batch setting, whereby a model is fitted to a dataset in one go. This results in a static model which has to be retrained in order to learn from new data. In many cases, this is neither elegant nor efficient, and it usually incurs a fair amount of technical debt. Indeed, if you're using a batch model, then you need to think about maintaining a training set, monitoring real-time performance, model retraining, etc.

With River, we encourage a different approach, which is to continuously learn from a stream of data. This means that the model processes one observation at a time, and can therefore be updated on the fly. This makes it possible to learn from massive datasets that don't fit in main memory. Online machine learning also integrates nicely in cases where new data is constantly arriving. It shines in many use cases, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. If you're bored with retraining models and want to instead build dynamic models, then online machine learning (and therefore River!) might be what you're looking for.
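
To make this concrete, data can be streamed straight from disk so that only one observation is ever held in memory. Here is a minimal sketch using River's stream.iter_csv helper; the file name, column names, and converters are hypothetical:

from river import linear_model, stream

model = linear_model.LogisticRegression()

# Rows are read and parsed one at a time, so the full file is never
# loaded into memory. CSV fields arrive as strings, hence the converters.
params = dict(
    target='clicked',
    converters={'price': float, 'hour': int, 'clicked': int},
)
for x, y in stream.iter_csv('clicks.csv', **params):
    model.learn_one(x, y)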

Here are some benefits of using River (and online machine learning in general):

  • Incremental: models can update themselves in real-time.
  • Adaptive: models can adapt to concept drift (a sketch follows after this list).
  • Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
  • Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
  • Fast: when the goal is to learn and predict with a single instance at a time, then River is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.
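
To illustrate the adaptive point, here is a minimal concept drift detection sketch using the drift module's ADWIN detector. The attribute that signals a drift has been renamed across River versions, so treat this as a sketch rather than a pinned API:

import random

from river import drift

detector = drift.ADWIN()

# A synthetic stream whose mean jumps halfway through.
values = [random.gauss(0, 1) for _ in range(1000)]
values += [random.gauss(4, 1) for _ in range(1000)]

for i, v in enumerate(values):
    detector.update(v)
    if detector.drift_detected:  # named `change_detected` in some versions
        print(f'Change detected at index {i}')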

🔥 Features

  • Linear models with a wide array of optimizers
  • Nearest neighbors, decision trees, naïve Bayes
  • Progressive model validation
  • Model pipelines as a first-class citizen
  • Anomaly detection
  • Recommender systems
  • Time series forecasting
  • Imbalanced learning
  • Clustering
  • Feature extraction and selection
  • Online statistics and metrics (a sketch follows after this list)
  • Built-in datasets
  • And much more
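
As a taste of the online statistics mentioned above, estimators in the stats module follow the same one-observation-at-a-time pattern as the models; the chained assignments below mirror the metric updates in the quickstart:

>>> from river import stats

>>> mean = stats.Mean()
>>> variance = stats.Var()

>>> for x in [1, 2, 3, 4, 5]:
...     mean = mean.update(x)
...     variance = variance.update(x)

>>> mean.get()
3.0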

👍 Contributing

Feel free to contribute in any way you like; we're always open to new ideas and approaches.

There are three ways for users to get involved:

  • Issue tracker: this place is meant to report bugs and request minor features or small improvements. Issues should be short-lived and solved as fast as possible.
  • Discussions: you can ask for new features, submit your questions and get help, propose new ideas, or even show the community what you are achieving with River! If you have a new technique or want to port a new functionality to River, this is the place to discuss.
  • Roadmap: you can check what we are doing, see the next planned milestones for River, and look for cool ideas that still need someone to make them a reality!

Please check out the contribution guidelines if you want to bring modifications to the code base. You can view the list of people who have contributed here.

❤️ They've used us

These are companies that we know have been using River, be it in production or for prototyping.


Feel welcome to get in touch if you want us to add your company logo!

🤝 Affiliations

Sponsors


Collaborating institutions and groups


💬 Citation

If River has been useful for your research and you would like to cite it in a scientific publication, please refer to this paper:

@misc{2020river,
      title={River: machine learning for streaming data in Python},
      author={Jacob Montiel and Max Halford and Saulo Martiello Mastelini
              and Geoffrey Bolmier and Raphael Sourty and Robin Vaysse
              and Adil Zouitine and Heitor Murilo Gomes and Jesse Read
              and Talel Abdessalem and Albert Bifet},
      year={2020},
      eprint={2012.04740},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

📝 License

River is free and open-source software licensed under the 3-clause BSD license.

Comments
  • refactoring neighbors models to use simple collections queue


    This is a follow up to discussion in https://github.com/online-ml/river/issues/891#issuecomment-1080993289.

    The current implementation of nearest neighbors was done as a performance tweak, but it is limited: it cannot accept slightly varying features (the vector size for X is fixed and you get an error if you deviate from the first one), nor can it store additional metadata about a point in the window, such as a UID that can be used to look the point up later. From an online ML standpoint (as I understand it) we should optimize for this kind of flexibility over a small performance tweak (but coming from the world of HPC, I totally get the performance bit!).

    This refactor takes the base logic from creme.neighbors and adds some additional features:

    1. A minimum distance to consider adding a new datum to the window, ensuring we have a more varied window that does better at learning. The default for my model was 0.05 because I had ~250K points with quite a bit of repetition, so adding redundancy was expensive and led to bad results, but I chose a default of 0.0 here so the default mimics what you would expect.
    2. Given that the window can change (and the classes within it), I realized that the prediction dict could potentially have a lot of extra "no longer existing in the window" classes. So I added a class cleanup function for classification that iterates over data in the window and ensures the classes known to the class accurately reflect the current window. The default is False so this will not run (and the model maintains all memory of classes it has seen, returning 0 probability for them), but it can be set to True to always run after learn (potentially at the cost of performance) or be run on demand (e.g., I can see an online-ml server setting it to False, then loading and running it once a day or something like that).
    3. Optional support for adding a UID, which is just another variable in the list added to the queue! I've found this incredibly useful in production cases where you don't just want a prediction back, but want to look more into the closest points, metadata-wise. The data in the window should be useful beyond having x, y, and class, and allowing an additional identifier lets the implementer keep more information elsewhere.

    For the reasons in 3, I renamed the BaseNeighbors class to be KNeighbors because people can use it as a very simple model to return the neighbors verbatim.

    This is likely a start (and I have not run tests locally), so we can discuss further changes that need to be made, a strategy for making them, and which features / additional classes of the old implementation we want to preserve. I had this on another branch, but for some reason running pre-commit made changes outside of my changes (perhaps a different version of black or flake8 or something?), so I created a fresh clone and added the changed files verbatim.

    Signed-off-by: vsoch [email protected]

    opened by vsoch 63
  • Using Creme on non scikit learn dataset.


    Hi @MaxHalford ! Apologies for the beginner question.

    How do we use creme on non scikit-learn datasets? Basically, I have two dictionaries.

    X=

    {(Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'AAPL'): -0.344719768623466, (Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'CSCO'): 0.20302817925763325, (Timestamp('2012-01-10 00:00:00'), 'volume_change_ratio', 'INTC'): -0.13517037149368347,

    And Y =

    {(Timestamp('2012-01-10 00:00:00'), 'AAPL'): -0.0013231888852133222, (Timestamp('2012-01-10 00:00:00'), 'CSCO'): 0.005841741901221553, (Timestamp('2012-01-10 00:00:00'), 'INTC'): -0.006252442360296984, (Timestamp('2012-01-10 00:00:00'), 'MSFT'): -0.014727011494252928, (Timestamp('2012-01-10 00:00:00'), 'SPY'): -0.003097653527452948, (Timestamp('2012-01-11 00:00:00'), 'AAPL'): -0.0004970178926441138, (Timestamp('2012-01-11 00:00:00'), 'CSCO'): 0.002621919244887305, (Timestamp('2012-01-11 00:00:00'), 'INTC'): 0.0019379844961240345,

    And I want to do a basic iterative linear regression on the data sets.

    Best, Andrew
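
    For context, River doesn't require any particular dataset format: each observation is just a dict of features plus a target, so it is enough to regroup the dictionaries above. A minimal sketch, assuming the intent is one observation per (timestamp, ticker) pair (X and Y are the dictionaries from the question):

    from river import linear_model, preprocessing

    model = preprocessing.StandardScaler() | linear_model.LinearRegression()

    # Regroup the flat (timestamp, feature, ticker) keys into one
    # feature dict per (timestamp, ticker) observation.
    observations = {}
    for (ts, feature, ticker), value in X.items():
        observations.setdefault((ts, ticker), {})[feature] = value

    for key in sorted(observations):
        if key in Y:  # only learn when the target is known
            x, y = observations[key], Y[key]
            model.predict_one(x)  # predict before learning
            model.learn_one(x, y)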

    opened by andrewczgithub 34
  • Numpy import error with River


    Hi,

    I read through all previous issues to make sure I have the right versions. I have:

    • numpy 1.20.1
    • river 0.10.1

    However, I still got: "ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 232 from C header, got 216 from PyObject"

    What should I do? Thank you!

    opened by mai-n-coleman 32
  • Bandits regressors for model selection (new PR to use Github CI/CD)


    Description

    The PR introduces bandits (epsilon-greedy and UCB) for model selection (see issue #270). The PR concerns only regressors, but I can add the classifiers in a subsequent PR.

    Using the classes is straightforward:

    bandit = UCBRegressor(models=models, metric=metrics.MSE())

    for x, y in data.take(N):
        y_pred = bandit.predict_one(x=x)
        bandit.learn_one(x=x, y=y)

    best_model = bandit.best_model

    There are convenience methods such as:

    • percentage_pulled: get the percentage of times each arm was pulled
    • best_model: return the model with the highest average reward

    I also added an add_models method so the user can add models on the fly.

    I am also working on a notebook that studies the behavior of the bandits for model selection. The notebook also includes Exp3, which seems promising but has numerical stability issues and yields counter-intuitive results (see section 3 of the notebook). That's why I kept it out of this PR. More generally, the performance of UCB and epsilon-greedy is rather good, but there seems to be some variance in it.

    Improvements

    It's still WIP on the following points:

    • docstrings: mainly add examples + cleaning.
    • some comments might be removed.
    • the names of the classes and methods are open to change.
    Feature 
    opened by etiennekintzler 25
  • Implement bandits


    We now have SuccessiveHalvingClassifier and SuccessiveHalvingRegressor in the model_selection module to perform, well, model selection. This allows doing hyperparameter tuning by initializing a model with different parameter configurations and running them against each other. All in all it seems to be working pretty well, and we should be getting some feedback on it soon. The implementations are handy because they implement the fit_one/predict_one interface and therefore make the whole process transparent to users. In other words, you can use them as you would any other model. This design will go a long way and should make things simple in terms of deployment (I'm thinking of you, chantilly).

    The next step would be to implement multi-armed bandits. In progressive validation, all the remaining models are updated. This is called a "full feedback" situation. Bandits, on the other hand, use partial feedback, because only one model is picked and trained at a time. This is more efficient because it results in fewer model evaluations, but it might also converge more slowly. Most bandit algorithms assume that the performance of the models is constant through time (this is called "stationarity"). However, the performance of each model is bound to change through time, because the act of picking modifies the model. Therefore, ideally, we need to look into non-stationary bandit algorithms. Here are some more details and references.

    Here are the algorithms I would like to see implemented:

    Also see this paper. We can probably open separate issues for each implementation. I think that the current best way of proceeding is to provide one implementation for regression and one for classification in each case, much like what is done for successive halving. There's also this paper that discusses delayed feedback and how it affects bandits.
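
    To make the partial feedback idea concrete, here is a minimal epsilon-greedy sketch. This is just the bare algorithm, not a proposed River API; it assumes each candidate model implements learn_one/predict_one and that the task is regression:

    import random

    def epsilon_greedy(models, stream, epsilon=0.1):
        """Train one arm per observation: explore a random model with
        probability epsilon, otherwise exploit the best arm so far.
        Returns the index of the best arm."""
        pulls = [0] * len(models)
        rewards = [0.0] * len(models)

        def mean_reward(j):
            return rewards[j] / pulls[j] if pulls[j] else float('inf')

        for x, y in stream:
            if random.random() < epsilon:
                i = random.randrange(len(models))
            else:
                i = max(range(len(models)), key=mean_reward)
            y_pred = models[i].predict_one(x)
            rewards[i] -= abs(y - y_pred)  # reward = negative absolute error
            pulls[i] += 1
            models[i].learn_one(x, y)      # partial feedback: only one arm updated

        return max(range(len(models)), key=mean_reward)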

    Feature 
    opened by MaxHalford 24
  • Error when using OutputCodeClassifier for Code-size greater than 20


    Hello again,

    I was testing the OCC classifier with more than 90 classes and the accuracy is very poor. I assume I need a huge code size, so I was testing different code sizes (starting with a code size of 10) and recording the accuracy, when I came to a code size of 40 and received the following error: OverflowError: Python int too large to convert to C ssize_t. Is there a way we can modify the OCC classifier to allow for compact codes (as short as possible) while still providing enough discriminating power between the different classes?

    opened by Yasmen-Wahba 23
  • Adaptive Random Forest Regressor/Hoeffding Tree Regressor splitting gives AttributeError


    Versions

    • River version: 0.1.0
    • Python version: 3.7.1
    • Operating system: Windows 10 Enterprise v1803

    Describe the issue

    I have been playing around with River/Creme for a couple of weeks now, and it's super useful for a project I am currently working on. That being said, I'm still pretty new to the workings of the algorithms, so I'm unsure whether this is a bug or an issue with my setup.

    When I call learn_one on the ARFR or HTR, I receive: AttributeError: 'NoneType' object has no attribute '_left', from line 113 in best_evaluated_split_suggestion.py.

    I have implemented a delayed prequential evaluation algorithm, and inspecting the loop, the error seems to be thrown once the first delay period has been exceeded, i.e. when the model can first make a prediction that isn't zero. Before this point, learn_one doesn't throw this error.

    Currently, I am using the default ARFR as in the example, with a simple Linear Regression as the leaf_model. The linear regression model itself has been working with my data, when not used in the ARFR. I want to try the ensemble/tree models with the data to see if accuracy is improved due to the drift detection methods that are included.

    Has anyone also seen this error being thrown, or know the causes of it? Let me know if more information is needed! Thanks.

    Bug 
    opened by JPalm1 23
  • Feature/l1 implementation


    Yo @MaxHalford check out this

    [Thanks to @gbolmier for inspiration]

    Addresses #618. I've decided to implement only cumulative L1 because:

    • naive L1 is just not efficient, plus it will likely take more hassle than it should (e.g. a sign method for VectorDict)
    • the truncated gradient approach, according to the paper, is overshadowed by cumulative L1, plus it takes 3 params to tune instead of 1 (no thanks, as if online model tuning is not tricky enough); a sketch of the cumulative approach follows below
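
    For reference, here is a minimal sketch of the cumulative L1 penalty from Tsuruoka et al. (2009), written against plain dicts rather than VectorDict; it illustrates the technique and is not the PR's actual code:

    def sgd_step_cumulative_l1(w, grad, lr, l1, state):
        """One SGD step followed by the cumulative L1 penalty.

        state['u'] is the total L1 penalty each weight could have received
        so far; state['q'] maps each weight to the penalty it actually
        received. Clipping at zero is what produces sparse weights.
        """
        state['u'] += lr * l1
        u, q = state['u'], state['q']
        for i, g in grad.items():
            w[i] = w.get(i, 0.0) - lr * g  # plain gradient step
            z = w[i]
            if w[i] > 0:
                w[i] = max(0.0, w[i] - (u + q.get(i, 0.0)))
            elif w[i] < 0:
                w[i] = min(0.0, w[i] + (u - q.get(i, 0.0)))
            q[i] = q.get(i, 0.0) + (w[i] - z)
        return w

    # Usage: carry state = {'u': 0.0, 'q': {}} across steps.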

    Done some tests and showcase: https://gist.github.com/ColdTeapot273K/b7865bbfb9ad2e473b474c39c3d40413

    Bonus:

    • works with learn_many out of the box
    • appears to be better than scikit-learn's implementation of L1 on SGDRegressor (I think it uses truncated gradient under the hood?)

    As you might notice, this implementation implies that only one of L1 or L2 is used at a time. A small price to pay for the ability to finally do proper, optimal sparse feature selection.

    I got problems with vectorizing the penalty. I didn't see a graceful way of imitating boolean indexing in VectorDict, or at least a way to transfer from the numpy-restricted computation domain into VectorDict. So I decided to just leave the vectorized version as a dry-run thing with a test attached (see gist); maybe you'll have better ideas (@MaxHalford, you should have access to my forks).

    Cheers.

    opened by ColdTeapot273K 22
  • Using Creme model in Python 2


    Hi Guys,

    Suppose I trained a logistic regression classifier using Python 3.x, as Creme only supports Python 3. How can I use the model for classification in Python 2?

    opened by aamirkhan34 22
  • Question about centroid changes


    Heyo! So this might be better suited for chantilly, but it's generally about river so I hope it's okay to put it here. I'm creating a server that has a cluster model on the backend to cluster types of errors. I was planning on using a cluster model and then assigning new errors (saved as objects in the database) to clusters, but I realize that if I have a preference for models with a changing number of centers, then my assignments would also likely change over time (and no longer be valid). This makes the idea of saving the cluster id with the error object not a great one. But then I was thinking: as long as we have some kind of ability for a model to output changes, e.g.:

    • Cluster 8 no longer exists
    • Clusters 10 and 12 were merged into 10
    • Cluster 5 was split into 5 and 101

    Then it could be feasible to act on those events, and in the case of the above:

    • Remove the assignment of any errors to cluster 8 (reclassify if it's computationally easy)
    • find objects assigned to 10 and 12, assign all to 10
    • Remove the assignment of 5, and reclassify if feasible

    So my question: how do we integrate this functionality into river, at least exposing enough for a wrapper of some kind to assess changes? Or is it just a bad design idea on my part?
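
    One possible shape for this, as a minimal sketch: a wrapper that diffs the model's cluster ids after every update and emits events. It assumes the wrapped model exposes a centers mapping whose keys can change over time, which is a hypothetical interface rather than something every River clusterer guarantees; merges and splits would still need model-specific logic:

    class ClusterChangeTracker:
        """Hypothetical wrapper reporting cluster ids that appear or
        disappear after each update."""

        def __init__(self, model, on_event=print):
            self.model = model
            self.on_event = on_event

        def learn_one(self, x):
            before = set(self.model.centers)
            self.model.learn_one(x)
            after = set(self.model.centers)
            for cid in before - after:
                self.on_event(('removed', cid))
            for cid in after - before:
                self.on_event(('added', cid))
            return self

        def predict_one(self, x):
            return self.model.predict_one(x)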

    opened by vsoch 21
  • Timeseries forecast evaluation


    Two matters here:

    1. After introducing the ThresholdFilter in the time_series pipeline, it is now hard to evaluate models on series that contain anomalies. Is it possible to "pass through" the 1/0 anomaly score to the output of model.forecast()? That way one could skip metrics.update for those anomalies when looping over the dataset.

    2. For a-posteriori evaluation of a time_series forecaster, I am trying to replay the dataset with a pre-trained model and see how that model would have performed at "reconstructing" the whole dataset. To do this I would simply need to call model.forecast() using values in the past. Is this possible? How should I call the forecast method? And more generally, what is the difference between horizon and xs in the forecast signature?
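
    On the second point, one reading of the signature (worth double-checking against the docs): horizon is the number of future steps to predict, while xs is an optional list of exogenous feature dicts, one per forecasted step. Replaying could then interleave one-step forecasts with updates; a minimal sketch with hypothetical data and model orders:

    from river import time_series

    model = time_series.SNARIMAX(p=1, d=0, q=1)  # hypothetical orders

    history = [3.1, 3.4, 2.9, 3.8, 4.0]  # hypothetical past values

    # At each step, forecast one step ahead from past values only,
    # then reveal the true value to the model.
    for y in history:
        y_pred = model.forecast(horizon=1)[0]
        model.learn_one(y)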

    opened by dberardo-com 18
  • venv caching for faster CI


    As discussed with @MaxHalford on Discord, caching the virtual environment allows us to completely bypass pip installs. The existing approach caches downloaded pip packages, but pip install still takes a considerable amount of time even then.

    opened by boragokbakan 0
  • Onelearn implementation: AMF & Mondrian Tree Classifiers


    Description of the PR

    Hi! 👋

    This is a first version of an implementation of the Onelearn library (classifiers only) in River. It contains:

    • Mondrian Tree (Base and Classifier)
    • Aggregated Mondrian Forest (Base and Classifier)

    The original repository with proof of working implementation can be found here (see script.py).


    Known drawbacks of the current implementation

    • Management of labels: labels must be positive integers at the moment, since that just makes my life easier. Please tell me how you would prefer this to be implemented in River's framework (I've seen labels presented as a dictionary, but I'm not sure what these dictionaries contain: int? string?). It should be a simple trick for me to change how labels are managed, I think; I just need to know your preferences.
    • Tree branches: I use branches mainly as the global tree structure; I'm not quite sure how to use them better. Feel free to comment on that too if you have better suggestions!
    • Examples of usage: MondrianTreeClassifier and AMFClassifier need usage examples. I actually already implemented one on my repo here, but I didn't manage to compile River so I couldn't get the scores 😢
    • Random state: I have a random state attribute, but it's not used at the moment. I'm not exactly sure where to put the random state in the Mondrian process; I'm afraid it'd break the whole thing. If any expert in random forests could advise me on where the random state should be placed, that would be great ❤️

    Notes on the utils

    Currently I placed two functions in the utils section:

    • sample_discrete
    • log_sum_2_exp

    It might seem overkill to place them in utils right now, looking at where they're used in the code, but I'll need them for the regressors too when the time comes. Maybe there's a better place for them, keeping the regressors in mind though.

    opened by AlexandreChaussard 6
  • add xstream


    We are currently working on implementing PySAD in River, for the "polytechnique-project". This is a pull request adding the "xstream" method.

    Sophie Normand

    opened by Sophie-Normand 0
  • Refactor benchmarks


    Hi 👋,

    I spent some time on the benchmarks and tried to refactor them. I used Vega for visualising the performance. This requires the mkdocs-charts-plugin, which sometimes fails for the livedocs when reloading the documentation. Further, the index.md in the benchmark folder might get pretty big at some point.

    As there are some drawbacks, I would like some feedback from you to see if I am on the right track.

    Best, Cedric

    Improvement 
    opened by kulbachcedric 6