Framework and Library for Distributed Online Machine Learning

Overview

Jubatus

Jubatus is an online machine learning framework that runs in a distributed environment.

See http://jubat.us/ for details.

Quick Start

We officially support Red Hat Enterprise Linux (RHEL) 6.2 or later (64-bit) and Ubuntu Server 14.04 LTS / 16.04 LTS / 18.04 LTS (64-bit). On supported systems, you can install all components of Jubatus using binary packages.

See QuickStart for a detailed description.

Red Hat Enterprise Linux 6.2 or later (64-bit)

Run the following command to register the Jubatus Yum repository with the system.

// For RHEL 6
$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-2.el6.x86_64.rpm

// For RHEL 7
$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/7/stable/x86_64/jubatus-release-7-2.el7.x86_64.rpm

Then install the jubatus and jubatus-client packages.

$ sudo yum install jubatus jubatus-client

Now Jubatus is installed in /usr/bin/juba*.

$ jubaclassifier -f /usr/share/jubatus/example/config/classifier/pa.json

Ubuntu Server (64-bit)

Write the line for your release to /etc/apt/sources.list.d/jubatus.list to register the Jubatus Apt repository with the system.

// For Ubuntu 12.04 (Precise) - Deprecated (unsupported)
deb http://download.jubat.us/apt/ubuntu/precise binary/

// For Ubuntu 14.04 (Trusty)
deb http://download.jubat.us/apt/ubuntu/trusty binary/

// For Ubuntu 16.04 (Xenial)
deb http://download.jubat.us/apt/ubuntu/xenial binary/

// For Ubuntu 18.04 (Bionic)
deb [trusted=yes] http://download.jubat.us/apt/ubuntu/bionic/binary /

Now install the jubatus package.

$ sudo apt-get update
$ sudo apt-get install jubatus

Now Jubatus is installed in /opt/jubatus/bin/juba*.

$ source /opt/jubatus/profile
$ jubaclassifier -f /opt/jubatus/share/jubatus/example/config/classifier/pa.json

Other Platforms

For other platforms, refer to the documentation.

License

LGPL 2.1

Third-party libraries included in Jubatus

The Jubatus source tree includes the following third-party library.

  • cmdline (under BSD 3-Clause License)

Jubatus requires the jubatus_core library. jubatus_core contains Eigen and a fork of pficommon. Eigen is licensed under MPL2 (partially under LGPL 2.1 or 2.1+). The fork of pficommon is licensed under the New BSD License.

Update history

The update history can be found in the ChangeLog or on the WikiPage.

Contributors

Patches contributed by those people.

Comments
  • Test failure in Travis

    I've updated the travis config in 1fa7a9552cf3ff187af5bd2eb137f0d07c77475d and this somewhat improved the situation:

    test other stale 
    opened by kmaehashi 19
  • Eliminate RPC error 2

    https://github.com/jubatus/jubatus/blob/develop/src/common/mprpc/rpc_server.cpp#L41

    Currently, none of the MessagePack-RPC clients (C++, Python, Ruby, Java) interpret msgpack::rpc::ARGUMENT_ERROR (it is actually an int value "2"). This produces a meaningless "RPC Error 2", which is unfriendly for users.

    I think it's better to send back a string message like "Type mismatch error in argument" instead.
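
The issue proposes that the server send a string instead of the bare integer. As a rough illustration of the same idea seen from the client, the sketch below maps a numeric error value to a readable message. Only the value 2 and the message text come from the issue; `ERROR_MESSAGES`, `RPCError`, `describe_rpc_error` and `raise_for_error` are hypothetical names and are not part of any Jubatus client library.

```python
# Hypothetical client-side helper; not taken from any Jubatus client.
ARGUMENT_ERROR = 2  # integer value mentioned in the issue

# Assumed mapping from numeric RPC error codes to readable messages.
ERROR_MESSAGES = {
    ARGUMENT_ERROR: "Type mismatch error in argument",
}


class RPCError(Exception):
    """Raised when the server reports an error for an RPC call."""


def describe_rpc_error(error):
    """Return a readable message for an error value received over RPC."""
    if isinstance(error, int):
        # Unknown codes fall back to the old numeric form.
        return ERROR_MESSAGES.get(error, "RPC Error %d" % error)
    # A server that already sends a string message is passed through.
    return str(error)


def raise_for_error(error):
    """Raise RPCError with a readable message if an error was returned."""
    if error is not None:
        raise RPCError(describe_rpc_error(error))
```

With this in place, a client would report "Type mismatch error in argument" rather than "RPC Error 2", and would keep working unchanged if the server later starts sending string messages.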

    jenerator improvement 
    opened by kmaehashi 17
  • servers: Refactoring divide server into ML module and RPC server

    The current ML tasks (classifier, recommender, ...) are implemented in *_serv.cpp, which is tightly coupled with the server module.

    Related to #250, where I commented on refactoring steps 1 and 2.

    refactoring 
    opened by suma 16
  • Requirements for Model Data Format

    Currently, the save method just dumps mixable data structures using the pficommon serializer. Points that need to be considered include interoperability with other software and upward compatibility (for example, including a version number in the file).

    See also #222.
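
To make the version-number idea concrete, here is a minimal sketch of a self-describing model file: a fixed header with magic bytes and a format version, followed by JSON metadata and an opaque payload. This is an assumed layout for illustration only, not the format Jubatus uses; `MAGIC`, `FORMAT_VERSION`, `save_model` and `load_model` are hypothetical names.

```python
import io
import json
import struct

MAGIC = b"MODL"      # hypothetical magic bytes identifying the file type
FORMAT_VERSION = 1   # hypothetical format version written by this code
# Header layout: magic (4 bytes), version (uint16), metadata length (uint32),
# all big-endian with no padding.
HEADER = struct.Struct(">4sHI")


def save_model(stream, metadata, payload):
    """Write the header, JSON metadata and raw model bytes to a stream."""
    meta = json.dumps(metadata, sort_keys=True).encode("utf-8")
    stream.write(HEADER.pack(MAGIC, FORMAT_VERSION, len(meta)))
    stream.write(meta)
    stream.write(payload)


def load_model(stream):
    """Read a model file, rejecting unknown or newer formats."""
    raw = stream.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("truncated header")
    magic, version, meta_len = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("not a model file")
    if version > FORMAT_VERSION:
        raise ValueError("unsupported format version %d" % version)
    metadata = json.loads(stream.read(meta_len).decode("utf-8"))
    return version, metadata, stream.read()
```

Because the version is checked before anything else is parsed, a reader can refuse a file written by a newer release instead of misinterpreting it, and the JSON metadata stays readable by other software.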

    improvement core 
    opened by kmaehashi 14
  • Improve get_status

    I improved the get_status method.

    Interface

    • Not Changed

    Changed

    • Don't return the following keys in standalone mode
      • interval_sec
      • interval_count
      • zk
      • use_cht (0 or 1)
    • Add following keys
      • Key: type, Value: machine learning type (e.g. classifier, anomaly ...)
      • Key: loglevel, Value: Command-line argument or default value (converted to a string like INFO, FATAL ...)
      • Key: logdir, Value: Command-line argument or "" (empty string, which is the default value)
      • Key: configpath, Value:
        • Standalone Mode: Command-line argument
        • Distributed Mode: Node path in zookeeper
      • Key: clock_time, Value: System clock time (epoch time) when receiving a request (sec)
      • Key: uptime, Value: How long the process has been running (sec)
      • Key: start_time, Value: System clock time (epoch time) when process started (sec)
      • Key: pid, Value: Process ID
      • Key: user, Value: User name who started process
      • Key: last_saved, Value: System clock time (epoch time) when the model file was last saved
      • Key: last_saved_path, Value: Path of the model file that was last saved
      • Key: last_loaded, Value: System clock time (epoch time) when the model file was last loaded
      • Key: last_loaded_path, Value: Path of the model file that was last loaded
      • The following are returned only in distributed mode
        • Key: name, Value: Command-line argument
        • Key: zookeeper_timeout, Value: Command-line argument or default value
        • Key: interconnect_timeout, Value: Command-line argument or default value
        • Key: connected_zookeeper, Value: Host and port of the ZooKeeper server that the process is connected to (e.g. 127.0.0.1_2181)
        • Key: mixer, Value: Command-line argument or default value
        • Status of mixer

    example

    In Python, the output of pprint is as follows:

    • https://gist.github.com/rimms/8434705
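
As a rough picture of the standalone-mode keys listed above, the following sketch assembles a status dictionary with the Python standard library. It is not the server implementation: `build_status` and `_START_TIME` are hypothetical names, and keeping every value as a string is an assumption made here for uniformity.

```python
import getpass
import os
import time

# Recorded once when the module is loaded; stands in for the process start.
_START_TIME = int(time.time())


def _current_user():
    """Best-effort user name; empty string if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return ""


def build_status(ml_type, loglevel="INFO", logdir="", configpath=""):
    """Return a dict with the standalone-mode keys described in the list."""
    now = int(time.time())
    return {
        "type": ml_type,
        "loglevel": loglevel,
        "logdir": logdir,                 # "" is the default per the list
        "configpath": configpath,
        "clock_time": str(now),           # epoch seconds at request time
        "uptime": str(now - _START_TIME), # seconds since start
        "start_time": str(_START_TIME),   # epoch seconds at start
        "pid": str(os.getpid()),
        "user": _current_user(),
    }
```

Calling `build_status("classifier")` returns none of the distributed-only keys such as zk or interval_sec, which matches the standalone behaviour described above.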

    TBD (I will create new issues)

    • I want to change the values of the xxxpath keys to full paths.
    • We should reconsider the status of storages.

    Updated 2014/01/20 10:34: User ID -> User name

    Updated 2014/01/27 15:56: Remove join, Use empty string as default value of logdir, fixed typo (valu -> value)

    opened by rimms 13
  • Support import/export for model compatibility between different environments

    The current save/load operations support model reuse only in the same environment where the models were built. Users, however, might want to use the models in another environment. This does not work today because the saved models lack some of the information needed to reconstruct the model, especially information that only ZooKeeper knows and the servers do not.

    For that purpose, we could make save/load more self-contained and compatible with other environments, or create a new interface such as import/export to support this functionality.

    discussion improvement server stale 
    opened by hido 12
  • Migrate from google-glog to log4cxx

    Finally removes google-glog dependency! (fix #746)

    ~~Note that this pull-req depends on #807 and cannot be merged before #807.~~

    • Modifies the following command line options:
      • Adds the --log-config option. Users can now specify a log4cxx XML configuration file, which can be used for server/proxy/jubavisor.
      • Removes the --loglevel option. Users should specify the desired log levels in the XML configuration file.
      • Modifies the --logdir option. This option is now only used for printing ZooKeeper logs.
      • Deprecates the --debug option for interactive commands (jubaconfig and jubactl). The --debug option used to have the effect of printing logs to standard output instead of files. As I think those commands should not write logs to files, they now always print their logs to standard output.
    • I added #include <errno.h> to some files. This is because errno was being used without including errno.h. As glog/logging.h indirectly includes errno.h, the problem did not become apparent until now.
    • Installs example XML configuration files to ${PREFIX}/jubatus/share/example/log/log4cxx.xml.
    opened by kmaehashi 11
  • Return user defined errors in RPC

    In the current IDL, users can only define return values. To return error codes or similar information, they need to define a user-defined message that includes a success/fail flag and an error code. It's not cool. I want to define possible error types in the IDL, and generated servers and clients must transfer these error objects through RPC.

    For example:

    void update_node(0: string id, 1: map<string, string> property) throws unknown_id, invalid_property
    

    This API may raise an unknown_id exception or an invalid_property exception.

    On the server side, we need to catch these user-defined exceptions. On the client side, we check the error object and deserialize it into a user-defined exception.

    This modification is not difficult, but it breaks compatibility.
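
The server-side catch and client-side deserialization described above can be sketched as follows. This is an illustration of the idea, not generated jenerator code: the error names come from the IDL example, while `USER_ERRORS`, `call_on_server`, `unwrap_on_client`, the dictionary wire format and the "node-1" test data are all assumptions.

```python
class UnknownId(Exception):
    """Corresponds to unknown_id in the IDL example."""


class InvalidProperty(Exception):
    """Corresponds to invalid_property in the IDL example."""


# Assumed registry shared by server and client: IDL error name -> class.
USER_ERRORS = {
    "unknown_id": UnknownId,
    "invalid_property": InvalidProperty,
}


def _error_name(exc):
    """Find the IDL name for a caught user-defined exception."""
    for name, cls in USER_ERRORS.items():
        if isinstance(exc, cls):
            return name
    return None


def call_on_server(handler, *args):
    """Server side: run the handler and return an (error, result) pair."""
    try:
        return None, handler(*args)
    except tuple(USER_ERRORS.values()) as exc:
        # Serialize the exception into a plain error object.
        return {"type": _error_name(exc), "message": str(exc)}, None


def unwrap_on_client(error, result):
    """Client side: turn an error object back into an exception."""
    if error is None:
        return result
    exc_class = USER_ERRORS.get(error.get("type"), RuntimeError)
    raise exc_class(error.get("message", ""))


def update_node(node_id, prop):
    """Toy handler with the signature from the IDL example."""
    if node_id != "node-1":
        raise UnknownId(node_id)
    if not prop:
        raise InvalidProperty("empty property")
```

A call such as `unwrap_on_client(*call_on_server(update_node, "missing", {"a": "b"}))` raises `UnknownId` on the client, so callers can use ordinary exception handling instead of inspecting a success flag.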

    jenerator improvement stale 
    opened by unnonouno 11
  • recommender: clear_row don't free acquired resources

    I measured the memory size before and after calling clear_row.

    The results are as follows. The values are RSS sizes obtained using get_status.

    | method | after start-up | after update_row 100K ID | after clear_row 10K ID |
    | --- | --- | --- | --- |
    | lsh | 4312 | 521888 | 522048 |
    | minhash | 4308 | 216708 | 216840 |
    | euclid_lsh | 4336 | 236324 | 236672 |
    | inverted_index | 4292 | 434044 | 434192 |

    We cannot run jubarecommender for a long time without some workaround (e.g. stopping the service or adding memory).

    improvement algorithm 
    opened by rimms 11
  • client socket is to stale.

    When a client socket is created and no RPC is called for some seconds, the Jubatus server closes the socket. The client then cannot call RPC because the socket was closed, so timeout_error is returned. This is not useful for the client; it should try to reconnect when the socket has been closed by the server.
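
The reconnect behaviour requested here could look roughly like the sketch below: a wrapper that opens a new connection and retries when the transport reports that the peer closed the socket. It is not code from any Jubatus client; `ConnectionClosed`, `ReconnectingClient` and `FakeConnection` are hypothetical names, and the fake transport exists only to make the example runnable.

```python
class ConnectionClosed(Exception):
    """Raised by the transport when the peer has closed the socket."""


class ReconnectingClient:
    """Retries a call on a fresh connection if the old one was closed."""

    def __init__(self, connect, max_retries=1):
        self._connect = connect          # callable returning a connection
        self._max_retries = max_retries  # reconnect attempts per call
        self._conn = None

    def call(self, method, *args):
        attempts = 0
        while True:
            if self._conn is None:
                self._conn = self._connect()
            try:
                return self._conn.call(method, *args)
            except ConnectionClosed:
                # Drop the stale connection and retry if allowed.
                self._conn = None
                attempts += 1
                if attempts > self._max_retries:
                    raise


class FakeConnection:
    """Stand-in transport: every call fails if created as stale."""

    def __init__(self, stale=False):
        self.stale = stale

    def call(self, method, *args):
        if self.stale:
            raise ConnectionClosed(method)
        return (method, args)
```

Retrying transparently is only safe when repeating the call is harmless, so a real client would probably want to limit this to the case where the request was never sent.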

    client improvement stale 
    opened by kumagi 11
  • A server always returns no response and raises request-time-out

    Yesterday, the client library always raised a request time-out error, and the server returned no response. The situation was as follows:

    • The server process is alive
    • Even the get_status method does not respond
    • No other clients use the server
    • CPU usage of the server is very low
    • I forgot to write logs to a file ;-(

    I suspect that the server acquired a write lock and accidentally cannot release it.

    If anyone else has faced the same situation, please let me know.

    bug server 
    opened by unnonouno 10
  • Enabled -std=c++11 option if compiler supported

    Depends on:

    • https://github.com/jubatus/jubatus-mpio/pull/21
    • https://github.com/jubatus/jubatus-msgpack-rpc/pull/23
    • https://github.com/jubatus/jubatus_core/pull/254

    related issue: #945

    improvement other 
    opened by kazuki 0
  • [jenerator] Serializable Java client

    In some cases, the Jubatus Java client needs to be serialized, for example when it is used in a Storm topology. The current Jubatus client throws NotSerializableException in this process. This patch enables jenerator to create Serializable clients and user-defined data.

    jenerator improvement 
    opened by Lewuathe 2