
Overview

Distributed XGBoost on Ray


XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

XGBoost-Ray

  • enables multi-node and multi-GPU training,
  • integrates seamlessly with the distributed hyperparameter optimization library Ray Tune,
  • comes with advanced fault tolerance handling mechanisms, and
  • supports distributed dataframes and distributed data loading.

All releases are tested on large clusters and workloads.

Installation

You can install the latest XGBoost-Ray release from PyPI:

pip install xgboost_ray

If you'd like to install the latest master, use this command instead:

pip install git+https://github.com/ray-project/xgboost_ray.git#xgboost_ray

Usage

XGBoost-Ray provides a drop-in replacement for XGBoost's train function. To pass data, instead of using xgb.DMatrix you will have to use xgboost_ray.RayDMatrix.

Distributed training parameters are configured with a xgboost_ray.RayParams object. For instance, you can set the num_actors property to specify how many distributed actors you would like to use.

Here is a simplified example (which requires sklearn):

Training:

from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(
        num_actors=2,  # Number of remote actors
        cpus_per_actor=1))

bst.save_model("model.xgb")
print("Final training error: {:.4f}".format(
    evals_result["train"]["error"][-1]))

Prediction:

from xgboost_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import xgboost as xgb

data, labels = load_breast_cancer(return_X_y=True)

dpred = RayDMatrix(data, labels)

bst = xgb.Booster(model_file="model.xgb")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))

print(pred_ray)

scikit-learn API

XGBoost-Ray also features a scikit-learn API that fully mirrors the pure XGBoost scikit-learn API, providing a complete drop-in replacement. The following estimators are available:

  • RayXGBClassifier
  • RayXGBRegressor
  • RayXGBRFClassifier
  • RayXGBRFRegressor
  • RayXGBRanker

Example usage of RayXGBClassifier:

from xgboost_ray import RayXGBClassifier, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

seed = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=42
)

clf = RayXGBClassifier(
    n_jobs=4,  # In XGBoost-Ray, n_jobs sets the number of actors
    random_state=seed
)

# The scikit-learn API will automatically convert the data
# to RayDMatrix format as needed.
# You can also pass X as a RayDMatrix, in which case
# y will be ignored.

clf.fit(X_train, y_train)

pred_ray = clf.predict(X_test)
print(pred_ray)

pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)

# It is also possible to pass a RayParams object
# to the fit/predict/predict_proba methods - this will
# override the n_jobs set during initialization

clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))

pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
print(pred_ray)

Things to keep in mind:

  • The n_jobs parameter controls the number of actors spawned. You can pass a RayParams object to the fit/predict/predict_proba methods as the ray_params argument for greater control over resource allocation. Doing so will override the value of n_jobs with the value of the ray_params.num_actors attribute. For more information, refer to the Resources section below.
  • By default, n_jobs is set to 1, which means training will not be distributed. Make sure to either set n_jobs to a higher value or pass a RayParams object as outlined above in order to take advantage of XGBoost-Ray's functionality.
  • After calling fit, additional evaluation results (e.g. training time, number of rows, callback results) will be available under the additional_results_ attribute (see the snippet after this list).
  • XGBoost-Ray's scikit-learn API is based on XGBoost 1.4. While we try to support older XGBoost versions, please note that this library is only fully tested and supported for XGBoost >= 1.4.
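
For example, the extra results can be inspected directly on a fitted estimator. This is a minimal sketch reusing the clf from the example above; the exact keys in the dictionary depend on your XGBoost-Ray version:

# `additional_results_` is a plain dictionary populated during `fit`,
# containing e.g. the total training time and the number of rows processed.
for key, value in clf.additional_results_.items():
    print(key, value)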

For more information on the scikit-learn API, refer to the XGBoost documentation.

Data loading

Data is passed to XGBoost-Ray via a RayDMatrix object.

The RayDMatrix lazily loads data and stores it sharded in the Ray object store. The Ray XGBoost actors then access these shards to run their training.

A RayDMatrix supports various data and file types, such as Pandas DataFrames, Numpy arrays, CSV files, and Parquet files.

Example loading multiple parquet files:

import glob    
from xgboost_ray import RayDMatrix, RayFileType

# We can also pass a list of files
path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

# This argument will be passed to `pd.read_parquet()`
columns = [
    "passenger_count",
    "trip_distance", "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "fare_amount", "extra", "mta_tax", "tip_amount",
    "tolls_amount", "total_amount"
]

dtrain = RayDMatrix(
    path, 
    label="passenger_count",  # Will select this column as the label
    columns=columns, 
    filetype=RayFileType.PARQUET)

Hyperparameter Tuning

XGBoost-Ray integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed XGBoost models. You can run multiple XGBoost-Ray training runs in parallel, each with a different hyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training code to a function, and pass the function to tune.run. Internally, train will detect if tune is being used and will automatically report results to tune.

Example using XGBoost-Ray with Ray Tune:

from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

num_actors = 4
num_cpus_per_actor = 1

ray_params = RayParams(
    num_actors=num_actors,
    cpus_per_actor=num_cpus_per_actor)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params)
    bst.save_model("model.xgb")

from ray import tune

# Specify the hyperparameter search space.
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9)
}

# Make sure to use the `get_tune_resources` method to set the `resources_per_trial`
analysis = tune.run(
    train_model,
    config=config,
    metric="train-error",
    mode="min",
    num_samples=4,
    resources_per_trial=ray_params.get_tune_resources())
print("Best hyperparameters", analysis.best_config)

Also see examples/simple_tune.py for another example.

Fault tolerance

XGBoost-Ray leverages the stateful Ray actor model to enable fault tolerant training. There are currently two modes implemented.

Non-elastic training (warm restart)

When an actor or node dies, XGBoost-Ray will retain the state of the remaining actors. In non-elastic training, the failed actors will be replaced as soon as resources are available again. Only these actors will reload their parts of the data. Training will resume once all actors are ready for training again.

You can set this mode in the RayParams:

from xgboost_ray import RayParams

ray_params = RayParams(
    elastic_training=False,  # Use non-elastic training
    max_actor_restarts=2,    # How often are actors allowed to fail
)

Elastic training

In elastic training, XGBoost-Ray will continue training with fewer actors (and on less data) when a node or actor dies. The missing actors are staged in the background and are reintegrated into training once they are back and have loaded their data.

This mode will train on less data for a period of time, which can impact accuracy. In practice, we found these effects to be minor, especially for large shuffled datasets. The immediate benefit is that training time is reduced significantly, to almost the same level as if no actors had died. Thus, especially when data loading takes up a large part of the total training time, this setting can dramatically speed up training for large distributed jobs.

You can configure this mode in the RayParams:

from xgboost_ray import RayParams

ray_params = RayParams(
    elastic_training=True,  # Use elastic training
    max_failed_actors=3,    # Only allow at most 3 actors to die at the same time
    max_actor_restarts=2,   # How often are actors allowed to fail
)

Resources

By default, XGBoost-Ray tries to determine the number of CPUs available and distributes them evenly across actors.

In the case of very large clusters or clusters with many different machine sizes, it makes sense to limit the number of CPUs per actor by setting the cpus_per_actor argument. Consider always setting this explicitly.

The number of XGBoost actors always has to be set manually with the num_actors argument.

Multi GPU training

XGBoost-Ray enables multi GPU training. The XGBoost core backend will automatically leverage NCCL2 for cross-device communication. All you have to do is to start one actor per GPU.

For instance, if you have 2 machines with 4 GPUs each, you will want to start 8 remote actors, and set gpus_per_actor=1. There is usually no benefit in allocating less (e.g. 0.5) or more than one GPU per actor.

You should divide the CPUs evenly across actors per machine, so if your machines have 16 CPUs in addition to the 4 GPUs, each actor should have 4 CPUs to use.

from xgboost_ray import RayParams

ray_params = RayParams(
    num_actors=8,
    gpus_per_actor=1,
    cpus_per_actor=4,   # Divide evenly across actors per machine
)

How many remote actors should I use?

This depends on your workload and your cluster setup. Generally there is no inherent benefit in running more than one remote actor per node for CPU-only training. This is because the XGBoost core can already leverage multiple CPUs via threading.

However, there are some cases when you should consider starting more than one actor per node:

  • For multi GPU training, each GPU should have a separate remote actor. Thus, if your machine has 24 CPUs and 4 GPUs, you will want to start 4 remote actors with 6 CPUs and 1 GPU each.
  • In a heterogeneous cluster, you might want to find the greatest common divisor for the number of CPUs. E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively, you should set the number of actors to 6 and the CPUs per actor to 4 (see the sketch after this list).
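
Here is a short sketch of the arithmetic from the last bullet (the node CPU counts are just the example values from above):

import math
from functools import reduce

from xgboost_ray import RayParams

node_cpus = [4, 8, 12]  # CPUs per node in the example cluster

# Greatest common divisor of the node sizes -> 4 CPUs per actor
cpus_per_actor = reduce(math.gcd, node_cpus)

# 24 CPUs in total -> 6 actors
num_actors = sum(node_cpus) // cpus_per_actor

ray_params = RayParams(
    num_actors=num_actors,
    cpus_per_actor=cpus_per_actor)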

Distributed data loading

XGBoost-Ray can leverage both centralized and distributed data loading.

In centralized data loading, the data is partitioned by the head node and stored in the object store. Each remote actor then retrieves its partitions by querying the Ray object store. Centralized loading is used when you pass centralized in-memory dataframes, such as Pandas dataframes or Numpy arrays, or when you pass a single source file, such as a single CSV or Parquet file.

from xgboost_ray import RayDMatrix

# This will use centralized data loading, as only one source file is specified
# `label_col` is a column in the CSV, used as the target label
ray_params = RayDMatrix("./source_file.csv", label="label_col")

In distributed data loading, each remote actor loads its data directly from the source (e.g. local hard disk, NFS, HDFS, S3), without a central bottleneck. The data is still stored in the object store, but locally to each actor. This mode is used automatically when loading data from multiple CSV or Parquet files. Please note that we do not check or enforce partition sizes in this case - it is your job to make sure the data is evenly distributed across the source files.

from xgboost_ray import RayDMatrix

# This will use distributed data loading, as four source files are specified
# Please note that you cannot schedule more than four actors in this case.
# `label_col` is a column in the Parquet files, used as the target label
dtrain = RayDMatrix([
    "hdfs:///tmp/part1.parquet",
    "hdfs:///tmp/part2.parquet",
    "hdfs:///tmp/part3.parquet",
    "hdfs:///tmp/part4.parquet",
], label="label_col")

Lastly, XGBoost-Ray supports distributed dataframe representations, such as Modin and Dask dataframes (used with Dask on Ray). Here, XGBoost-Ray will check on which nodes the distributed partitions are currently located, and will assign partitions to actors in order to minimize cross-node data transfer. Please note that we also assume here that partition sizes are uniform.

from xgboost_ray import RayDMatrix

# This will try to allocate the existing Modin partitions
# to co-located Ray actors. If this is not possible, data will
# be transferred across nodes
dtrain = RayDMatrix(existing_modin_df)

Data sources

Type                Centralized loading    Distributed loading
Numpy array         Yes                    No
Pandas dataframe    Yes                    No
Single CSV          Yes                    No
Multi CSV           Yes                    Yes
Single Parquet      Yes                    No
Multi Parquet       Yes                    Yes
Petastorm           Yes                    Yes
Ray MLDataset       Yes                    Yes
Dask dataframe      Yes                    Yes
Modin dataframe     Yes                    Yes
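
For example, a Dask dataframe can be passed directly to a RayDMatrix when Dask runs on top of Ray. This is a rough sketch rather than an official example: it assumes a running Ray cluster, the dask package installed, a placeholder CSV path and label column, and that ray.util.dask.enable_dask_on_ray is available (it is in recent Ray versions):

import ray
import dask.dataframe as dd
from ray.util.dask import enable_dask_on_ray
from xgboost_ray import RayDMatrix

ray.init()
enable_dask_on_ray()  # route Dask computations through Ray's scheduler

# Placeholder path and label column name
ddf = dd.read_csv("data.csv")
dtrain = RayDMatrix(ddf, label="label_col")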

Memory usage

XGBoost uses a compute-optimized data structure, the DMatrix, to hold training data. When converting a dataset to a DMatrix, XGBoost creates intermediate copies and ends up holding a complete copy of the full data. The data will be converted into the local data format (on a 64-bit system these are 64-bit floats). Depending on the system and the original dataset dtype, this matrix can thus occupy more memory than the original dataset.

The peak memory usage for CPU-based training is at least 3x the dataset size (assuming dtype float32 on a 64-bit system) plus about 400,000 KiB for other resources, such as operating system requirements and storing of intermediate results.

Example

  • Machine type: AWS m5.xlarge (4 vCPUs, 16 GiB RAM)
  • Usable RAM: ~15,350,000 KiB
  • Dataset: 1,250,000 rows with 1024 features, dtype float32. Total size: 5,000,000 KiB
  • XGBoost DMatrix size: ~10,000,000 KiB

This dataset will fit exactly on this node for training.

Note that the DMatrix size might be lower on a 32 bit system.
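
A quick back-of-the-envelope helper based on this rule of thumb (the constants come from the description above; treat the result as a rough lower bound, not an exact requirement):

def estimate_peak_train_memory_kib(num_rows, num_features, bytes_per_value=4):
    # Dataset size in KiB, assuming a dense float32 matrix (4 bytes per value)
    dataset_kib = num_rows * num_features * bytes_per_value / 1024
    # At least 3x the dataset size plus ~400,000 KiB of overhead
    return 3 * dataset_kib + 400_000

# The example dataset above: 1,250,000 rows x 1,024 features, float32
print(estimate_peak_train_memory_kib(1_250_000, 1024))  # ~15,400,000 KiB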

GPUs

Generally, the same memory requirements exist for GPU-based training. Additionally, the GPU must have enough memory to hold the dataset.

In the example above, the GPU must have at least 10,000,000 KiB (about 9.6 GiB) of memory. However, we found empirically that using a DeviceQuantileDMatrix seems to result in somewhat higher peak GPU memory usage (about 10%), possibly due to intermediate storage when loading data.

Best practices

In order to reduce peak memory usage, consider the following suggestions:

  • Store data as float32 or less. More precision is often not needed, and keeping data in a smaller format will help reduce peak memory usage for initial data loading.
  • Pass the dtype when loading data from CSV. Otherwise, floating point values will be loaded as np.float64 by default, increasing peak memory usage by 33% (see the sketch after this list).
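
One way to follow the CSV suggestion is to read the file with an explicit dtype and pass the resulting dataframe to RayDMatrix. This is a minimal sketch; the file name and label column are placeholders, and reading on the driver like this means centralized loading:

import numpy as np
import pandas as pd
from xgboost_ray import RayDMatrix

# Read floating point columns as float32 instead of the np.float64 default
df = pd.read_csv("data.csv", dtype=np.float32)
dtrain = RayDMatrix(df, label="label")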

Placement Strategies

XGBoost-Ray leverages Ray's Placement Group API (https://docs.ray.io/en/master/placement-group.html) to implement placement strategies for better fault tolerance.

By default, a SPREAD strategy is used for training, which attempts to spread all of the training workers across the nodes in a cluster on a best-effort basis. This improves fault tolerance since it minimizes the number of worker failures when a node goes down, but comes at the cost of increased inter-node communication. To disable this strategy, set the USE_SPREAD_STRATEGY environment variable to 0. If disabled, no particular placement strategy will be used.

Note that this strategy is used only when elastic_training is not used. If elastic_training is set to True, no placement strategy is used.

When XGBoost-Ray is used with Ray Tune for hyperparameter tuning, a PACK strategy is used. This strategy attempts to place all workers for each trial on the same node on a best-effort basis. This means that if a node goes down, it will be less likely to impact multiple trials.

When placement strategies are used, XGBoost-Ray will wait for 100 seconds for the required resources to become available, and will fail if the required resources cannot be reserved and the cluster cannot autoscale to provide them. You can set the PLACEMENT_GROUP_TIMEOUT_S environment variable to modify this timeout, as shown below.
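
Both knobs are plain environment variables, so a minimal sketch is to set them before calling train() (the values shown here are arbitrary):

import os

# Disable the SPREAD placement strategy for training workers
os.environ["USE_SPREAD_STRATEGY"] = "0"

# Wait up to 10 minutes for placement group resources instead of 100 seconds
os.environ["PLACEMENT_GROUP_TIMEOUT_S"] = "600"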

More examples

For complete end-to-end examples, please have a look at the examples folder.


Comments
  • Elastic failure handling

    Introduces elastic failure handling. Alive actors are not recreated on error, but stick around and don't have to load data again. We can also choose to not restart failed actors, continuing training with fewer actors.

    Depends on #21.

    opened by krfricke 14
  • Placement Group Support

    Depends on https://github.com/ray-project/xgboost_ray/pull/29.

    Adds placement group for xgboost training workers.

    Logic is as follows:

    • If not using Tune:
      • If elastic_training == True
        • Don't use placement groups (Requires placement group resizing https://github.com/ray-project/ray/issues/12562)
      • If elastic_training == False
        • Use placement group with SPREAD strategy
    • If using Tune:
      • If elastic_training == True
        • Raise Error. Tune does not support elastic training.
      • If elastic_training == False
        • Use placement group with PACK strategy
    opened by amogkam 12
  • HIGGS example not working

    Hi,

    I'm trying to reproduce the example for the HIGGS dataset, but when I start the training, only 1 out of 4 actors gets assigned any data or training jobs, and the one actor that does get to do some work eventually dies as well. I'm using Python 3.7.10 on macOS, Ray 1.2.0 and xgboost_ray 0.0.4.

    Your help would be appreciated. Thanks!

    2021-04-13 18:01:45,108	INFO main.py:791 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
    2021-04-13 18:01:51,423	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71384, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
        self._local_n = len(param["data"])
    TypeError: object of type 'NoneType' has no len()
    2021-04-13 18:01:51,424	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71377, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
        self._local_n = len(param["data"])
    TypeError: object of type 'NoneType' has no len()
    2021-04-13 18:01:51,424	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71379, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
        self._local_n = len(param["data"])
    TypeError: object of type 'NoneType' has no len()
    2021-04-13 18:01:51,425	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71384, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
        self._local_n = len(param["data"])
    TypeError: object of type 'NoneType' has no len()
    2021-04-13 18:01:51,425	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71377, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
        self._local_n = len(param["data"])
    TypeError: object of type 'NoneType' has no len()
    2021-04-13 18:01:51,426	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71379, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
        self._local_n = len(param["data"])
    TypeError: object of type 'NoneType' has no len()
    2021-04-13 18:02:15,172	INFO main.py:823 -- Waiting until actors are ready (30 seconds passed).
    2021-04-13 18:02:45,187	INFO main.py:823 -- Waiting until actors are ready (60 seconds passed).
    2021-04-13 18:03:10,454	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71383, ip=192.168.0.164)
      File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'label'
    
    The above exception was the direct cause of the following exception:
    
    ray::RayXGBoostActor.load_data() (pid=71383, ip=192.168.0.164)
      File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 423, in load_data
        param = data.get_data(self.rank, self.num_actors)
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 707, in get_data
        self.load_data(num_actors=num_actors, rank=rank)
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 694, in load_data
        self.num_actors, self.sharding, rank=rank)
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 467, in load_data
        local_df, data_source=data_source)
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 199, in _split_dataframe
        label, exclude = data_source.get_column(local_data, self.label)
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/data_sources/data_source.py", line 106, in get_column
        return data[column], column
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
        indexer = self.columns.get_loc(key)
      File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
        raise KeyError(key) from err
    KeyError: 'label'
    2021-04-13 18:03:15,224	INFO main.py:823 -- Waiting until actors are ready (90 seconds passed).
    2021-04-13 18:03:45,276	INFO main.py:823 -- Waiting until actors are ready (120 seconds passed).
    2021-04-13 18:04:15,352	INFO main.py:823 -- Waiting until actors are ready (150 seconds passed).
    2021-04-13 18:04:22,525	INFO main.py:834 -- [RayXGBoost] Starting XGBoost training.
    2021-04-13 18:05:35,648	INFO elastic.py:155 -- Actor status: 4 alive, 0 dead (4 total)
    2021-04-13 18:05:40,344	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.train() (pid=71383, ip=192.168.0.164)
      File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'label'
    
    opened by vecorro 8
  • Limit number of parallel GHA runs, print env info in tests

    Unfortunately GHA runs are sometimes aborted due to memory limits. Reducing the number of concurrent trials increases our test time to 2x, but at least these errors stop coming up.

    opened by krfricke 8
  • single-node xgboost-ray is 3.5x slower than single-node vanilla xgboost

    Hi guys, I am trying to improve my xgboost training with xgboost-ray due to its amazing features, like multi-node training, distributed data loading etc. However, my experiment shows that training with xgboost is 3.5x faster than xgboost-ray on a single node.

    My configuration is as follows:

    XGBoost-Ray version: 0.1.10, XGBoost version: 1.6.1, Ray: 1.13.0

    Training data size: 18 GB, with 51,995,904 rows and 174 columns. Cluster machine: 160 cores and 500 RAM.

    Ray config: 1 actor and 154 cpus_per_actor. Ray is initialized with ray.init(); the Ray cluster is set up manually and the nodes are connected via Docker swarm (which might not be relevant for this issue).

    Please let me know if there is more info needed from your side. Thanks a lot!

    opened by Faaany 7
  • xgboost_ray on K8 breaks when running the HIGGS dataset example

    Hi,

    My code runs properly when Ray runs on a single machine, but when I try to run it on Ray 2.0 deployed on K8, the execution breaks at the point indicated by the error message. The Ray cluster seems to be working well, as other example scripts run properly on it. From the dashboard, it looks like the head never tries to load the dataset. I'd appreciate your guidance to find and fix the problem.

    import ray
    import ray.util
    import os
    import time
    from xgboost_ray import train, RayDMatrix, RayParams
    import pandas as pd


    ray.util.connect("127.0.0.1:10001") # replace with the appropriate IP and port numbers

    {'num_clients': 1,
     'python_version': '3.7.7',
     'ray_version': '2.0.0.dev0',
     'ray_commit': 'b0a813baad0a4a187f0cc25e11498843aff899c6',
     'protocol_version': '2021-04-09'}

    colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

    df = pd.read_csv("../academy/academy-main/HIGGS.csv", names=colnames)

    dtrain = RayDMatrix(df, label="label", distributed=False)

    config = {
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    }

    evals_result = {}

    #ray.init(object_store_memory=34359738368)

    start = time.time()

    bst = train(
        config,
        dtrain,
        evals_result=evals_result,
        ray_params=RayParams(max_actor_restarts=1),
        num_boost_round=100,
        evals=[(dtrain, "train")])

    taken = time.time() - start

    print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

    bst.save_model("higgs.xgb")

    print("Final training error: {:.4f}".format(
        evals_result["train"]["error"][-1]))
    

    The node with node id: 3d09a2cf73e5497816818576f38466d9989b6d2f166896cc9ceccc3f and ip: 10.1.0.71 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.


    _InactiveRpcError Traceback (most recent call last) ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in _call_schedule_for_task(self, task) 307 try: --> 308 ticket = self.server.Schedule(task, metadata=self.metadata) 309 except grpc.RpcError as e:

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/grpc/_channel.py in call(self, request, timeout, metadata, credentials, wait_for_ready, compression) 824 state, call, = self._blocking(request, timeout, metadata, credentials, --> 825 wait_for_ready, compression) 826 return _end_unary_response_blocking(state, call, False, None)

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/grpc/_channel.py in _blocking(self, request, timeout, metadata, credentials, wait_for_ready, compression) 803 if state is None: --> 804 raise rendezvous # pylint: disable-msg=raising-bad-type 805 else:

    _InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.INTERNAL details = "Exception serializing request!" debug_error_string = "None"

    During handling of the above exception, another exception occurred:

    Error Traceback (most recent call last) in 8 ray_params=RayParams(max_actor_restarts=1), 9 num_boost_round=100, ---> 10 evals=[(dtrain, "train")]) 11 taken = time.time() - start 12 print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in train(params, dtrain, num_boost_round, evals, evals_result, additional_results, ray_params, _remote, *args, **kwargs) 1045 ray_params=ray_params, 1046 _remote=False, -> 1047 **kwargs, 1048 )) 1049 if isinstance(evals_result, dict):

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/remote_function.py in _remote_proxy(*args, **kwargs) 102 @wraps(function) 103 def _remote_proxy(*args, **kwargs): --> 104 return self._remote(args=args, kwargs=kwargs) 105 106 self.remote = _remote_proxy

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/remote_function.py in _remote(self, args, kwargs, num_returns, num_cpus, num_gpus, memory, object_store_memory, accelerator_type, resources, max_retries, placement_group, placement_group_bundle_index, placement_group_capture_child_tasks, runtime_env, override_environment_variables, name) 207 runtime_env=runtime_env, 208 override_environment_variables=override_environment_variables, --> 209 name=name) 210 211 worker = ray.worker.global_worker

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in client_mode_convert_function(func_cls, in_args, in_kwargs, **kwargs) 85 setattr(func_cls, RAY_CLIENT_MODE_ATTR, key) 86 client_func = ray._get_converted(key) ---> 87 return client_func._remote(in_args, in_kwargs, **kwargs) 88 89

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/common.py in _remote(self, args, kwargs, **option_args) 107 if kwargs is None: 108 kwargs = {} --> 109 return self.options(**option_args).remote(*args, **kwargs) 110 111 def repr(self):

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/common.py in remote(self, *args, **kwargs) 284 285 def remote(self, *args, **kwargs): --> 286 return return_refs(ray.call_remote(self, *args, **kwargs)) 287 288 def getattr(self, key):

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/api.py in call_remote(self, instance, *args, **kwargs) 94 kwargs: opaque keyword arguments 95 """ ---> 96 return self.worker.call_remote(instance, *args, **kwargs) 97 98 def call_release(self, id: bytes) -> None:

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in call_remote(self, instance, *args, **kwargs) 299 for k, v in kwargs.items(): 300 task.kwargs[k].CopyFrom(convert_to_arg(v, self._client_id)) --> 301 return self._call_schedule_for_task(task) 302 303 def _call_schedule_for_task(

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in _call_schedule_for_task(self, task) 308 ticket = self.server.Schedule(task, metadata=self.metadata) 309 except grpc.RpcError as e: --> 310 raise decode_exception(e.details()) 311 312 if not ticket.valid:

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in decode_exception(data) 520 521 def decode_exception(data) -> Exception: --> 522 data = base64.standard_b64decode(data) 523 return loads_from_server(data)

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/base64.py in standard_b64decode(s) 103 are discarded prior to the padding check. 104 """ --> 105 return b64decode(s) 106 107

    ~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/base64.py in b64decode(s, altchars, validate) 85 if validate and not re.fullmatch(b'[A-Za-z0-9+/]*={0,2}', s): 86 raise binascii.Error('Non-base64 digit found') ---> 87 return binascii.a2b_base64(s) 88 89

    Error: Incorrect padding

    The node with node id: e52c10ddca43f8a398d222e6dec353457bb48faaaf3ff7ae72303e3e and ip: 10.1.0.69 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.


    opened by vecorro 7
  • Add sort dataframe logic on qid

    If qid is provided, xgb.DMatrix requires data to be sorted by qid - see data.cc: https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/matrix.py#L351

    draft to add auto sorting of dataframe if qid is given and dataframe is not already sorted by qid

    opened by atomic 6
  • group parameter not being used in RayXGBRanker

    Hi!

    I have been trying to use RayXGBRanker, but it seems like the group parameter is not being considered for building the model.

    Attaching the following code to reproduce the same:

    Data preparation

    from xgboost_ray import RayXGBRanker, RayParams
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    import numpy as np
    
    seed = 42
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
       X, y, train_size=0.1, random_state=42
    )
    # This is done just to get random relevant values for ranker
    y_train = np.random.rand(len(y_train))
    
    

    1st inference

    group=np.array([[56]])
    clf = RayXGBRanker(
       n_jobs=2,
       random_state=seed
    )
    clf.fit(X_train, y_train, group=group, ray_dmatrix_params={})
    pred_ray = clf.predict(X_test[:10])
    print(pred_ray)
    

    1st inference result: [-1.4302582 0.80665004 -1.656052 2.8729537 0.23208997 2.1761725 -1.1084297 -2.1263309 5.631699 -4.417554 ]

    2nd inference

    group=np.array([[20, 10, 26]])
    clf = RayXGBRanker(
       n_jobs=2,
       random_state=seed
    )
    clf.fit(X_train, y_train, group=group, ray_dmatrix_params={})
    pred_ray = clf.predict(X_test[:10])
    print(pred_ray)
    

    2nd inference result: [-1.4302582 0.80665004 -1.656052 2.8729537 0.23208997 2.1761725 -1.1084297 -2.1263309 5.631699 -4.417554 ]

    As we can see, both the 1st and 2nd inference runs give the same predictions even though we have changed the group parameter. Could you please let me know of any solution to this problem?

    Versions
    xgboost==1.5.2
    xgboost-ray==0.1.6
    ray==1.9.2
    

    Thanks! Rama

    bug 
    opened by ramab1988 6
  • Is there's a way to set remote ray cluster address?

    I am new to the Ray community and I plan to run an XGBoost example on a remote cluster but cannot find documentation.

    Seems following methods use ray.init() directly without remote options.

    https://github.com/ray-project/xgboost_ray/blob/256fb55c2956192a303e9f8dc6a430bedee0ba4a/xgboost_ray/main.py#L1048-L1049

    Should I manually do ray.init(addresses={remove_addr}) or ray.util.connect({remote}) in the code snippet? In this way, xgboost_ray package will think ray.is_initialized() is true and skip initialization?

    opened by Jeffwan 6
  • "grpc_message":"Received message larger than max

    _InactiveRpcError Traceback (most recent call last) in 47 num_samples=1, 48 scheduler=ASHAScheduler(max_t=200), ---> 49 resources_per_trial=ray_params.get_tune_resources()) 50 print("Best hyperparameters", analysis.best_config) ..... /usr/local/lib/python3.6/dist-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline) 847 return state.response 848 else: --> 849 raise _InactiveRpcError(state) 850 851

    _InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.RESOURCE_EXHAUSTED details = "Received message larger than max (210771146 vs. 104857600)" debug_error_string = "{"created":"@1646188124.309444695","description":"Error received from peer ipv4:10.207.183.32:40455","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Received message larger than max (210771146 vs. 104857600)","grpc_status":8}"

    opened by cgy-dayup 5
  • Fix cluster resource detection in multi-node clusters for number_actors < number_nodes

    When starting 4 distributed actors on an 8 node cluster with 16 CPUs each (144 CPUs), we autodetect 36 CPUs and schedule a placement group with 4 bundles of 36 CPUs each.

    Instead we should be smart about this and only assign 16 CPUs per worker (and maybe throw a warning that the cluster will be underutilized).

    cc @mmui would you be interested in taking this?

    opened by krfricke 5
  • rabit timeout

    xgboost.core.XGBoostError: [23:55:55] /home/conda/feedstock_root/build_artifacts/xgboost-split_1667849645640/work/rabit/include/rabit/internal/socket.h:170: Poll timeout

    What params should I pass to avoid this timeout? I'm training a huge dataset.

    opened by Biu-G 1
  • How to use bst.eval_set() and bst.update() with xgboost_ray

    I am trying to adopt xgboost_ray for an xgboost project. Currently I am running into a problem. The original code does some fine-grained control of the training process, for every iteration:

           eval_results = self.bst.eval_set(
                evals=[(self.dmat_train, "train"), (self.dmat_valid, "valid")], iteration=self.bst.num_boosted_rounds() - 1
            )
            self.log_info(fl_ctx, eval_results)
            auc = float(eval_results.split("\t")[2].split(":")[1])
            for i in range(self.trees_per_round):
                self.bst.update(self.dmat_train, self.bst.num_boosted_rounds())
    
            # extract newly added self.trees_per_round using xgboost slicing api
            bst = self.bst[self.bst.num_boosted_rounds() - self.trees_per_round : self.bst.num_boosted_rounds()]
    

    code source: https://github.com/NVIDIA/NVFlare/blob/dev/nvflare/app_opt/xgboost/tree_based/executor.py#L153-L174

    Note: I already get bst object from xgboost_ray.train()

    There are two blockers: bst.eval_set() and bst.update(). Since bst is from the xgboost library, it won't accept a RayDMatrix, which throws an error here.

      File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 1980, in eval_set
        raise TypeError(f"expected DMatrix, got {type(d[0]).__name__}")
    TypeError: expected DMatrix, got RayDMatrix
    

    I looked at the documentation and cannot find a replacement like predict. How can I make this work?

    /cc @Yard1

    opened by Jeffwan 3
  • Compatibility with Ray and XGBoost

    I am planning to use xgboost_ray with a specific XGBoost version (1.7.1). How can I know whether xgboost_ray is compatible or not, similar to Ray itself? Could a maintainer provide a compatibility matrix?

    opened by Jeffwan 1
  • Planned support of multilabel?

    Following this quick tutorial, I was hoping to use XGBoost for multilabel classification by passing label_column as a list within XGBoostTrainer. Is there any plan to support this functionality? https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html

    import ray
    import pandas as pd
    import xgboost as xgb
    
    from ray.train.xgboost import XGBoostTrainer, XGBoostPredictor
    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import train_test_split
    
    
    num_classes = 30
    X, y = make_multilabel_classification(
        n_classes=num_classes, random_state=0, n_samples=1000
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    X_train = pd.DataFrame(X_train, columns=[f"x{i}" for i in range(X_train.shape[1])])
    y_train = pd.DataFrame(y_train, columns=[f"y{i}" for i in range(y_train.shape[1])])
    train_ds = ray.data.from_pandas(pd.concat([X_train, y_train], axis=1))
    
    trainer = XGBoostTrainer(
        # label_column="y1",  # works
        label_column=["y1", "y2"],  # not supported
        params={
            "tree_method": "hist",
            "max_depth": 15,
            "n_estimators": 50,
        },
        num_boost_round=10,
        datasets={"train": train_ds},
    )
    result = trainer.fit()
    

    The trace is:

    Current time: 2022-11-08 20:11:24 (running for 00:00:02.47)
    Memory usage on this node: 21.7/62.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 2.0/16 CPUs, 0/0 GPUs, 0.0/25.74 GiB heap, 0.0/12.87 GiB objects
    Result logdir: /home/gcounihan/ray_results/XGBoostTrainer_2022-11-08_20-11-21
    Number of trials: 1/1 (1 RUNNING)
    +----------------------------+----------+---------------------+
    | Trial name                 | status   | loc                 |
    |----------------------------+----------+---------------------|
    | XGBoostTrainer_8362b_00000 | RUNNING  | 10.50.101.142:64403 |
    +----------------------------+----------+---------------------+
    
    
    (XGBoostTrainer pid=64403) /home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/main.py:464: UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed!
    (XGBoostTrainer pid=64403)   warnings.warn(
    (XGBoostTrainer pid=64403) 2022-11-08 20:11:24,790      ERROR function_trainable.py:298 -- Runner Thread raised error.
    (XGBoostTrainer pid=64403) Traceback (most recent call last):
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 289, in run
    (XGBoostTrainer pid=64403)     self._entrypoint()
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 362, in entrypoint
    (XGBoostTrainer pid=64403)     return self._trainable_func(
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
    (XGBoostTrainer pid=64403)     return method(self, *_args, **_kwargs)
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/base_trainer.py", line 460, in _trainable_func
    (XGBoostTrainer pid=64403)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 684, in _trainable_func
    (XGBoostTrainer pid=64403)     output = fn()
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/base_trainer.py", line 375, in train_func
    (XGBoostTrainer pid=64403)     trainer.training_loop()
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/gbdt_trainer.py", line 246, in training_loop
    (XGBoostTrainer pid=64403)     model = self._train(
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/xgboost/xgboost_trainer.py", line 77, in _train
    (XGBoostTrainer pid=64403)     return xgboost_ray.train(**kwargs)
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/main.py", line 1482, in train
    (XGBoostTrainer pid=64403)     bst, train_evals_result, train_additional_results = _train(
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/main.py", line 1041, in _train
    (XGBoostTrainer pid=64403)     dtrain.assert_enough_shards_for_actors(num_actors=ray_params.num_actors)
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 788, in assert_enough_shards_for_actors
    (XGBoostTrainer pid=64403)     self.loader.assert_enough_shards_for_actors(num_actors=num_actors)
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 486, in assert_enough_shards_for_actors
    (XGBoostTrainer pid=64403)     data_source = self.get_data_source()
    (XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 448, in get_data_source
    (XGBoostTrainer pid=64403)     raise ValueError(
    (XGBoostTrainer pid=64403) ValueError: Invalid `label` value for distributed datasets: ['y1', 'y2']. Only strings are supported. 
    (XGBoostTrainer pid=64403) FIX THIS by passing a string indicating the label column of the dataset as the `label` argument.
    
    enhancement 
    opened by xbno 1
  • [RFC] Fallback for communication actor scheduling

    Signed-off-by: Antoni Baum [email protected]

    If we cannot schedule communication actors on the driver node, relax the requirement to include any node.

    opened by Yard1 0
  • How can I use xgboost_ray to train a model and use origin xgboost for predict?

    When I train a model the dataset is very large, but when predicting the datasets are small and I predict more than 1000 times. However, xgboost_ray is slower than original xgboost at prediction. How can I solve this?

    opened by WZFish 3
Releases(v0.1.13)
  • v0.1.13(Dec 15, 2022)

    What's Changed

    • Bump version to 0.1.13 by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/243
    • Add placement_options to RayParams by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/245
    • Switch from 3.6 to 3.9 in tests by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/246
    • Replace boston dataset with california housing by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/251
    • Set Tune trainable resources to 0 by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/252
    • Add special case in _get_tune_resources by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/250
    • Always detect Ray Dataset as distributed by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/253

    Full Changelog: https://github.com/ray-project/xgboost_ray/compare/v0.1.12...v0.1.13

    Source code(tar.gz)
    Source code(zip)
    xgboost_ray-0.1.13-py3-none-any.whl(134.50 KB)
  • v0.1.12(Nov 1, 2022)

  • v0.1.11(Oct 6, 2022)

    What's Changed

    • Switch to use packaging.Version from distutils LooseVersion by @peytondmurray in https://github.com/ray-project/xgboost_ray/pull/232
    • add license metadata tag to package setup by @jimthompson5802 in https://github.com/ray-project/xgboost_ray/pull/234
    • Add enable_categorical, better detection for being in a PG by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/235
    • Fix ranking failing if the qid column is unsorted by @atomic in https://github.com/ray-project/xgboost_ray/pull/239

    New Contributors

    • @peytondmurray made their first contribution in https://github.com/ray-project/xgboost_ray/pull/232
    • @jimthompson5802 made their first contribution in https://github.com/ray-project/xgboost_ray/pull/234
    • @atomic made their first contribution in https://github.com/ray-project/xgboost_ray/pull/239

    Full Changelog: https://github.com/ray-project/xgboost_ray/compare/v0.1.10...v0.1.11

    Source code(tar.gz)
    Source code(zip)
  • v0.1.10(Aug 15, 2022)

  • v0.1.9(May 23, 2022)

    What's Changed

    • [release] Bump to 0.1.9 for future development by @matthewdeng in https://github.com/ray-project/xgboost_ray/pull/199
    • Update README.md by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/200
    • Mitigate finishing too quickly by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/203
    • Adjust readme for Ray Docs by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/204
    • Fail quickly on Windows by @amogkam in https://github.com/ray-project/xgboost_ray/pull/209
    • Fix test_num_parallel_tree for cutting edge CI by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/211
    • Fix CI failing with xgboost==1.6.0 by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/212
    • Adjust CI timeouts by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/213
    • Fix sklearn docs throwing an error during build by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/214
    • Fix wrap_evaluation_matrices with xgb master by @Yard1 in https://github.com/ray-project/xgboost_ray/pull/215
    • Remove code for legacy Ray versions by @krfricke in https://github.com/ray-project/xgboost_ray/pull/217
    • Remove support for deprecated ray.util.data.MLDataset by @amogkam in https://github.com/ray-project/xgboost_ray/pull/218

    Full Changelog: https://github.com/ray-project/xgboost_ray/compare/v0.1.8...v0.1.9

    Source code(tar.gz)
    Source code(zip)
    xgboost_ray-0.1.10-py3-none-any.whl(225.23 KB)
  • v0.1.8(Feb 25, 2022)

  • v0.1.7(Feb 8, 2022)

  • v0.1.6(Jan 3, 2022)

    • Preserve column ordering in DMatrices (#170)
    • Dependency management (#171, #173, #174, #175)
    • CI improvements (#169)
    • Documentation/example fixes (#178, #180)
    • Validate kwargs correctly (#172)
    • Main: Fix compatibility with XGBoost >= 1.5.0 (#179)
    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(Oct 29, 2021)

  • v0.1.4(Oct 13, 2021)

  • v0.1.3(Sep 15, 2021)

  • v0.1.2(Jul 12, 2021)

    • Improvements in the SKlearn API (#125, #126)
    • Use user warnings instead of logger warnings (#128)
    • Testing fixes (#129)
    • Fix backwards compatibility with placement groups (#130)
    • Improve batching algorithm (#132)
    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(Jun 19, 2021)

    • Fix issues with multi class prediction (#110, #123)
    • Fix bugs and CI (#113, #114, #115, #118, #120)
    • Improve compatibility with external libraries (#108, #120, #121, #122)
    • Add Scikit-learn interface (#111)

    Special shout out to @Yard1 for his contributions! Thanks!

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(May 20, 2021)

    • Improve fault tolerance (#94)
    • Reduce cross-node data transfer (#100)
    • Improve Ray Tune integration (#102, #103)
    • Improve error reporting for failed actors (#104)
    • Add support for Dask dataframes (Dask on Ray) (#99)
    • Improve documentation (#105)
    Source code(tar.gz)
    Source code(zip)
  • v0.0.5(Apr 26, 2021)

    • Added distributed callbacks called before/after train/data loading (#71)
    • Improved fault tolerance testing and benchmarking (#72)
    • Placement group fixes (#74)
    • Improved warnings/errors when using incompatible APIs (#76, #82, #84)
    • Enhanced compatibility with XGBoost 0.90 (legacy) and XGBoost 1.4 (#85, #90)
    • Better testing (#72, #87)
    • Minor bug/API fixes (#78, #83, #89)
    Source code(tar.gz)
    Source code(zip)
  • v0.0.4(Mar 18, 2021)

    • Add GCS support (Petastorm) (#63)
    • Enforce labels are set for train/evaluation data (#64)
    • Re-factor data loading structure, making it easier to add or change data loading backends (#66)
    • Distributed and locality-aware data loading for Modin dataframes (#67)
    • Documentation cleanup (#68)
    • Fix RayDeviceQuantileDMatrix usage (#69)
    Source code(tar.gz)
    Source code(zip)
  • v0.0.3(Feb 22, 2021)

    • Added Petastorm integration (#46)
    • Improved Tune integration (#54, #55)
    • Fixed and improved tests (#40, #42, #47, #50, #52, #56, #60)
    • Compatibility with Ray client (#57)
    • Improved fault tolerance handling (#59)
    Source code(tar.gz)
    Source code(zip)
  • v0.0.2(Jan 12, 2021)

  • v0.0.1(Jan 5, 2021)

    Initial version of xgboost on ray, featuring:

    • Distributed training and predict support, tested on clusters of up to 600 nodes
    • Fault tolerance: Restarting the whole run from latest checkpoint if a node fails
    • Fault tolerance: Automatic scaling down/up when nodes die/become available again
    • Data loading from various sources (CSV, Parquet, Modin dataframes, Ray MLDataset, pandas, numpy)
    • Seamless integration with Ray Tune
    • Initial Ray placement group support
    Source code(tar.gz)
    Source code(zip)