An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Overview



Ray provides a simple, universal API for building distributed applications.

Ray is packaged with the following libraries for accelerating machine learning workloads:

  • Tune: Scalable Hyperparameter Tuning
  • RLlib: Scalable Reinforcement Learning
  • RaySGD: Distributed Training Wrappers
  • Ray Serve: Scalable and Programmable Serving

There are also many community integrations with Ray, including Dask, MARS, Modin, Horovod, Hugging Face, Scikit-learn, and others. Check out the full list of Ray distributed libraries here.

Install Ray with: pip install ray. For nightly wheels, see the Installation page.

Quick Start

Execute Python functions in parallel.

import ray
ray.init()

# Define a function to be executed remotely, in parallel.
@ray.remote
def f(x):
    return x * x

# Launch four tasks in parallel; f.remote() returns futures immediately.
futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
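
Object refs returned by remote calls can also be passed directly into other remote functions; Ray resolves them to their values before running the downstream task. A minimal sketch building on f above (add is just an illustrative helper):

@ray.remote
def add(a, b):
    return a + b

# Pass the unresolved object ref from f.remote() straight into another task.
ref = f.remote(2)
print(ray.get(add.remote(ref, 1)))  # 5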

To use Ray's actor model:

import ray
ray.init()

# Define a stateful actor class; each instance runs in its own worker process.
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

# Create four actors, increment each one once, then read back the counts.
counters = [Counter.remote() for _ in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures))  # [1, 1, 1, 1]

Ray programs can run on a single machine, and can also seamlessly scale to large clusters. To execute the above Ray script in the cloud, just download this configuration file, and run:

ray submit [CLUSTER.YAML] example.py --start

Read more about launching clusters.
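
The same script can also be pointed at an existing cluster without code changes; a minimal sketch, assuming a Ray cluster is already running and reachable from the machine executing the driver:

import ray

# Connect to the running cluster instead of starting a new local Ray instance.
ray.init(address="auto")

@ray.remote
def f(x):
    return x * x

print(ray.get([f.remote(i) for i in range(4)]))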

Tune Quick Start


Tune is a library for hyperparameter tuning at any scale.

To run this example, you will need to install the following:

$ pip install "ray[tune]"

This example runs a parallel grid search to optimize an example objective function.

from ray import tune


def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
        tune.report(mean_loss=intermediate_score)


analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3])
    })

print("Best config: ", analysis.get_best_config(metric="mean_loss", mode="min"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df
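
Since analysis.results_df is a regular pandas DataFrame, standard pandas operations can be used to inspect the trials; a minimal sketch, assuming the reported metric shows up as a mean_loss column:

# Show the trials with the lowest reported loss first.
print(df.sort_values("mean_loss").head())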

If TensorBoard is installed, automatically visualize all trial results:

tensorboard --logdir ~/ray_results

RLlib Quick Start


RLlib is an open-source library for reinforcement learning built on top of Ray that offers both high scalability and a unified API for a variety of applications.

pip install tensorflow  # or tensorflow-gpu
pip install "ray[rllib]"

import gym
from gym.spaces import Discrete, Box
from ray import tune

class SimpleCorridor(gym.Env):
    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(0.0, self.end_pos, shape=(1, ))

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        return [self.cur_pos], 1 if done else 0, done, {}

tune.run(
    "PPO",
    config={
        "env": SimpleCorridor,
        "num_workers": 4,
        "env_config": {"corridor_length": 5}})

Ray Serve Quick Start

Ray Serve is a scalable model-serving library built on Ray. It is:

  • Framework Agnostic: Use the same toolkit to serve everything from deep learning models built with frameworks like PyTorch or Tensorflow & Keras to Scikit-Learn models or arbitrary business logic.
  • Python First: Configure your model serving with pure Python code - no more YAMLs or JSON configs.
  • Performance Oriented: Turn on batching, pipelining, and GPU acceleration to increase the throughput of your model.
  • Composition Native: Allows you to create "model pipelines" by composing multiple models together to drive a single prediction.
  • Horizontally Scalable: Serve can linearly scale as you add more machines. Enable your ML-powered service to handle growing traffic.

To run this example, you will need to install the following:

$ pip install scikit-learn
$ pip install "ray[serve]"

This example serves a scikit-learn gradient boosting classifier.

from ray import serve
import pickle
import requests
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Train model
iris_dataset = load_iris()
model = GradientBoostingClassifier()
model.fit(iris_dataset["data"], iris_dataset["target"])

# Define the Ray Serve model.
class BoostingModel:
    def __init__(self):
        self.model = model
        self.label_list = iris_dataset["target_names"].tolist()

    def __call__(self, flask_request):
        payload = flask_request.json["vector"]
        print("Worker: received flask request with data", payload)

        prediction = self.model.predict([payload])[0]
        human_name = self.label_list[prediction]
        return {"result": human_name}


# Deploy model
client = serve.start()
client.create_backend("iris:v1", BoostingModel)
client.create_endpoint("iris_classifier", backend="iris:v1", route="/iris")

# Query it!
sample_request_input = {"vector": [1.2, 1.0, 1.1, 0.9]}
response = requests.get("http://localhost:8000/iris", json=sample_request_input)
print(response.text)
# Result:
# {
#  "result": "versicolor"
# }

More Information


Getting Involved

Issues
  • [WIP] Implement Ape-X distributed prioritization

    [WIP] Implement Ape-X distributed prioritization

    What do these changes do?

    This implements https://openreview.net/forum?id=H1Dy---0Z for testing. The main ideas from Ape-X are:

    • Worker-side prioritization: rather than taking new samples at max priority, prioritize them in the workers. This scales experience gathering.
    • Per-worker exploration: rather than choosing a single exploration schedule, assign each worker a different exploration value ranging from 0.4 to ~0.0 (see the sketch below).

    WIP: evaluation on pong. This implementation probably doesn't scale to very high sample throughputs, but we should probably be able to see some gains on a couple dozen workers.
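
    A minimal sketch of the per-worker exploration assignment (the ε_i = 0.4^(1 + 7·i/(N−1)) schedule described in the Ape-X paper; illustration only, not code from this PR):

    NUM_WORKERS = 8
    BASE_EPS, ALPHA = 0.4, 7

    # Give each worker a fixed epsilon, spread from 0.4 (worker 0) down toward ~0.0.
    epsilons = [
        BASE_EPS ** (1 + i / (NUM_WORKERS - 1) * ALPHA) for i in range(NUM_WORKERS)
    ]
    print(epsilons)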

    opened by ericl 199
  • [ray-core] Initial addition of performance integration testing files

    [ray-core] Initial addition of performance integration testing files

    • A Dockerfile specific for this test
      • This is needed because we eventually will upload these numbers to S3
    • Addition of a simple performance test measuring the time it takes to run a variable number of tasks with a variable number of CPUs
    • A couple of bash scripts to set up the Docker environment and run the tests

    What do these changes do?

    Related issue number

    opened by devin-petersohn 134
  • Make Bazel the default build system

    Make Bazel the default build system

    What do these changes do?

    This switches the build system from CMake to Bazel for developers.

    The wheels, valgrind tests and Jenkins are currently still run with CMake and will be switched in follow up PRs.

    Related issue number

    opened by pcmoritz 130
  • Streaming data transfer and python integration

    Streaming data transfer and python integration

    Why are these changes needed?

    This is the minimal implementation of streaming data transfer mentioned in doc, consisting of three parts:

    • writer/reader, implemented with C++ to transfer data between streaming workers
    • streaming queue, the transport layer based on Ray’s direct actor call and C++ Core Worker APIs
    • adaption layer for python, implemented with cython to adapt writer/reader for python

    To integrate python with the streaming C++ data transfer, the following changes are made:

    • We moved the python code from python/ray/experimental/streaming/ to streaming/python/ray/streaming, and soft-linked it to python/ray/streaming, just like rllib.
    • We removed batched_queue and added a cython-based streaming queue implementation.
    • We moved execution graph related logic from Environment into ExecutionGraph.
    • We refactored operator_instance into a processor, and added a JobWorker actor to execute processors.

    The Java part will be submitted in follow-up PRs.

    Related issue number

    #6184

    Checks

    • [x] I've run scripts/format.sh to lint the changes in this PR.
    • [ ] I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
    • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
    opened by chaokunyang 125
  • [core worker] Python core worker object interface

    [core worker] Python core worker object interface

    What do these changes do?

    This change adds a new Cython extension for the core worker and calls into it for all object-store-related operations.

    To support this, it also adds two new methods to the core worker ObjectInterface that resemble the plasma store interface (Create and Seal). These allow us to directly write from Python memory into the object store via Arrow's SerializedPyObject without having to compile the Ray and Arrow Cython code together or add a dependency to Arrow in the core worker.

    Related issue number

    Linter

    • [x] I've run scripts/format.sh to lint the changes in this PR.
    opened by edoakes 117
  • Discussion on batch Garbage Collection.

    Discussion on batch Garbage Collection.

    Hi @robertnishihara @pcmoritz , we are planning to add a batch Garbage Collection to Ray.

    We have a concept called batchId (int64_t) that is used to drive garbage collection. For example, one job will use this batchId to generate all of its objectIds and taskIds, and all of these objectIds and taskIds will be stored in a Garbage Collection Table in GCS under that batchId. When the job is finished, we can simply pass the batchId to the garbage collector, which will look up the Garbage Collection Table in GCS and garbage-collect all of the related tasks and objects.

    In the current id.h implementation, the lowest 32 bits of an ObjectId are used for the Object Index. We can use the higher 64 bits next to the Object Index as the batchId and add a new GC Table in GCS.

    This GC mechanism will help release memory resources in GCS and plasma. What do you think of this code change?
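
    A minimal sketch of the ID layout proposed above (hypothetical helper functions; only the 32-bit object index and the adjacent 64-bit batchId field are taken from the description, not from Ray's actual id.h):

    OBJECT_INDEX_BITS = 32
    BATCH_ID_BITS = 64

    def make_object_id(batch_id, object_index):
        # Pack the batchId into the bits directly above the 32-bit object index.
        return (batch_id << OBJECT_INDEX_BITS) | object_index

    def batch_id_of(object_id):
        # Recover the batchId so the GC table in GCS can be keyed by it.
        return (object_id >> OBJECT_INDEX_BITS) & ((1 << BATCH_ID_BITS) - 1)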

    opened by guoyuhong 112
  • [tune] Cluster Fault Tolerance

    [tune] Cluster Fault Tolerance

    What do these changes do?

    A redo of #3165 with extraneous cleanup changes removed.

    This currently does not use the same restoring code-path as #3238, but this can change later when component FT is implemented... (i.e., this doesn't notify components that some trials go RUNNING -> PENDING).

    This adds the following functionality:

    • pickleable trials and TrialRunner.
    • checkpointing/restoring functionality for Trial runner
    • user endpoints for experiment checkpointing

    Example:

    
    In [6]: import time
       ...: import ray
       ...: from ray import tune
       ...:
       ...: ray.init()
       ...:
       ...: kwargs = dict(
       ...:     run="__fake",
       ...:     stop=dict(training_iteration=5),
       ...:     checkpoint_freq=1,
       ...:     max_failures=1)
       ...:
       ...: # This will save the experiment state to disk on each step
       ...: tune.run_experiments(
       ...:     dict(experiment1=kwargs),
       ...:     raise_on_failed_trial=False)
       ...:
    

    TODO:

    • [x] User endpoints implemented.
    • [x] NODE FT: Add test for scheduler notification when nodes die and trials running -> pending

    NOTE: this should be a lot easier to review after #3414 is merged.

    opened by richardliaw 110
  • GCS-Based actor management implementation

    GCS-Based actor management implementation

    Why are these changes needed?

    Please see the <Design Document> first.

    This PR implements the creation and reconstruction of actors based on gcs server.

    Changes on gcs server side

    Several important classes are added: GcsActor, GcsActorManager, GcsActorScheduler.

    • GcsActor: An abstraction of an actor on the GcsServer side, which wraps the ActorTableData and provides simple interfaces to access the fields inside ActorTableData.
    • GcsActorManager: It is responsible for managing the lifecycle of all registered actors.
    • GcsActorScheduler: It is responsible for scheduling actors registered to GcsActorManager. It also contains an inner class called GcsLeasedWorker, which is an abstraction of a remote leased worker in the raylet.

    In addition, this PR also makes some changes to GcsNodeManager, which is responsible for monitoring and managing nodes.

    Changes on raylet side

    • In the old actor management scheme, the raylet is responsible for updating ActorTableData, while in the new GCS-based actor management scheme, we expect GCS to be responsible for updating all ActorTableData. So, you will see that all logic for updating ActorTableData is removed.
    • Besides, the raylet should cache the relationship between actors and leased workers, so that the raylet can quickly reply to the GCS server without leasing anything when the GCS server rebuilds actors after a restart. Please see the <Design Document>.

    Changes on worker side

    • Invoke gcs_rpc_client.CreateActor in the callback of ResolveDependencies.
    • Quickly reply to the GCS server without creating anything if the worker is already bound to an actor when the GCS server rebuilds actors after a restart.

    Related issue number

    Checks

    • [ ] I've run scripts/format.sh to lint the changes in this PR.
    • [ ] I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
    • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
    opened by wumuzi520 107
  • [carla] [rllib] Add support for carla nav planner and scenarios from paper

    [carla] [rllib] Add support for carla nav planner and scenarios from paper

    What do these changes do?

    This adds navigation input from the carla planner, and also the ability to run all the scenarios from the CoRL 2017 paper.

    Train scripts are updated to use a custom model that supports the planner input and nav metrics.

    opened by ericl 107
  • [RLlib] Move all jenkins RLlib-tests into bazel (rllib/BUILD).

    [RLlib] Move all jenkins RLlib-tests into bazel (rllib/BUILD).

    Why are these changes needed?

    Related issue number

    Checks

    • [x] I've run scripts/format.sh to lint the changes in this PR.
    • [x] I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
    • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
    opened by sven1977 102
  • Experimental asyncio support

    Experimental asyncio support

    What do these changes do?

    This is a prototype implementation for https://github.com/ray-project/ray/issues/493, which provides an awaitable interface for ray.wait and Ray's ObjectID (see the usage sketch below).

    As a prototype, this code is meant to be modified later.

    How do these changes work?

    1. AsyncPlasmaClient is implemented to override the original pyarrow.plasma.PlasmaClient. pyarrow.plasma.PlasmaClient is created by pyarrow.plasma.connect and is attached to ray.worker.global_worker to handle basic Ray functions. It also creates an interface for wrapping Ray's ObjectID.
    2. AsyncPlasmaSocket is created for async socket messaging with the PlasmaStore & PlasmaManager. It is the core of the async support. pyarrow.plasma.PlasmaClient does not make use of event loops and only creates a single socket connection, which is why the original Ray does not support many async functions. AsyncPlasmaSocket uses an asyncio event loop and is capable of creating multiple socket connections with the PlasmaManager.
    3. plasma.fbs under the format directory needs to be compiled with flatbuffers ahead of time.
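
    A minimal sketch of the kind of awaitable usage this prototype targets, written against Ray's later asyncio support where an ObjectRef can be awaited inside a coroutine (illustration only, not code from this PR):

    import asyncio
    import ray

    ray.init()

    @ray.remote
    def slow_square(x):
        return x * x

    async def main():
        # Awaiting the ObjectRef suspends the coroutine until the result is ready.
        result = await slow_square.remote(4)
        print(result)  # 16

    asyncio.run(main())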

    Related issue number

    https://github.com/ray-project/ray/issues/493

    cc @mitar

    opened by suquark 101
  • [air] data_batch_conversion doesn't work properly

    [air] data_batch_conversion doesn't work properly

    What happened + What you expected to happen

    https://github.com/ray-project/ray/blob/75d08b06328d213656e7280639b35ccecdfc34d0/python/ray/air/util/data_batch_conversion.py#L78-L81

    This logic doesn't work --

    1. I write my preprocessors to concatenate my dataframe into a single column containing a list of tensors (DataFrame["Single"])
    2. But when I try to run inference with BatchPredictor, the above code converts that single column into an array of lists, which makes it impossible to tensorize further

    Code example below

    Small code snippet:

    df = pd.DataFrame([[[123,234,234]], [[124,25,235]], [[1267,267,2345]]], columns = ['A']) 
    df.to_numpy()                                                                                                                                  
    Out[9]: 
    array([[list([123, 234, 234])],
           [list([124, 25, 235])],
           [list([1267, 267, 2345])]], dtype=object)
    

    Versions / Dependencies

    Master

    Reproduction script

    
    # isort: skip_file
    
    # __air_pytorch_preprocess_start__
    import numpy as np
    import ray
    from ray.data.preprocessors import StandardScaler, BatchMapper, Chain
    
    import pandas as pd
    
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    
    data_raw = load_breast_cancer()
    dataset_df = pd.DataFrame(data_raw["data"], columns=data_raw["feature_names"])
    dataset_df["target"] = data_raw["target"]
    
    train_df, test_df = train_test_split(dataset_df, test_size=0.3)
    train_dataset = ray.data.from_pandas(train_df)
    valid_dataset = ray.data.from_pandas(test_df)
    test_dataset = ray.data.from_pandas(test_df.drop("target", axis=1))
    
    schema_order = [k for k in train_dataset.schema().names if k != "target"]
    
    def concat_for_tensor(dataframe):
        result = []
        for i, single_dict in dataframe.iterrows():
            tensor = [single_dict[key] for key in schema_order]
            result_dict = {"input": tensor}
            if "target" in single_dict:
                 result_dict["target"] = single_dict["target"]
            result.append(result_dict)
        return  pd.DataFrame(result)
    
    # Create a preprocessor to scale some columns
    columns_to_scale = ["mean radius", "mean texture"]
    preprocessor = Chain(
        StandardScaler(columns=columns_to_scale),
        BatchMapper(concat_for_tensor)
    )
    # __air_pytorch_preprocess_end__
    
    
    # __air_pytorch_train_start__
    import torch
    import torch.nn as nn
    from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
    
    from ray import train
    from ray.train.torch import TorchTrainer
    
    def create_model(input_features):
        return nn.Sequential(
            nn.Linear(in_features=input_features, out_features=16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid())
    
    
    def validate_epoch(dataloader, model, loss_fn):
        model.eval()
        test_loss, correct = 0, 0
        evaluated = False
        with torch.no_grad():
            for idx, (inputs, labels) in enumerate(dataloader):
                evaluated = True
                pred = model(inputs)
                test_loss += loss_fn(pred, labels.unsqueeze(1)).item()
                correct += (pred.argmax(1) == labels).type(torch.float).sum().item()
        if not evaluated:
            return
        test_loss /= (idx + 1)
        return test_loss
    
    
    def train_loop_per_worker(config):
        batch_size = config["batch_size"]
        lr = config["lr"]
        epochs = config["num_epochs"]
        num_features = config["num_features"]
    
        # Create data loaders.
        # Get the Ray Dataset shard for this data parallel worker, and convert it to a PyTorch Dataset.
        train_iterator = train.get_dataset_shard("train").iter_batches(
            batch_format="numpy", batch_size=batch_size)
        val_dataloader = train.get_dataset_shard("validate").iter_batches(
            batch_format="numpy", batch_size=batch_size)
    
        def to_tensor_iterator(dict_iterator):
            for d in dict_iterator:
                np_input = np.vstack(d["input"])
                np_label = d["target"]
                yield torch.Tensor(np_input).float(), torch.Tensor(np_label).float()
    
        # Create model.
        model = create_model(num_features)
        model = train.torch.prepare_model(model)
    
        loss_fn = nn.BCELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    
        for _ in range(epochs):
            for inputs, labels in to_tensor_iterator(train_iterator):
                optimizer.zero_grad()
                predictions = model(inputs) 
                loss = loss_fn(predictions, labels.unsqueeze(1))
                loss.backward()
                optimizer.step()
    
            loss = validate_epoch(to_tensor_iterator(val_dataloader), model, loss_fn)
            train.report(loss=loss)
    
            # Checkpoint model after every epoch.
            state_dict = model.state_dict()
            consume_prefix_in_state_dict_if_present(state_dict, "module.")
            train.save_checkpoint(model=state_dict)
    
    num_workers = 2
    use_gpu = False  # use GPUs if detected.
    # Number of epochs to train each task for.
    num_epochs = 4
    # Batch size.
    batch_size = 32
    # Optimizer args.
    learning_rate = 0.001
    momentum = 0.9
    # Get the number of columns of the dataset
    num_features = len(schema_order)
    
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config={
            "num_epochs": num_epochs,
            "lr": learning_rate,
            "momentum": momentum,
            "batch_size": batch_size,
            "num_features": num_features,
        },
        # Have to specify trainer_resources as 0 so that the example works on Colab. 
        scaling_config={"num_workers": num_workers, "use_gpu": use_gpu, "trainer_resources": {"CPU": 0}},
        datasets={"train": train_dataset, "validate": valid_dataset},
        preprocessor=preprocessor
    )
    
    result = trainer.fit()
    print(f"Last result: {result.metrics}")
    # __air_pytorch_train_end__
    
    # __air_pytorch_batchpred_start__
    from ray.train.batch_predictor import BatchPredictor
    from ray.train.torch import TorchPredictor
    
    batch_predictor = BatchPredictor.from_checkpoint(
        result.checkpoint, TorchPredictor, model=create_model(num_features))
    
    # print(test_dataset.show())
    predicted_labels = (
        batch_predictor.predict(test_dataset)
        .map_batches(lambda df: (df > 0.5).astype(int), batch_format="pandas")
        .to_pandas(limit=float("inf"))
    )
    print("PREDICTED LABELS")
    print(f"{predicted_labels}")
    # __air_pytorch_batchpred_end__
    

    Issue Severity

    High: It blocks me from completing my task.

    bug air 
    opened by richardliaw 4
  • [doc][tune] Compatibility table for `Searcher`s and Tune primitives

    [doc][tune] Compatibility table for `Searcher`s and Tune primitives

    Description

    I am trying to study Tune internals and wanted to look at the source code for a Searcher that supports tune.grid_search, but there's no easy way to find this info.

    It would be nice if there were a compatibility table showing which Tune primitives are allowed for each search algorithm.

    Link

    No response

    tune docs 
    opened by sumanthratna 0
  • Gnormmixin

    Gnormmixin

    Why are these changes needed?

    Currently, the global gnorm is not added to metrics in all cases for impala and APPO.

    Checks

    • [x] I've run scripts/format.sh to lint the changes in this PR.
    • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    • Testing Strategy
      • [x] Unit tests
      • [ ] Release tests
      • [ ] This PR is not tested :(
    opened by ArturNiederfahrenhorst 0
  • Update import sorting blacklist, enable sorting for experimental dir

    Update import sorting blacklist, enable sorting for experimental dir

    Why are these changes needed?

    There are directories that we don't lint / format. Ensure this is also the case for the import sorting tool.

    Enable sorting for python/experimental to showcase how to enable sorting for a directory, as we convert more of the directories to be automatically sorted by the tool.

    Related issue number

    https://github.com/ray-project/ray/pull/25678

    Checks

    • [x] I've run scripts/format.sh to lint the changes in this PR.
    • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
    • Testing Strategy
      • [x] Unit tests
      • [ ] Release tests
      • [ ] This PR is not tested :(
    opened by clarng 1
  • [data](deps): Bump dask[complete] from 2022.2.0 to 2022.6.1 in /python/requirements/data_processing

    [data](deps): Bump dask[complete] from 2022.2.0 to 2022.6.1 in /python/requirements/data_processing

    Bumps dask[complete] from 2022.2.0 to 2022.6.1.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies python 
    opened by dependabot[bot] 0
  • [data](deps): Bump moto[s3] from 2.3.1 to 3.1.14 in /python/requirements/data_processing

    [data](deps): Bump moto[s3] from 2.3.1 to 3.1.14 in /python/requirements/data_processing

    Bumps moto[s3] from 2.3.1 to 3.1.14.

    Changelog

    Sourced from moto[s3]'s changelog.

    3.1.14

    Docker Digest for 3.1.14: sha256:a8ad7f54d7c469e34454f6774f743251c02093c6b2d7e9d7961a5de366783e11

    New Methods:
        * Greengrass:
            * create_function_definition()
            * create_resource_definition()
            * create_function_definition_version()
            * create_resource_definition_version()
            * create_subscription_definition()
            * create_subscription_definition_version()
            * delete_function_definition()
            * delete_resource_definition()
            * delete_subscription_definition()
            * get_function_definition()
            * get_function_definition_version()
            * get_resource_definition()
            * get_resource_definition_version()
            * get_subscription_definition()
            * get_subscription_definition_version()
            * list_function_definitions()
            * list_function_definition_versions()
            * list_resource_definitions()
            * list_resource_definition_versions()
            * list_subscription_definitions()
            * list_subscription_definition_versions()
            * update_function_definition()
            * update_resource_definition()
            * update_subscription_definition()
        * EKS:
            * list_tags_for_resources()
            * tag_resource()
            * untag_resource()
        * Route53:
            * associate_vpc()
            * disassociate_vpc_from_hosted_zone()
            * update_health_check()
            * update_hosted_zone_comment()
    

    Miscellaneous:
        * APIGateway: put_integration() now supports the requestParameters parameter
        * EC2: create_route() now validates whether a route already exists

    3.1.13

    Docker Digest for 3.1.13: sha256:d7f6c779c79f03b686747ae26b52bdca26fd81a50c6a41a8a6cba50c96982abf

    New Methods:
    

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies python 
    opened by dependabot[bot] 0
Releases
  • ray-1.13.0(Jun 9, 2022)

    Highlights:

    • Python 3.10 support is now in alpha.
    • Ray usage stats collection is now on by default (guarded by an opt-out prompt).
    • Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
    • Ray Workflow comes with a new API and is integrated with Ray DAG.

    Ray Autoscaler

    💫Enhancements:

    • CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
    • Stability enhancements for KubeRay autoscaler integration (#23428)

    🔨 Fixes:

    • Improved GPU support in KubeRay autoscaler integration (#23383)
    • Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)

    Ray Client

    💫Enhancements:

    • Add option to configure ray.get with >2 sec timeout (#22165)
    • Return None from internal KV for non-existent keys (#24058)

    🔨 Fixes:

    • Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)

    Ray Core

    🎉 New Features:

    • Ray usage stats collection is now on by default (guarded by an opt-out prompt)
    • Alpha support for python 3.10 (on Linux and Mac)
    • Node affinity scheduling strategy (#23381)
    • Add metrics for disk and network I/O (#23546)
    • Improve exponential backoff when connecting to Redis (#24150)
    • Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
    • Add a utility to check GCS / Ray cluster health (#23382)

    🔨 Fixes:

    • Fixed internal storage S3 bugs (#24167)
    • Ensure "get_if_exists" takes effect in the decorator. (#24287)
    • Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
    • Add memory buffer limit in publisher for each subscribed entity (#23707)
    • Use gRPC instead of socket for GCS client health check (#23939)
    • Trim size of Reference struct (#23853)
    • Enable debugging into pickle backend (#23854)

    🏗 Architecture refactoring:

    • Gcs storage interfaces unification (#24211)
    • Cleanup pickle5 version check (#23885)
    • Simplify options handling (#23882)
    • Moved function and actor importer away from pubsub (#24132)
    • Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
    • Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
    • Save task spec in separate table (#22650)

    Ray Datasets

    🎉 New Features:

    • Performance improvement: the aggregation computation is vectorized (#23478)
    • Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
    • Performance improvement: more efficient move semantics for Datasets block processing (#24127)
    • Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
    • Supports native Tensor views in map processing for pure-tensor datasets (#24812)
    • Implemented push-based shuffle (#24281)

    🔨 Fixes:

    • Documentation improvement: Getting Started page (#24860)
    • Documentation improvement: FAQ (#24932)
    • Documentation improvement: End to end examples (#24874)
    • Documentation improvement: Feature guide - Creating Datasets (#24831)
    • Documentation improvement: Feature guide - Saving Datasets (#24987)
    • Documentation improvement: Feature guide - Transforming Datasets (#25033)
    • Documentation improvement: Datasets APIs docstrings (#24949)
    • Performance: fixed block prefetching (#23952)
    • Fixed zip() for Pandas dataset (#23532)

    🏗 Architecture refactoring:

    • Refactored LazyBlockList (#23624)
    • Added path-partitioning support for all content types (#23624)
    • Added fast metadata provider and refactored Parquet datasource (#24094)

    RLlib

    🎉 New Features:

    • Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)

    🏗 Architecture refactoring:

    • More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX , SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
    • Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)
    • POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
    • Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)

    💫Enhancements:

    • Add support for complex observations in CQL. (#23332)
    • Bandit support for tf2. (#22838)
    • Make actions sent by RLlib to the env immutable. (#24262)
    • Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
    • Enable DD-PPO to run on Windows. (#23673)

    🔨 Fixes:

    • APPO eager fix (APPOTFPolicy gets wrapped as_eager() twice by mistake). (#24268)
    • CQL gets stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)
    • SlateQ runs on GPU (torch). (#23464)
    • Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429

    Ray Workflow

    🎉 New Features:

    • Workflow step is deprecated (#23796, #23728, #23456, #24210)

    🔨 Fixes:

    • Fix one bug where max_retries is not aligned with ray core’s max_retries. (#22903)

    🏗 Architecture refactoring:

    • Integrate ray storage in workflow (#24120)

    Tune

    🎉 New Features:

    • Add RemoteTask based sync client (#23605) (rsync not required anymore!)
    • Chunk file transfers in cross-node checkpoint syncing (#23804)
    • Also interrupt training when SIGUSR1 received (#24015)
    • reuse_actors per default for function trainables (#24040)
    • Enable AsyncHyperband to continue training for last trials after max_t (#24222)

    💫Enhancements:

    • Improve testing (#23229)
    • Improve docstrings (#23375)
    • Improve documentation (#23477, #23924)
    • Simplify trial executor logic (#23396)
    • Make MLflowLoggerUtil copyable (#23333)
    • Use new Checkpoint interface internally (#22801)
    • Beautify Optional typehints (#23692)
    • Improve missing search dependency info (#23691)
    • Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
    • Treat checkpoints with nan value as worst (#23862)
    • Clean up base ProgressReporter API (#24010)
    • De-clutter log outputs in trial runner (#24257)
    • hyperopt searcher to support tune.choice([[1,2],[3,4]]). (#24181)

    🔨Fixes:

    • Optuna should ignore additional results after trial termination (#23495)
    • Fix PTL multi GPU link (#23589)
    • Improve Tune cloud release tests for durable storage (#23277)
    • Fix tensorflow distributed trainable docstring (#23590)
    • Simplify experiment tag formatting, clean directory names (#23672)
    • Don't include nan metrics for best checkpoint (#23820)
    • Fix syncing between nodes in placement groups (#23864)
    • Fix memory resources for head bundle (#23861)
    • Fix empty CSV headers on trial restart (#23860)
    • Fix checkpoint sorting with nan values (#23909)
    • Make Timeout stopper work after restoring in the future (#24217)
    • Small fixes to tune-distributed for new restore modes (#24220)

    Train

    Most distributed training enhancements will be captured in the new Ray AIR category!

    🔨Fixes:

    • Copy resources_per_worker to avoid modifying user input
    • Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
    • Fix multi node horovod bug (#22564)
    • Fully deprecate Ray SGD v1 (#24038)
    • Improvements to fault tolerance (#22511)
    • MLflow start run under correct experiment (#23662)
    • Raise helpful error when required backend isn't installed (#23583)
    • Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)

    📖Documentation:

    • add FAQ (#22757)

    Ray AIR

    🎉 New Features:

    • HuggingFaceTrainer & HuggingFacePredictor (#23615, #23876)
    • SklearnTrainer & SklearnPredictor (#23803, #23850)
    • HorovodTrainer (#23437)
    • RLTrainer & RLPredictor (#23465, #24172)
    • BatchMapper preprocessor (#23700)
    • Categorizer preprocessor (#24180)
    • BatchPredictor (#23808)

    💫Enhancements:

    • Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)
    • Add config to Result, extend ResultGrid.get_best_config (#23698)
    • Add Scaling Config validation (#23889)
    • Add tuner test. (#23364)
    • Move storage handling to pyarrow.fs.FileSystem (#23370)
    • Refactor _get_unique_value_indices (#24144)
    • Refactor most_frequent SimpleImputer (#23706)
    • Set name of Trainable to match with Trainer #23697
    • Use checkpoint.as_directory() instead of cleaning up manually (#24113)
    • Improve file packing/unpacking (#23621)
    • Make Dataset ingest configurable (#24066)
    • Remove postprocess_checkpoint (#24297)

    🔨Fixes:

    • Better exception handling (#23695)
    • Do not deepcopy RunConfig (#23499)
    • reduce unnecessary stacktrace (#23475)
    • Tuner should use run_config from Trainer per default (#24079)
    • Use custom fsspec handler for GS (#24008)

    📖Documentation:

    • Add distributed torch_geometric example (#23580)
    • GNN example cleanup (#24080)

    Serve

    🎉 New Features:

    • Serve logging system was revamped! Access log is now turned on by default. (#23558)
    • New Gradio notebook example for Ray Serve deployments (#23494)
    • Serve now includes full traceback in deployment update error message (#23752)

    💫Enhancements:

    • Serve Deployment Graph was enhanced with performance fixes and structural clean up. (#24199, #24026, #24065, #23984)
    • End to end tutorial for deployment graph (#23512, #22771, #23536)
    • input_schema is now renamed as http_adapter for usability (#24353, #24191)
    • Progress towards a declarative REST API (#23232, #23481)
    • Code cleanup and refactoring (#24067, #23578, #23934, #23759)
    • Protobuf based controller API for cross language client (#23004)

    🔨Fixes:

    • Handle None in ReplicaConfig's resource_dict (#23851)
    • Set "memory" to None in ray_actor_options by default (#23619)
    • Make serve.shutdown() shutdown remote Serve applications (#23476)
    • Ensure replica reconfigure runs after allocation check (#24052)
    • Allow cloudpickle serializable objects as init args/kwargs (#24034)
    • Use controller namespace when getting actors (#23896)

    Dashboard

    🔨Fixes:

    • Add toggle to enable showing node disk usage on K8s (#24416, #24440)
    • Add job submission id as field to job snapshot (#24303)

    Thanks

    Many thanks to all those who contributed to this release! @matthewdeng, @scv119, @xychu, @iycheng, @takeshi-yoshimura, @iasoon, @wumuzi520, @thetwotravelers, @maxpumperla, @krfricke, @jgiannuzzi, @kinalmehta, @avnishn, @dependabot[bot], @sven1977, @raulchen, @acxz, @stephanie-wang, @mgelbart, @xwjiang2010, @jon-chuang, @pdames, @ericl, @edoakes, @gjoseph92, @ddelange, @bkasper, @sriram-anyscale, @Zyiqin-Miranda, @rkooo567, @jbedorf, @architkulkarni, @osanseviero, @simonsays1980, @clarkzinzow, @DmitriGekhtman, @ashione, @smorad, @andenrx, @mattip, @bveeramani, @chaokunyang, @richardliaw, @larrylian, @Chong-Li, @fwitter, @shrekris-anyscale, @gjoliver, @simontindemans, @silky, @grypesc, @ijrsvt, @daikeshi, @kouroshHakha, @mwtian, @mesjou, @sihanwang41, @PavelCz, @czgdp1807, @jianoaix, @GuillaumeDesforges, @pcmoritz, @arsedler9, @n30111, @kira-lin, @ckw017, @max0x7ba, @Yard1, @XuehaiPan, @lchu-ibm, @HJasperson, @SongGuyang, @amogkam, @liuyang-my, @WangTaoTheTonic, @jovany-wang, @simon-mo, @dynamicwebpaige, @suquark, @ArturNiederfahrenhorst, @jjyao, @KepingYan, @jiaodong, @frosk1

  • ray-1.12.1(May 16, 2022)

    Patch release with the following fixes:

    • Ray now works on Google Colab again! The bug with memory limit fetching when running Ray in a container is now fixed (https://github.com/ray-project/ray/pull/23922).
    • ray-ml Docker images for CPU will start being built again after they were stopped in Ray 1.9 (https://github.com/ray-project/ray/pull/24266).
    • [Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (https://github.com/ray-project/ray/pull/23662).
    • [RLlib] Fix for APPO in eager mode (https://github.com/ray-project/ray/pull/24268).
    • [RLlib] Fix Alphastar for TF2 and tracing enabled (https://github.com/ray-project/ray/commit/c5502b2aa57376b26408bb297ff68696c16f48f1).
    • [Serve] Fix replica leak in anonymous namespaces (https://github.com/ray-project/ray/pull/24311).
  • ray-1.11.1(May 10, 2022)

    Patch release including fixes for the following issues:

    • Ray Job Submission not working with remote working_dir URLs in their runtime environment (https://github.com/ray-project/ray/pull/22018)
    • Ray Tune + MLflow integration failing to set MLflow experiment ID (https://github.com/ray-project/ray/pull/23662)
    • Dependencies for gym not pinned, leading to version incompatibility issues (https://github.com/ray-project/ray/pull/23705)
  • ray-1.12.0(Apr 8, 2022)

    Highlights

    • Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information on the docs or on the public RFC.
      • Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!
    • Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.
    • New algorithms are added to RLlib: SlateQ & Bandits (for recommender systems use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play)
    • Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.

    Ray Autoscaler

    🎉 New Features

    • Support cache_stopped_nodes on Azure (#21747)
    • AWS Cloudwatch support (#21523)

    💫 Enhancements

    • Improved documentation and standards around built-in autoscaler node providers. (#22236, #22237)
    • Improved KubeRay support (#22987, #22847, #22348, #22188)
    • Remove redis requirement (#22083)

    🔨 Fixes

    • No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)
    • Default ami’s per AWS region are updated/fixed. (#22506)
    • GCP node termination updated (#23101)
    • Retry legacy k8s operator on monitor failure (#22792)
    • Cap min and max workers for manually managed on-prem clusters (#21710)
    • Fix initialization artifacts (#22570)
    • Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)

    Ray Client

    🎉 New Features:

    • ray.init has consistent return value in client mode and driver mode #21355

    💫Enhancements:

    • Gets and puts are streamed to support arbitrary object sizes #22100, #22327

    🔨 Fixes:

    • Fix ray client object ref releasing in wrong context #22025

    Ray Core

    🎉 New Features

    • RuntimeEnv:
      • Support setting timeout for runtime_env setup. (#23082)
      • Support setting pip_check and pip_version for runtime_env. (#22826, #23306)
      • env_vars will take effect when the pip install command is executed. (temporarily ineffective in conda) (#22730)
      • Support strongly-typed API ray.runtime.RuntimeEnv to define runtime env. (#22522)
      • Introduce virtualenv to isolate the pip type runtime env. (#21801,#22309)
    • The Raylet now shares fate with the dashboard agent, and the dashboard agent will stay alive when it catches port conflicts. (#22382, #23024)
    • Enable dashboard in the minimal ray installation (#21896)
    • Add task and object reconstruction status to ray memory cli tools(#22317)

    🔨 Fixes

    • Report only memory usage of pinned object copies to improve scaledown. (#22020)
    • Scheduler:
      • No spreading if a node is selected for lease request due to locality. (#22015)
      • Placement group scheduling: Non-STRICT_PACK PGs should be sorted by resource priority, size (#22762)
      • Round robin during spread scheduling (#21303)
    • Object store:
      • Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
      • Cleanup handling for nondeterministic object size during transfer (#22639)
      • Fix bug in fusion for spilled objects (#22571)
      • Handle IO worker failures correctly (#20752)
    • Improve ray stop behavior (#22159)
    • Avoid warning when receiving too much logs from a different job (#22102)
    • Gcs resource manager bug fix and clean up. (#22462, #22459)
    • Release GIL when running parallel_memcopy() / memcpy() during serializations. (#22492)
    • Fix registering serializer before initializing Ray. (#23031)

    🏗 Architecture refactoring

    • Ray distributed scheduler refactoring: (#21927, #21992, #22160, #22359, #22722, #22817, #22880, #22893, #22885, #22597, #22857, #23124)
    • Removed support for bootstrapping with Redis.

    Ray Data Processing

    🎉 New Features

    • Big Performance and Stability Improvements:
      • Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)
      • Support for random access datasets, providing efficient random access to rows via binary search (#22749)
      • Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the _spread_resource_prefix hack (#21303)
    • More Efficient Tabular Data Wrangling:
      • Add first-class support for Pandas blocks, removing expensive Arrow <-> Pandas conversion costs (#21894)
      • Expose TableRow API + minimize copies/type-conversions on row-based ops (#22305)
    • Groupby + Aggregations Improvements:
      • Support mapping over groupby groups (#22715)
      • Support ignoring nulls in aggregations (#20787)
    • Improved Dataset Windowing:
      • Support windowing a dataset by bytes instead of number of blocks (#22577)
      • Batch across windows in DatasetPipelines (#22830)
    • Better Text I/O:
      • Support streaming snappy compression for text files (#22486)
      • Allow for custom decoding error handling in read_text() (#21967)
      • Add option for dropping empty lines in read_text() (#22298)
    • New Operations:
      • Add add_column() utility for adding derived columns (#21967)
    • Support for metadata provider callback for read APIs (#22896)
    • Support configuring autoscaling actor pool size (#22574)

    🔨 Fixes

    • Force lazy datasource materialization in order to respect DatasetPipeline stage boundaries (#21970)
    • Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (#22007)
    • Respect 0 CPU resource request when using manual resource-based load balancing (#22017)
    • Remove batch format ambiguity by always converting Arrow batches to Pandas when batch_format=”native” is given (#21566)
    • Fix leaked stats actor handle due to closure capture reference counting bug (#22156)
    • Fix boolean tensor column representation and slicing (#22323)
    • Fix unhandled empty block edge case in shuffle (#22367)
    • Fix unserializable Arrow Partitioning spec (#22477)
    • Fix incorrect iter_epochs() batch format (#22550)
    • Fix infinite iter_epochs() loop on unconsumed epochs (#22572)
    • Fix infinite hang on split() when num_shards < num_rows (#22559)
    • Patch Parquet file fragment serialization to prevent metadata fetching (#22665)
    • Don’t reuse task workers for actors or GPU tasks (#22482)
    • Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)
    • Always use non-empty blocks to determine schema (#22834)
    • API fix bash (#22886)
    • Make label_column optional for to_tf() so it can be used for inference (#22916)
    • Fix schema() for DatasetPipelines (#23032)
    • Fix equalized split when num_splits == num_blocks (#23191)

    💫 Enhancements

    • Optimize Parquet metadata serialization via batching (#21963)
    • Optimize metadata read/write for Ray Client (#21939)
    • Add sanity checks for memory utilization (#22642)

    🏗 Architecture refactoring

    • Use threadpool to submit DatasetPipeline stages (#22912)

    RLlib

    🎉 New Features

    • New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)
    • SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)
    • Bandit algorithms: Moved into agents folder as first-class citizens, TensorFlow-Version, unified w/ other agents’ APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)
    • ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)
    • Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)

    🔨 Fixes

    • Fixed memory leak in SimpleReplayBuffer. (#22678)
    • Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (#22247)
    • Various bug fixes. (#22350, #22245, #22171, #21697, #21855, #22076, #22590, #22587, #22657, #22428, #23063, #22619, #22731, #22534, #22074, #22078, #22641, #22684, #22398, #21685)

    🏗 Architecture refactoring

    • A3C: Moved into the new training_iteration API (from the execution_plan API). This led to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (#22126, #22316)
    • Make multiagent->policies_to_train more flexible via callable option (alternative to providing a list of policy IDs). (#20735)

    💫Enhancements:

    • Env pre-checking module now active by default. (#22191)
    • Callbacks: Added on_sub_environment_created and on_trainer_init callback options. (#21893, #22493)
    • RecSim environment wrappers: Ability to use google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)
    • MARWIL loss function enhancement (exploratory term for stddev). (#21493)

    📖Documentation:

    • Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239)
    • Other doc enhancements and fixes. (#23160, #23226, #22496, #22489, #22380)

    Ray Workflow

    🎉 New Features:

    • Support skipping checkpointing.

    🔨 Fixes:

    • Fix an issue where the event loop is not set.

    Tune

    🎉 New Features:

    • Expose new checkpoint interface to users (#22741)

    💫Enhancements:

    • Better error msg for grpc resource exhausted error. (#22806)
    • Add CV support for XGB/LGBM Tune callbacks (#22882)
    • Make sure tune.run can run inside a worker thread (https://github.com/ray-project/ray/commit/b8c28d1f2beb7a141f80a5fd6053c8e8520718b9, #22566)
    • Add Trainable.postprocess_checkpoint (#22973)
    • Trainables will now know TUNE_ORIG_WORKING_DIR (#22803)
    • Retry cloud sync up/down/delete on fail (#22029)
    • Support functools.partial names and treat as function in registry (#21518)

    🔨Fixes:

    • Cleanup incorrectly formatted strings (Part 2: Tune) (#23129)
    • fix error handling for fail_fast case. (#22982)
    • Remove Trainable.update_resources (#22471)
    • Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (#22071)
    • Fix analysis without registered trainable (#21475)
    • Update Lightning examples to support PTL 1.5 (#20562)
    • Fix WandbTrainableMixin config for rllib trainables (#22063)
    • [wandb] Use resume=False per default (#21892)

    📖Documentation:

    • Tune docs overhaul (first part) (#22112)
    • Tune overhaul part II (#22656)
    • Note TPESampler performance issues in docs (#22545)
    • hyperopt notebook (#22315)

    Train

    🎉 New Features

    • Integration with PyTorch profiler. Easily enable the pytorch profiler with Ray Train to profile training and visualize stats in Tensorboard (#22345).
    • Automatic pipelining of host to device transfer. While training is happening on one batch of data, the next batch of data is concurrently being moved from CPU to GPU (#22716, #22974)
    • Automatic Mixed Precision. Easily enable PyTorch automatic mixed precision during training (#22227).

    💫 Enhancements

    • Add utility function to enable reproducibility for Pytorch training (#22851)
    • Add initial support for metrics aggregation (#22099)
    • Add support for trainer.best_checkpoint and Trainer.load_checkpoint_path. You can now directly access the best in memory checkpoint, or load an arbitrary checkpoint path to memory. (#22306)

    🔨 Fixes

    • Add a utility function to turn off TF autosharding (#21887)
    • Fix fault tolerance for Tensorflow training (#22508)
    • Train utility methods (train.report(), etc.) can now be called outside of a Train session (#21969)
    • Fix accuracy calculation for CIFAR example (#22292)
    • Better error message for placement group time out (#22845)

    📖 Documentation

    • Update docs for ray.train.torch import (#22555)
    • Clarify shuffle documentation in prepare_data_loader (#22876)
    • Denote train.torch.get_device as a Public API (#22024)
    • Minor fixes on Ray Train user guide doc (#22379)

    Serve

    🎉 New Features

    • Deployment Graph API is now in alpha. It provides a way to build, test, and deploy complex inference graphs composed of many deployments. (#23177, #23252, #23301, #22840, #22710, #22878, #23208, #23290, #23256, #23324, #23289, #23285, #22473, #23125, #23210)
    • New experimental REST API and CLI for creating and managing deployments. ( #22839, #22257, #23198, #23027, #22039, #22547, #22578, #22611, #22648, #22714, #22805, #22760, #22917, #23059, #23195, #23265, #23157, #22706, #23017, #23026, #23215)
    • New sets of HTTP adapters making it easy to build simple applications, as well as Ray AI Runtime model wrappers in alpha. (#22913, #22914, #22915, #22995)
    • New health_check API for end to end user provided health check. (#22178, #22121, #22297)

    🔨 Fixes

    • Autoscaling algorithm will now relinquish most idle nodes when scaling down (#22669)
    • Serve can now manage Java replicas (#22628)
    • Added a hands-on self-contained MLflow and Ray Serve deployment example (#22192)
    • Added root_path setting to http_options (#21090)
    • Remove shard_key, http_method, and http_headers in ServeHandle (#21590)

    Dashboard

    🔨Fixes:

    • Update CPU and memory reporting in kubernetes. (#21688)

    Thanks

    Many thanks to all those who contributed to this release! @edoakes, @pcmoritz, @jiaodong, @iycheng, @krfricke, @smorad, @kfstorm, @jjyyxx, @rodrigodelazcano, @scv119, @dmatrix, @avnishn, @fyrestone, @clarkzinzow, @wumuzi520, @gramhagen, @XuehaiPan, @iasoon, @birgerbr, @n30111, @tbabej, @Zyiqin-Miranda, @suquark, @pdames, @tupui, @ArturNiederfahrenhorst, @ashione, @ckw017, @siddgoel, @Catch-Bull, @vicyap, @spolcyn, @stephanie-wang, @mopga, @Chong-Li, @jjyao, @raulchen, @sven1977, @nikitavemuri, @jbedorf, @mattip, @bveeramani, @czgdp1807, @dependabot[bot], @Fabien-Couthouis, @willfrey, @mwtian, @SlowShip, @Yard1, @WangTaoTheTonic, @Wendi-anyscale, @kaushikb11, @kennethlien, @acxz, @DmitriGekhtman, @matthewdeng, @mraheja, @orcahmlee, @richardliaw, @dsctt, @yupbank, @Jeffwan, @gjoliver, @jovany-wang, @clay4444, @shrekris-anyscale, @jwyyy, @kyle-chen-uber, @simon-mo, @ericl, @amogkam, @jianoaix, @rkooo567, @maxpumperla, @architkulkarni, @chenk008, @xwjiang2010, @robertnishihara, @qicosmos, @sriram-anyscale, @SongGuyang, @jon-chuang, @wuisawesome, @valiantljk, @simonsays1980, @ijrsvt

    Source code(tar.gz)
    Source code(zip)
  • ray-1.11.0(Mar 9, 2022)

    Highlights

    🎉 Ray no longer starts Redis by default. Cluster metadata previously stored in Redis is stored in the GCS now.

    Ray Autoscaler

    🎉 New Features

    • AWS Cloudwatch dashboard support #20266

    💫 Enhancements

    • Kuberay autoscaler prototype #21086

    🔨 Fixes

    • Ray.autoscaler.sdk import issue #21795

    Ray Core

    🎉 New Features

    • Set actor died error message in ActorDiedError #20903
    • Event stats is enabled by default #21515

    🔨 Fixes

    • Better support for nested tasks
    • Fixed 16GB mac perf issue by limiting the plasma store size to 2GB #21224
    • Fix SchedulingClassInfo.running_tasks memory leak #21535
    • Round robin during spread scheduling #19968

    🏗 Architecture refactoring

    • Refactor scheduler resource reporting public APIs #21732
    • Refactor ObjectManager wait logic to WaitManager #21369

    Ray Data Processing

    🎉 New Features

    • More powerful to_torch() API, providing more control over the GPU batch format. (#21117)
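
    As a small, hedged illustration of to_torch(): label_column and batch_size are pre-existing arguments, while the new GPU batch-format controls from #21117 are not spelled out in these notes and are omitted; the DataFrame below is made up.

    import pandas as pd
    import ray

    ray.init()
    df = pd.DataFrame({"x": [float(i) for i in range(32)], "y": [i % 2 for i in range(32)]})
    ds = ray.data.from_pandas(df)
    # to_torch() yields (features, label) tensor batches of 8 rows each.
    for features, labels in ds.to_torch(label_column="y", batch_size=8):
        print(features.shape, labels.shape)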

    🔨 Fixes

    • Fix simple Dataset sort generating only 1 non-empty block. (#21588)
    • Improve error handling across sorting, groupbys, and aggregations. (#21610, #21627)
    • Fix boolean tensor column representation and slicing. (#22358)

    RLlib

    🎉 New Features

    • Better utils for flattening complex inputs and enable prev-actions for LSTM/attention for complex action spaces. (#21330)
    • MultiAgentEnv pre-checker (#21476)
    • Base env pre-checker. (#21569)

    🔨 Fixes

    • Better defaults for QMix (#21332)
    • Fix contrib/MADDPG + pettingzoo coop-pong-v4. (#21452)
    • Fix action unsquashing causes inf/NaN actions for unbounded action spaces. (#21110)
    • Ignore PPO KL-loss term completely if kl-coeff == 0.0 to avoid NaN values (#21456)
    • unsquash_action and clip_action (when None) cause wrong actions computed by Trainer.compute_single_action. (#21553)
    • Conv2d default filter tests and add default setting for 96x96 image obs space. (#21560)
    • Bring back and fix offline RL (BC & MARWIL) learning tests. (#21574, #21643)
    • SimpleQ should not use a prio. replay buffer. (#21665)
    • Fix video recorder env wrapper. Added test case. (#21670)

    🏗 Architecture refactoring

    • Decentralized multi-agent learning (#21421)
    • Preparatory PR for multi-agent multi-GPU learner (alpha-star style) (#21652)

    Ray Workflow

    🔨 Fixes

    • Fixed workflow recovery issue due to a bug in dynamic output #21571

    Tune

    🎉 New Features

    • It is now possible to load all evaluated points from an experiment into a Searcher (#21506)
    • Add CometLoggerCallback (#20766)

    💫 Enhancements

    • Only sync the checkpoint folder instead of the entire trial folder for cloud checkpoint. (#21658)
    • Add test for heterogeneous resource request deadlocks (#21397)
    • Remove unused return_or_clean_cached_pg (#21403)
    • Remove TrialExecutor.resume_trial (#21225)
    • Leave only one canonical way of stopping a trial (#21021)

    🔨 Fixes

    • Replace deprecated running_sanity_check with sanity_checking in PTL integration (#21831)
    • Fix loading an ExperimentAnalysis object without a registered Trainable (#21475)
    • Fix stale node detection bug (#21516)
    • Fixes to allow tune/tests/test_commands.py to run on Windows (#21342)
    • Deflake PBT tests (#21366)
    • Fix dtype coercion in tune.choice (#21270)

    📖 Documentation

    • Fix typo in schedulers.rst (#21777)

    Train

    🎉 New Features

    • Add PrintCallback (#21261)
    • Add MLflowLoggerCallback (#20802)

    💫 Enhancements

    • Refactor Callback implementation (#21468, #21357, #21262)

    🔨 Fixes

    • Fix Dataloader (#21467)

    📖 Documentation

    • Documentation and example fixes (#21761, #21689, #21464)

    Serve

    🎉 New Features

    • Check out our revamped end-to-end tutorial that walks through the deployment journey! (#20765)

    🔨 Fixes

    • Warn when serve.start() is called with different options (#21562)
    • Detect http.disconnect and cancel requests properly (#21438)

    Thanks Many thanks to all those who contributed to this release! @isaac-vidas, @wuisawesome, @stephanie-wang, @jon-chuang, @xwjiang2010, @jjyao, @MissiontoMars, @qbphilip, @yaoyuan97, @gjoliver, @Yard1, @rkooo567, @talesa, @czgdp1807, @DN6, @sven1977, @kfstorm, @krfricke, @simon-mo, @hauntsaninja, @pcmoritz, @JamieSlome, @chaokunyang, @jovany-wang, @sidward14, @DmitriGekhtman, @ericl, @mwtian, @jwyyy, @clarkzinzow, @hckuo, @vakker, @HuangLED, @iycheng, @edoakes, @shrekris-anyscale, @robertnishihara, @avnishn, @mickelliu, @ndrwnaguib, @ijrsvt, @Zyiqin-Miranda, @bveeramani, @SongGuyang, @n30111, @WangTaoTheTonic, @suquark, @richardliaw, @qicosmos, @scv119, @architkulkarni, @lixin-wei, @Catch-Bull, @acxz, @benblack769, @clay4444, @amogkam, @marin-ma, @maxpumperla, @jiaodong, @mattip, @isra17, @raulchen, @wilsonwang371, @carlogrisetti, @ashione, @matthewdeng

    Source code(tar.gz)
    Source code(zip)
  • ray-1.10.0(Feb 4, 2022)

    Highlights

    • 🎉 Ray Windows support is now in beta – a significant fraction of the Ray test suite is now passing on Windows. We are eager to learn about your experience with Ray 1.10 on Windows, please file issues you encounter at https://github.com/ray-project/ray/issues. In the upcoming releases we will spend more time on making Ray Serve and Runtime Environment tests pass on Windows and on polishing things.

    Ray Autoscaler

    💫Enhancements:

    • Add autoscaler update time to prometheus metrics (#20831)
    • Fewer non terminated nodes calls in autoscaler update (#20359, #20623)

    🔨 Fixes:

    • GCP TPU autoscaling fix (#20311)
    • Scale-down stability fix (#21204)
    • Report node launch failure in driver logs (#20814)

    Ray Client

    💫Enhancements

    • Client task options are encoded with pickle instead of json (#20930)

    Ray Core

    🎉 New Features:

    • runtime_env’s pip field now installs pip packages in your existing environment instead of installing them in a new isolated environment. (#20341)
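
    A minimal sketch of a job-level runtime_env using the pip field described above; the requests dependency and URL are illustrative only.

    import ray

    # With 1.10, these pip packages are installed on top of your existing
    # environment rather than into a fresh isolated one.
    ray.init(runtime_env={"pip": ["requests"]})

    @ray.remote
    def fetch_status():
        import requests
        return requests.get("https://example.org").status_code

    print(ray.get(fetch_status.remote()))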

    🔨 Fixes:

    • Fix bug where specifying runtime_env conda/pip per-job using local requirements file using Ray Client on a remote cluster didn’t work (#20855)
    • Security fixes for log4j2 – the log4j2 version has been bumped to 2.17.1 (#21373)

    💫Enhancements:

    • Allow runtime_env working_dir and py_modules to be pathlib.Path type (#20853, #20810)
    • Add environment variable to skip local runtime_env garbage collection (#21163)
    • Change runtime_env error log to debug log (#20875)
    • Improved reference counting for runtime_env resources (#20789)

    🏗 Architecture refactoring:

    • Refactor runtime_env to use protobuf for multi-language support (#19511)

    📖Documentation:

    • Add more comprehensive runtime_env documentation (#20222, #21131, #20352)

    Ray Data Processing

    🎉 New Features:

    • Added stats framework for debugging Datasets performance (#20867, #21070)
    • [Dask-on-Ray] New config helper for enabling the Dask-on-Ray scheduler (#21114)
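
    A rough sketch of the new Dask-on-Ray config helper, assuming it is exposed as ray.util.dask.enable_dask_on_ray; the dataframe below is illustrative.

    import dask.dataframe as dd
    import pandas as pd
    import ray
    from ray.util.dask import enable_dask_on_ray

    ray.init()
    # Route Dask computations inside this block through the Ray scheduler.
    with enable_dask_on_ray():
        ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=4)
        print(ddf.a.sum().compute())  # executed as Ray tasks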

    💫Enhancements:

    • Reduce memory usage when converting to a Pandas DataFrame (#20921)

    🔨 Fixes:

    • Fix slow block evaluation when splitting (#20693)
    • Fix boundary sampling concatenation on non-uniform blocks (#20784)
    • Fix boolean tensor column slicing (#20905)

    🏗 Architecture refactoring:

    • Refactor table block structure to support more tabular block formats (#20721)

    RLlib

    🎉 New Features:

    • Support for RE3 exploration algorithm (for tf only). (#19551)
    • Environment pre-checks, better failure behavior and enhanced environment API. (#20481, #20832, #20868, #20785, #21027, #20811)

    🏗 Architecture refactoring:

    • Evaluation: Support evaluation setting that makes sure train doesn't ever have to wait for eval to finish (b/c of long episodes). (#20757); Always attach latest eval metrics. (#21011)
    • Soft-deprecate build_trainer() utility function in favor of sub-classing Trainer directly (and overriding some of its methods). (#20635, #20636, #20633, #20424, #20570, #20571, #20639, #20725)
    • Experimental no-flatten option for actions/prev-actions. (#20918)
    • Use SampleBatch instead of an input dict whenever possible. (#20746)
    • Switch off Preprocessors by default for PGTrainer (experimental). (#21008)
    • Toward a Replay Buffer API (cleanups; docstrings; renames; move into rllib/execution/buffers dir) (#20552)

    📖Documentation:

    • Overhaul of auto-API reference pages. (#19786, #20537, #20538, #20486, #20250)
    • README and RLlib landing page overhaul (#20249).
    • Added example containing code to compute an adapted (time-dependent) GAE used by the PPO algorithm (#20850).

    🔨 Fixes:

    • Smaller fixes and enhancements: #20704, #20541, #20793, #20743.

    Tune

    🎉 New Features:

    • Introduce TrialCheckpoint class, making checkpoint download/upload easier (#20585)
    • Add random state to BasicVariantGenerator (#20926)
    • Multi-objective support for Optuna (#20489)

    💫Enhancements:

    • Add set_max_concurrency to Searcher API (#20576)
    • Allow for tuples in _split_resolved_unresolved_values. (#20794)
    • Show the name of training func, instead of just ImplicitFunction. (#21029)
    • Enforce one future at a time for any given trial at any given time. (#20783)
    • Move on_no_available_trials to a subclass under runner (#20809)
    • Clean up code (#20555, #20464, #20403, #20653, #20796, #20916, #21067)
    • Start restricting TrialRunner/Executor interface exposures. (#20656)
    • TrialExecutor should not take in Runner interface. (#20655)

    🔨Fixes:

    • Deflake test_tune_restore.py (#20776)
    • Fix best_trial_str for nested custom parameter columns (#21078)
    • Fix checkpointing error message on K8s (#20559)
    • Fix testResourceScheduler and testMultiStepRun. (#20872)
    • Fix tune cloud tests for function and rllib trainables (#20536)
    • Move _head_bundle_is_empty after conversion (#21039)
    • Elongate test_trial_scheduler_pbt timeout. (#21120)

    Train

    🔨Fixes:

    • Ray Train environment variables are automatically propagated and do not need to be manually set on every node (#20523)
    • Various minor fixes and improvements (#20952, #20893, #20603, #20487)

    📖Documentation:

    • Update saving/loading checkpoint docs (#20973). Thanks @jwyyy!
    • Various minor doc updates (#20877, #20683)

    Serve

    💫Enhancements:

    • Add validation to Serve AutoscalingConfig class (#20779)
    • Add Serve metric for HTTP error codes (#21009)

    🔨Fixes:

    • No longer create placement group for deployment with no resources (#20471)
    • Log errors in deployment initialization/configuration user code (#20620)

    Jobs

    🎉 New Features:

    • Logs can be streamed from job submission server with ray job logs command (#20976)
    • Add documentation for ray job submission (#20530)
    • Propagate custom headers field to JobSubmissionClient and apply to all requests (#20663)

    🔨Fixes:

    • Fix job serve accidentally creates local ray processes instead of connecting (#20705)

    💫Enhancements:

    • [Jobs] Update CLI examples to use the same setup (#20844)

    Thanks

    Many thanks to all those who contributed to this release!

    @dmatrix, @suquark, @tekumara, @jiaodong, @jovany-wang, @avnishn, @simon-mo, @iycheng, @SongGuyang, @ArturNiederfahrenhorst, @wuisawesome, @kfstorm, @matthewdeng, @jjyao, @chenk008, @Sertingolix, @larrylian, @czgdp1807, @scv119, @duburcqa, @runedog48, @Yard1, @robertnishihara, @geraint0923, @amogkam, @DmitriGekhtman, @ijrsvt, @kk-55, @lixin-wei, @mvindiola1, @hauntsaninja, @sven1977, @Hankpipi, @qbphilip, @hckuo, @newmanwang, @clay4444, @edoakes, @liuyang-my, @iasoon, @WangTaoTheTonic, @fgogolli, @dproctor, @gramhagen, @krfricke, @richardliaw, @bveeramani, @pcmoritz, @ericl, @simonsays1980, @carlogrisetti, @stephanie-wang, @AmeerHajAli, @mwtian, @xwjiang2010, @shrekris-anyscale, @n30111, @lchu-ibm, @Scalsol, @seonggwonyoon, @gjoliver, @qicosmos, @xychu, @iamhatesz, @architkulkarni, @jwyyy, @rkooo567, @mattip, @ckw017, @MissiontoMars, @clarkzinzow

    Source code(tar.gz)
    Source code(zip)
  • ray-1.9.2(Jan 11, 2022)

  • ray-1.9.1(Dec 22, 2021)

    Patch release to bump the log4j2 version from 2.14 to 2.16. This resolves the security vulnerabilities https://nvd.nist.gov/vuln/detail/CVE-2021-44228 and https://nvd.nist.gov/vuln/detail/CVE-2021-45046.

    No library or core changes included.

    Thanks @seonggwonyoon and @ijrsvt for contributing the fixes!

    Source code(tar.gz)
    Source code(zip)
  • ray-1.9.0(Dec 3, 2021)

    Highlights

    • Ray Train is now in beta! If you are using Ray Train, we’d love to hear your feedback here!
    • Ray Docker images for multiple CUDA versions are now provided (#19505)! You can specify a -cuXXX suffix to pick a specific version.
      • ray-ml:cpu images are now deprecated. The ray-ml images are only built for GPU.
    • Ray Datasets now supports groupby and aggregations! See the groupby API and GroupedDataset docs for usage.
    • We are making continuing progress in improving Ray stability and usability on Windows. We encourage you to try it out and report feedback or issues at https://github.com/ray-project/ray/issues.
    • We are launching a Ray Job Submission server + CLI & SDK clients to make it easier to submit and monitor Ray applications when you don’t want an active connection using Ray Client. This is currently in alpha, so the APIs are subject to change, but please test it out and file issues / leave feedback on GitHub & discuss.ray.io!

    Ray Autoscaler

    💫Enhancements:

    • Graceful termination of Ray nodes prior to autoscaler scale down (#20013)
    • Ray Clusters on AWS are colocated in one Availability Zone to reduce costs & latency (#19051)

    Ray Client

    🔨 Fixes:

    • ray.put on a list of objects now returns a single object ref (#19737)

    Ray Core

    🎉 New Features:

    • Support remote file storage for runtime_env (#20280, #19315)
    • Added ray job submission client, cli and rest api (#19567, #19657, #19765, #19845, #19851, #19843, #19860, #19995, #20094, #20164, #20170, #20192, #20204)

    💫Enhancements:

    • Garbage collection for runtime_env (#20009, #20072)
    • Improved logging and error messages for runtime_env (#19897, #19888, #18893)

    🔨 Fixes:

    • Fix runtime_env hanging issues (#19823)
    • Fix specifying runtime env in @ray.remote decorator with Ray Client (#19626)
    • Threaded actor / core worker / named actor race condition fixes (#19751, #19598, #20178, #20126)

    📖Documentation:

    • New page “Handling Dependencies”
    • New page “Ray Job Submission: Going from your laptop to production”

    Ray Java

    API Changes:

    • Fully supported namespace APIs. (Check out the namespace documentation for more information.) #19468 #19986 #20057
    • Removed global named actor APIs and global placement group APIs. #20219 #20135
    • Added timeout parameter for Ray.Get() API. #20282

    Note:

    • Use Ray.getActor(name, namespace) API to get a named actor between jobs instead of Ray.getGlobalActor(name).
    • Use PlacementGroup.getPlacementGroup(name, namespace) API to get a placement group between jobs instead of PlacementGroup.getGlobalPlacementGroup(name).

    Ray Datasets

    🎉 New Features:

    • Added groupby and aggregations (#19435, #19673, #20010, #20035, #20044, #20074); a short usage sketch follows this list.
    • Support custom write paths (#19347)
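
    A tiny, hedged sketch of the groupby/aggregation API mentioned above; the column names and from_items data are made up for illustration.

    import ray

    ray.init()
    ds = ray.data.from_items([{"group": i % 3, "value": i} for i in range(12)])
    # Per-group row counts and per-group sums over the "value" column.
    print(ds.groupby("group").count().take())
    print(ds.groupby("group").sum("value").take())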

    🔨 Fixes:

    • Support custom CSV write options (#19378)

    🏗 Architecture refactoring:

    • Optimized block compaction (#19681)

    Ray Workflow

    🎉 New Features:

    • Workflows now support events (#19239)
    • Allow user to specify metadata for workflow and steps (#19372)
    • Allow running a step in-place if the resources match (#19928)

    🔨 Fixes:

    • Fix the s3 path issue (#20115)

    RLlib

    🏗 Architecture refactoring:

    • “framework=tf2” + “eager_tracing=True” is now (almost) as fast as “framework=tf”. A check for tf2.x eager re-traces has been added making sure re-tracing does not happen outside the initial function calls. All CI learning tests (CartPole, Pendulum, FrozenLake) are now also run as framework=tf2. (#19273, #19981, #20109)
    • Prepare deprecation of build_trainer/build_(tf_)?policy utility functions. Instead, use sub-classing of Trainer or Torch|TFPolicy. POCs done for PGTrainer, PPO[TF|Torch]Policy. (#20055, #20061)
    • V-trace (APPO & IMPALA): Keeping (instead of dropping) the last timestep can now optionally be switched on. The default is still to drop it, but this may be changed in a future release. (#19601)
    • Upgrade to gym 0.21. (#19535)

    🔨 Fixes:

    • Minor bugs/issues fixes and enhancements: #19069, #19276, #19306, #19408, #19544, #19623, #19627, #19652, #19693, #19805, #19807, #19809, #19881, #19934, #19945, #20095, #20128, #20134, #20144, #20217, #20283, #20366, #20387

    📖Documentation:

    • RLlib main page (“RLlib in 60sec”) overhaul. (#20215, #20248, #20225, #19932, #19982)
    • Major docstring cleanups in preparation for complete overhaul of API reference pages. (#19784, #19783, #19808, #19759, #19829, #19758, #19830)
    • Other documentation enhancements. (#19908, #19672, #20390)

    Tune

    💫Enhancements:

    • Refactored and improved experiment analysis (#20197, #20181)
    • Refactored cloud checkpointing API/SyncConfig (#20155, #20418, #19632, #19641, #19638, #19880, #19589, #19553, #20045, #20283)
    • Remove magic results (e.g. config) before calculating trial result metrics (#19583)
    • Removal of tech debt (#19773, #19960, #19472, #17654)
    • Improve testing (#20016, #20031, #20263, #20210, #19730)
    • Various enhancements (#19496, #20211)

    🔨Fixes:

    • Documentation fixes (#20130, #19791)
    • Tutorial fixes (#20065, #19999)
    • Drop 0 value keys from PGF (#20279)
    • Fix shim error message for scheduler (#19642)
    • Avoid looping through _live_trials twice in _get_next_trial. (#19596)
    • clean up legacy branch in update_avail_resources. (#20071)
    • fix Train/Tune integration on Client (#20351)

    Train

    Ray Train is now in Beta! The beta version includes various usability improvements for distributed PyTorch training and checkpoint management, support for Ray Client, and an integration with Ray Datasets for distributed data ingest.

    Check out the docs here, and the migration guide from Ray SGD to Ray Train here. If you are using Ray Train, we’d love to hear your feedback here!

    🎉 New Features:

    • New train.torch.prepare_model(...) and train.torch.prepare_data_loader(...) API to automatically handle preparing your PyTorch model and DataLoader for distributed training (#20254); a minimal usage sketch follows this list.
    • Checkpoint management and support for custom checkpoint strategies (#19111).
    • Easily configure what and how many checkpoints to save to disk.
    • Support for Ray Client (#20123, #20351).
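
    A minimal sketch of prepare_model / prepare_data_loader inside a Ray Train training function; the toy model, data, and hyperparameters are made up, and the surrounding Trainer calls follow the 1.9-era ray.train API.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from ray import train
    from ray.train import Trainer
    from ray.train.torch import prepare_data_loader, prepare_model

    def train_func():
        # Wraps the model in DDP and moves it to the right device.
        model = prepare_model(torch.nn.Linear(4, 1))
        data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
        # Adds a DistributedSampler and handles device placement for batches.
        loader = prepare_data_loader(DataLoader(data, batch_size=8))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for X, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X), y)
            loss.backward()
            opt.step()
        train.report(loss=loss.item())

    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()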

    💫Enhancements:

    • Simplify workflow for training with a single worker (#19814).
    • Ray Placement Groups are used for scheduling the training workers (#20091).
    • PACK strategy is used by default but can be changed by setting the TRAIN_ENABLE_WORKER_SPREAD environment variable.
    • Automatically unwrap Torch DDP model and convert to CPU when saving a model as checkpoint (#20333).

    🔨Fixes:

    • Fix HorovodBackend to automatically detect NICs- thanks @tgaddair! (#19533).

    📖Documentation:

    • Denote public facing APIs with beta stability (#20378)
    • Doc updates (#20271)

    Serve

    We would love to hear from you! Fill out the Ray Serve survey here.

    🔨Fixes:

    • Serve deployment functions or classes can take no parameters (#19708)
    • Replica slow start message is improved. You can now see whether it is slow to allocate resources or slow to run constructor. (#19431)
    • pip install ray[serve] will now install ray[default] as well. (#19570)

    🏗 Architecture refactoring:

    • The terminology of “backend” and “endpoint” are officially deprecated in favor of “deployment”. (#20229, #20085, #20040, #20020, #19997, #19947, #19923, #19798).
    • Progress towards Java API compatibility (#19463).

    Dashboard

    • Ray Dashboard is now enabled on Windows! (#19575)

    Thanks

    Many thanks to all those who contributed to this release! @krfricke, @stefanbschneider, @ericl, @nikitavemuri, @qicosmos, @worldveil, @triciasfu, @AmeerHajAli, @javi-redondo, @architkulkarni, @pdames, @clay4444, @mGalarnyk, @liuyang-my, @matthewdeng, @suquark, @rkooo567, @mwtian, @chenk008, @dependabot[bot], @iycheng, @jiaodong, @scv119, @oscarknagg, @Rohan138, @stephanie-wang, @Zyiqin-Miranda, @ijrsvt, @roireshef, @tkaymak, @simon-mo, @ashione, @jovany-wang, @zenoengine, @tgaddair, @11rohans, @amogkam, @zhisbug, @lchu-ibm, @shrekris-anyscale, @pcmoritz, @yiranwang52, @mattip, @sven1977, @Yard1, @DmitriGekhtman, @ckw017, @WangTaoTheTonic, @wuisawesome, @kcpevey, @kfstorm, @rhamnett, @renos, @TeoZosa, @SongGuyang, @clarkzinzow, @avnishn, @iasoon, @gjoliver, @jjyao, @xwjiang2010, @dmatrix, @edoakes, @czgdp1807, @heng2j, @sungho-joo, @lixin-wei

    Source code(tar.gz)
    Source code(zip)
  • ray-1.8.0(Nov 2, 2021)

    Highlights

    • Ray SGD has been rebranded to Ray Train! The new documentation landing page can be found here.
    • Ray Datasets is now in beta! The beta release includes a new integration with Ray Train yielding scalable ML ingest for distributed training. Check out the docs here, try it out for your ML ingest and batch inference workloads, and let us know how it goes!
    • This Ray release supports Apple Silicon (M1 Macs). Check out the installation instructions for more information!

    Ray Autoscaler

    🎉 New Features:

    • Fake multi-node mode for autoscaler testing (#18987)

    💫Enhancements:

    • Improve unschedulable task warning messages by integrating with the autoscaler (#18724)

    Ray Client

    💫Enhancements

    • Use async rpc for remote call and actor creation (#18298)

    Ray Core

    💫Enhancements

    • Eagerly install job-level runtime_env (#19449, #17949)

    🔨 Fixes:

    • Fixed resource demand reporting for infeasible 1-CPU tasks (#19000)
    • Fixed printing Python stack trace in Python worker (#19423)
    • Fixed macOS security popups (#18904)
    • Fixed thread safety issues for coreworker (#18902, #18910, #18913 #19343)
    • Fixed placement group performance and resource leaking issues (#19277, #19141, #19138, #19129, #18842, #18652)
    • Improve unschedulable task warning messages by integrating with the autoscaler (#18724)
    • Improved Windows support (#19014, #19062, #19171, #19362)
    • Fix runtime_env issues (#19491, #19377, #18988)

    Ray Data

    Ray Datasets is now in beta! The beta release includes a new integration with Ray Train yielding scalable ML ingest for distributed training. It supports repeating and rewindowing pipelines, zipping two pipelines together, better cancellation of Datasets workloads, and many performance improvements. Check out the docs here, try it out for your ML ingest and batch inference workloads, and let us know how it goes!

    🎉 New Features:

    • Ray Train integration (#17626)
    • Add support for repeating and rewindowing a DatasetPipeline (#19091)
    • .iter_epochs() API for iterating over epochs in a DatasetPipeline (#19217)
    • Add support for zipping two datasets together (#18833)
    • Transformation operations are now cancelled when one fails or the entire workload is killed (#18991)
    • Expose from_pandas()/to_pandas() APIs that accept/return plain Pandas DataFrames (#18992); see the sketch after this list.
    • Customize compression, read/write buffer size, metadata, etc. in the IO layer (#19197)
    • Add spread resource prefix for manual round-robin resource-based task load balancing
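
    A minimal round-trip sketch for the plain-DataFrame from_pandas()/to_pandas() APIs noted above; the DataFrame contents are illustrative.

    import pandas as pd
    import ray

    ray.init()
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    ds = ray.data.from_pandas(df)        # a plain DataFrame is now accepted
    round_tripped = ds.to_pandas()       # a plain DataFrame is returned
    print(round_tripped.equals(df))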

    💫Enhancements:

    • Minimal rows are now dropped when doing an equalized split (#18953)
    • Parallelized metadata fetches when reading Parquet datasets (#19211)

    🔨 Fixes:

    • Tensor columns now properly support table slicing (#19534)
    • Prevent Datasets tasks from being captured by Ray Tune placement groups (#19208)
    • Empty datasets are properly handled in most transformations (#18983)

    🏗 Architecture refactoring:

    • Tensor dataset representation changed to a table with a single tensor column (#18867)

    RLlib

    🎉 New Features:

    • Allow n-step > 1 and prioritized replay for R2D2 and RNNSAC agents. (#18939)

    🔨 Fixes:

    • Fix memory leaks in TF2 eager mode. (#19198)
    • Faster worker spaces inference if specified through configuration. (#18805)
    • Fix bug for complex obs spaces containing Box([2D shape]) and discrete components. (#18917)
    • Torch multi-GPU stats not protected against race conditions. (#18937)
    • Fix SAC agent with dict space. (#19101)
    • Fix A3C/IMPALA in multi-agent setting. (#19100)

    🏗 Architecture refactoring:

    • Unify results dictionary returned from Trainer.train() across agents regardless of (tf or pytorch, multi-agent, multi-gpu, or algos that use >1 SGD iterations, e.g. ppo) (#18879)

    Ray Workflow

    🎉 New Features:

    • Introduce workflow.delete (#19178)

    🔨Fixes:

    • Fix the bug which allowed a workflow step to be executed multiple times (#19090)

    🏗 Architecture refactoring:

    • Object reference serialization is decoupled from workflow storage (#18328)

    Tune

    🎉 New Features:

    • PBT: Add burn-in period (#19321)

    💫Enhancements:

    • Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once (#19144)
    • Use queue to display JupyterNotebookReporter updates in Ray client (#19137)
    • Add resume="AUTO" and enhance resume error messages (#19181)
    • Provide information about resource deadlocks, early stopping in Tune docs (#18947)
    • Fix HEBOSearch installation docs (#18861)
    • OptunaSearch: check compatibility of search space with evaluated_rewards (#18625)
    • Add save and restore methods for searchers that were missing it & test (#18760)
    • Add documentation for reproducible runs (setting seeds) (#18849)
    • Deprecate max_concurrent in TuneBOHB (#18770)
    • Add on_trial_result to ConcurrencyLimiter (#18766)
    • Ensure arguments passed to tune remote_run match (#18733)
    • Only disable ipython in remote actors (#18789)

    🔨Fixes:

    • Only try to sync driver if sync_to_driver is actually enabled (#19589)
    • sync_client: Fix delete template formatting (#19553)
    • Force no result buffering for hyperband schedulers (#19140)
    • Exclude trial checkpoints in experiment sync (#19185)
    • Fix how durable trainable is retained in global registry (#19223, #19184)
    • Ensure loc column in progress reporter is filled (#19182)
    • Deflake PBT Async test (#19135)
    • Fix Analysis.dataframe() documentation and enable passing of mode=None (#18850)

    Ray Train (SGD)

    Ray SGD has been rebranded to Ray Train! The new documentation landing page can be found here. Ray Train is integrated with Ray Datasets for distributed data loading while training, documentation available here.

    🎉 New Features:

    • Ray Datasets Integration (#17626)

    🔨Fixes:

    • Improved support for multi-GPU training (#18824, #18958)
    • Make actor creation async (#19325)

    📖Documentation:

    • Rename Ray SGD v2 to Ray Train (#19436)
    • Added migration guide from Ray SGD v1 (#18887)

    Serve

    🎉 New Features:

    • Add ability to recover from a checkpoint on cluster failure (#19125)
    • Support kwargs to deployment constructors (#19023)
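
    A small, hedged sketch of passing keyword arguments to a deployment constructor; the deployment class and the threshold argument are made up.

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    class Model:
        def __init__(self, threshold=0.5):
            self.threshold = threshold

        def __call__(self, request):
            return {"threshold": self.threshold}

    # Keyword arguments are now forwarded to the deployment constructor.
    Model.deploy(threshold=0.7)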

    🔨Fixes:

    • Fix asyncio compatibility issue (#19298)
    • Catch spurious ConnectionErrors during shutdown (#19224)
    • Fix error with uris=None in runtime_env (#18874)
    • Fix shutdown logic with exit_forever (#18820)

    🏗 Architecture refactoring:

    • Progress towards Serve autoscaling (#18793, #19038, #19145)
    • Progress towards Java support (#18630)
    • Simplifications for long polling (#19154, #19205)

    Dashboard

    🎉 New Features:

    • Basic support for the dashboard on Windows (#19319)

    🔨Fixes:

    • Fix healthcheck issue causing the dashboard to crash under load (#19360)
    • Work around aiohttp 4.0.0+ issues (#19120)

    🏗 Architecture refactoring:

    • Improve dashboard agent retry logic (#18973)

    Thanks

    Many thanks to all those who contributed to this release! @rkooo567, @lchu-ibm, @scv119, @pdames, @suquark, @antoine-galataud, @sven1977, @mvindiola1, @krfricke, @ijrsvt, @sighingnow, @marload, @jmakov, @clay4444, @mwtian, @pcmoritz, @iycheng, @ckw017, @chenk008, @jovany-wang, @jjyao, @hauntsaninja, @franklsf95, @jiaodong, @wuisawesome, @odp, @matthewdeng, @duarteocarmo, @czgdp1807, @gjoliver, @mattip, @richardliaw, @max0x7ba, @Jasha10, @acxz, @xwjiang2010, @SongGuyang, @simon-mo, @zhisbug, @ccssmnn, @Yard1, @hazeone, @o0olele, @froody, @robertnishihara, @amogkam, @sasha-s, @xychu, @lixin-wei, @architkulkarni, @edoakes, @clarkzinzow, @DmitriGekhtman, @avnishn, @liuyang-my, @stephanie-wang, @Chong-Li, @ericl, @juliusfrost, @carlogrisetti

    Source code(tar.gz)
    Source code(zip)
  • ray-1.7.0(Oct 7, 2021)

    Highlights

    • Ray SGD v2 is now in alpha! The v2 version introduces APIs that focus on ease of use and composability. Check out the docs here, and the migration guide from v1 to v2 here.
      • If you are using Ray SGD v2, we’d love to hear your feedback here!
    • Ray Workflows is now in alpha! Check out the docs here and try it out for your large-scale data science, ML, and long-running business workflows. Thanks to our early adopters for the feedback so far and the ongoing contributions from IBM Research.
    • We have made major enhancements to C++ API! While we are still busy hardening the feature for production usage, please check out the docs here, try it out, and help provide feedback!

    Ray Autoscaler

    💫Enhancements:

    • Improvement to logging and code structure #18180
    • Default head node type to 0 max_workers #17757
    • Modifications to accommodate custom node providers #17312

    🔨 Fixes:

    • Helm chart configuration fixes #17678 #18123
    • GCP autoscaler config fix #18653
    • Allow attaching to uninitialized head node for debugging #17688
    • Syncing files with Docker head node fixed #16515

    Ray Client

    🎉 New Features:

    • ray.init() args can be forwarded to remote server (#17776); see the sketch after this list.
    • Allow multiple client connections from one driver (#17942)
    • gRPC channel credentials can now be configured from ray.init (#18425, #18365)
    • Ray Client will attempt to recover connections on certain gRPC failures (#18329)
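
    A minimal sketch of connecting with Ray Client and forwarding init args to the remote server; the address is a placeholder for a running cluster with the client server enabled, and namespace is just one example of a forwarded argument.

    import ray

    # Placeholder address; namespace is forwarded to the remote server.
    ray.init("ray://<head-node-ip>:10001", namespace="my_namespace")

    @ray.remote
    def ping():
        return "pong"

    print(ray.get(ping.remote()))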

    💫Enhancements

    • Less confusing client RPC errors (#18278)
    • Use a single RPC to fetch ClientObjectRefs passed in a list (#16944)
    • Increase timeout for ProxyManager.get_channel (#18350)

    🔨 Fixes:

    • Fix mismatched debug log ID formats (#17597)
    • Fix confusing error messages when client scripts exit (#17969)

    Ray Core

    🎉 New Features:

    • Major enhancements in the C++ API!
      • This API library enables you to build a C++ distributed system easily, just like the Python API and the Java API.
      • Run pip install -U ray[cpp] to install Ray with C++ API support.
      • Run ray cpp --help to learn how to use it.
      • For more details, check out the docs here and see the tab “C++”.

    🔨 Fixes:

    • Bug fixes for thread-safety / reference count issues / placement group (#18401, #18746, #18312, #17802, #18526, #17863, #18419, #18463, #18193, #17774, #17772, #17670, #17620, #18584, #18646, #17634, #17732)
    • Better format for object loss errors / task & actor logs (#18742, #18577, #18105, #18292, #17971, #18166)
    • Improved the ray status output for placement groups (#18289, #17892)
    • Improved the function export performance (#18284)
    • Support more Ray core metrics such as RPC call latencies (#17578)
    • Improved error messages and logging for runtime environments (#18451, #18092, #18088, #18084, #18496, #18083)

    Ray Data Processing

    🎉 New Features:

    • Add support for reading partitioned Parquet datasets (#17716)
    • Add dataset unioning (#17793)
    • Add support for splitting a dataset at row indices (#17990); a short sketch follows this list.
    • Add from_numpy() and to_numpy() APIs (#18146)
    • Add support for splitting a dataset pipeline at row indices (#18243)
    • Add Modin integration (from_modin() and to_modin()) (#18122)
    • Add support for datasets with tensor columns (#18301)
    • Add RayDP (Spark-on-Ray) integration (from_spark() and to_spark()) (#17340)
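
    A short sketch of splitting a Dataset at row indices, as referenced above; the dataset and indices are illustrative.

    import ray

    ray.init()
    ds = ray.data.range(1000)
    # Split into three datasets at row indices 250 and 750.
    left, middle, right = ds.split_at_indices([250, 750])
    print(left.count(), middle.count(), right.count())  # 250 500 250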

    💫Enhancements

    • Drop empty tables when reading Parquet fragments in order to properly support filter expressions when reading partitioned Parquet datasets (#18098)
    • Retry application-level errors in Datasets (#18296)
    • Create a directory on write if it doesn’t exist (#18435)
    • URL encode paths if they are URLs (#18440)
    • Guard against a dataset pipeline being read multiple times on accident (#18682)
    • Reduce working set size during random shuffles by eagerly destroying intermediate datasets (#18678)
    • Add manual round-robin resource-based load balancing option to read and shuffle stages (#18678)

    🔨 Fixes:

    • Fix JSON writing so IO roundtrip works (#17691)
    • Fix schema subsetting on column selection during Parquet reads (#18361)
    • Fix Dataset.iter_batches() dropping batches when prefetching (#18441)
    • Fix filesystem inference on path containing space (#18644)

    🏗 Architecture refactoring:

    • Port write side of IO layer to use file-based datasources (#18135)

    RLlib

    🎉 New Features:

    • Replay buffers: Add config option to store contents in checkpoints (store_buffer_in_checkpoints=True). (#17999)
    • Add support for multi-GPU to DDPG. (#17789)

    💫Enhancements:

    • Support for running evaluation and training in parallel, thereby only evaluating as many episodes as the training loop takes (evaluation_num_episodes=”auto”). (#18380)
    • Enhanced stability: Started nightly multi-GPU (2) learning tests for most algos (tf + torch), including LSTM and attention net setups.

    🏗 Architecture refactoring:

    • Make MultiAgentEnv inherit gym.Env to avoid direct class type manipulation (#18156)
    • SampleBatch: Add support for nested data (+Docstring- and API cleanups). (#17485)
    • Add policies arg to callback: on_episode_step (already exists in all other episode-related callbacks) (#18119)
    • Add worker arg (optional) to policy_mapping_fn. (#18184)

    🔨 Fixes:

    • Fix Atari learning test regressions (2 bugs) and 1 minor attention net bug. (#18306)
    • Fix n-step > 1 postprocessing bug (issues 17844, 18034). (#18358)
    • Fix crash when using StochasticSampling exploration (most PG-style algos) w/ tf and numpy version > 1.19.5 (#18366)
    • Strictly run evaluation_num_episodes episodes each evaluation run (no matter the other eval config settings). (#18335)
    • Issue 17706: AttributeError: 'numpy.ndarray' object has no attribute 'items' on certain turn-based MultiAgentEnvs with Dict obs space. (#17735)
    • Issue 17900: Set seed in single vectorized sub-envs properly, if num_envs_per_worker > 1 (#18110)
    • Fix R2D2 (torch) multi-GPU issue. (#18550)
    • Fix final_scale's default value to 0.02 (see OrnsteinUhlenbeck exploration). (#18070)
    • Ape-X doesn't take the value of prioritized_replay into account (#17541)
    • Issue 17653: Torch multi-GPU (>1) broken for LSTMs. (#17657)
    • Issue 17667: CQL-torch + GPU not working (due to simple_optimizer=False; must use simple optimizer!). (#17742)
    • Add locking to PolicyMap in case it is accessed by a RolloutWorker and the same worker's AsyncSampler or the main LearnerThread. (#18444)
    • Other fixes and enhancements: #18591, #18381, #18670, #18705, #18274, #18073, #18017, #18389, #17896, #17410, #17891, #18368, #17778, #18494, #18466, #17705, #17690, #18254, #17701, #18544, #17889, #18390, #18428, #17821, #17955, #17666, #18423, #18040, #17867, #17583, #17822, #18249, #18155, #18065, #18540, #18367, #17960, #17895, #18467, #17928, #17485, #18307, #18043, #17640, #17702, #15849, #18340

    Tune

    💫Enhancements:

    • Usability improvements when trials appear to be stuck in PENDING state forever when the cluster has insufficient resources. (#18611, #17957, #17533)
    • Searchers and Tune Callbacks now have access to some experiment settings information. (#17724, #17794)
    • Improve HyperOpt KeyError message when metric was not found. (#18549)
    • Allow users to configure bootstrap for docker syncer. (#17786)
    • Allow users to update trial resources on resume. (#17975)
    • Add max_concurrent_trials argument to tune.run. (#17905)
    • Type hint TrialExecutor. Use Abstract Base Class. (#17584)
    • Add developer/stability annotations. (#17442)

    🔨Fixes:

    • Placement group stability issues. (#18706, #18391, #18338)
    • Fix a DurableTrainable checkpointing bug. (#18318)
    • Fix a trial reset bug if a RLlib algorithm with default resources is used. (#18209)
    • Fix hyperopt points to evaluate for nested lists. (#18113)
    • Correctly validate initial points for random search. (#17282)
    • Fix local mode. Add explicit concurrency limiter for local mode. (#18023)
    • Sanitize trial checkpoint filename. (#17985)
    • Explicitly instantiate skopt categorical spaces. (#18005)

    SGD (v2)

    Ray SGD v2 is now in Alpha! The v2 version introduces APIs that focus on ease of use and composability. Check out the docs here, and the migration guide from v1 to v2 here. If you are using Ray SGD v2, we’d love to hear your feedback here!

    🎉 New Features:

    • Ray SGD v2
      • Horovod Backend (#18047)
      • JSON Callback (#17619) and Tensorboard Callback (#17824)
      • Checkpointing Support (#17632, #17807)
      • Fault Tolerance (#18090)
      • Integration with Ray Tune (#17839, #18179)
      • Custom resources per worker (#18327)
      • Low-level Stateful Class API (#18728)

    Serve

    ↗️Deprecation and API changes:

    • serve.start(http_host=..., http_port=..., http_middlewares=...) has been deprecated since Ray 1.2.0. These arguments are now removed in favor of serve.start(http_options={"host": ..., "port": ..., "middlewares": ...}). (#17762) A minimal sketch of the new form follows this list.
    • Remove deprecated ServeRequest API (#18120)
    • Remove deprecated endpoints API (#17989)
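
    A minimal sketch of the replacement call referenced above; the host and port values are illustrative.

    import ray
    from ray import serve

    ray.init()
    # Replaces the removed serve.start(http_host=..., http_port=...) form.
    serve.start(http_options={"host": "127.0.0.1", "port": 8000})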

    🎉 New Features:

    • Serve checkpoint with cluster failure recovery from disk and S3 (#17622, #18293, #18657)

    🔨Fixes:

    • Better serve constructor failure handling (#16922, #18402)
    • Fix get_handle execution from threads (#18198)
    • Remove requirement to specify namespace for serve.start(detached=True) (#17470)

    🏗 Architecture refactoring:

    • Progress towards replica autoscaling (#18658)

    Dashboard

    🎉 New Features:

    • Ray system events are now published in experimental dashboard (#18330, #18698)
    • Actor page will now show actors with PENDING_CREATION status (#18666)

    Thanks

    Many thanks to all those who contributed to this release! @scottsun94, @hngenc, @iycheng, @asm582, @jkterry1, @ericl, @thomasdesr, @ryanlmelvin, @ellimac54, @Bam4d, @gjoliver, @juliusfrost, @simon-mo, @ashione, @RaphaelCS, @simonsays1980, @suquark, @jjyao, @lixin-wei, @77loopin, @Ivorforce, @DmitriGekhtman, @dependabot[bot], @souravraha, @robertnishihara, @richardliaw, @SongGuyang, @rkooo567, @edoakes, @jsuarez5341, @zhisbug, @clarkzinzow, @triciasfu, @architkulkarni, @akern40, @liuyang-my, @krfricke, @amogkam, @Jingyu-Peng, @xwjiang2010, @nikitavemuri, @hauntsaninja, @fyrestone, @navneet066, @ijrsvt, @mwtian, @sasha-s, @raulchen, @holdenk, @qicosmos, @Yard1, @yuduber, @mguarin0, @MissiontoMars, @stephanie-wang, @stefanbschneider, @sven1977, @AmeerHajAli, @matthewdeng, @chenk008, @jiaodong, @clay4444, @ckw017, @tchordia, @ThomasLecat, @Chong-Li, @jmakov, @jovany-wang, @tdhopper, @kfstorm, @wgifford, @mxz96102, @WangTaoTheTonic, @lada-kunc, @scv119, @kira-lin, @wuisawesome

    Source code(tar.gz)
    Source code(zip)
  • ray-1.6.0(Aug 23, 2021)

    Highlights

    • Runtime Environments are ready for general use! This feature enables you to dynamically specify per-task, per-actor and per-job dependencies, including a working directory, environment variables, pip packages and conda environments. Install it with pip install -U 'ray[default]'.
    • Ray Dataset is now in alpha! Dataset is an interchange format for distributed datasets, powered by Arrow. You can also use it for a basic Ray native data processing experience. Check it out here.
    • Ray Lightning v0.1 has been released! You can install it via pip install ray-lightning. Ray Lightning is a library of PyTorch Lightning plugins for distributed training using Ray.
    • pip install ray now has a significantly reduced set of dependencies. Features such as the dashboard, the cluster launcher, runtime environments, and observability metrics may require pip install -U 'ray[default]' to be enabled. Please report any issues on Github if this is an issue!

    Ray Autoscaler

    🎉 New Features:

    • The Ray autoscaler now supports TPUs on GCP. Please refer to this example for spinning up a simple TPU cluster. (#17278)

    💫Enhancements:

    • Better AWS networking configurability (#17236 #17207 #14080)
    • Support for running autoscaler without NodeUpdaters (#17194, #17328)

    🔨 Fixes:

    • Code clean up and corrections to downscaling policy (#17352)
    • Docker file sync fix (#17361)

    Ray Client

    💫Enhancements:

    • Updated docs for client server ports and ray.init(ray://) (#17003, #17333)
    • Better error handling for deserialization failures (#17035)

    🔨 Fixes:

    • Fix for server proxy not working with non-default redis passwords (#16885)

    Ray Core

    🎉 New Features:

    • Runtime Environments are ready for general use! (A short per-task sketch follows this list.)
      • Specify a working directory to upload your local files to all nodes in your cluster.
      • Specify different conda and pip dependencies for your tasks and actors and have them installed on the fly.
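
    A per-task sketch of the runtime_env feature described above; the pip package and environment variable are illustrative, and the env_vars field is assumed to be available alongside pip/conda/working_dir.

    import ray

    ray.init()

    @ray.remote(runtime_env={"pip": ["requests"], "env_vars": {"MY_FLAG": "1"}})
    def check_env():
        import os
        import requests  # installed on the fly for this task
        return os.environ.get("MY_FLAG")

    print(ray.get(check_env.remote()))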

    🔨 Fixes:

    • Fix plasma store bugs for better data processing stability (#16976, #17135, #17140, #17187, #17204, #17234, #17396, #17550)
    • Fix a placement group bug where CUDA_VISIBLE_DEVICES were not properly detected (#17318)
    • Improved Ray stacktrace messages. (#17389)
    • Improved GCS stability and scalability (#17456, #17373, #17334, #17238, #17072)

    🏗 Architecture refactoring:

    • Plasma store refactor for better testability and extensibility. (#17332, #17313, #17307)

    Ray Data Processing

    Ray Dataset is now in alpha! Dataset is an interchange format for distributed datasets, powered by Arrow. You can also use it for a basic Ray native data processing experience. Check it out here.

    RLlib

    🎉 New Features:

    • Support for RNN/LSTM models with SAC (new agent: "RNNSAC"). Shoutout to ddworak94! (#16577)
    • Support for ONNX model export (tf and torch). (#16805)
    • Allow Policies to be added to/removed from a Trainer on-the-fly. (#17566)

    🔨 Fixes:

    • Fix for view requirements captured during compute actions test pass. Shoutout to Chris Bamford (#15856)

    • Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use ray.get_gpu_ids() (b/c no GPUs assigned by ray). (#17444)

    • Other bug fixes: #15709, #15911, #16083, #16716, #16744, #16896, #16999, #17010, #17014, #17118, #17160, #17315, #17321, #17335, #17341, #17356, #17460, #17543, #17567, #17587

    🏗 Architecture refactoring:

    • CV2 to Skimage dependency change (CV2 still supported). Shoutout to Vince Jankovics. (#16841)
    • Unify tf and torch policies wrt. multi-GPU handling: PPO-torch is now 33% faster on Atari and 1 GPU. (#17371)
    • Implement all policy maps inside RolloutWorkers to be LRU-caches so that a large number of policies can be added on-the-fly w/o running out of memory. (#17031)
    • Move all tf static-graph code into DynamicTFPolicy, such that policies can be deleted and their tf-graph is GC'd. (#17169)
    • Simplify multi-agent configs: In most cases, creating dummy envs (only to retrieve spaces) are no longer necessary. (#16565, #17046)

    📖Documentation:

    • Examples scripts do-over (shoutout to Stefan Schneider for this initiative).
    • Example script: League-based self-play with "open spiel" env. (#17077)
    • Other doc improvements: #15664 (shoutout to kk-55), #17030, #17530

    Tune

    🎉 New Features:

    • Dynamic trial resource allocation with ResourceChangingScheduler (#16787)
    • It is now possible to use a define-by-run function to generate a search space with OptunaSearcher (#17464)

    💫Enhancements:

    • String names of searchers/schedulers can now be used directly in tune.run (#17517)
    • Filter placement group resources if not in use (progress reporting) (#16996)
    • Add unit tests for flatten_dict (#17241)

    🔨Fixes:

    • Fix HDFS sync down template (#17291)
    • Re-enable TensorboardX without Torch installed (#17403)

    📖Documentation:

    • LightGBM integration (#17304)
    • Other documentation improvements: #17407 (shoutout to amavilla), #17441, #17539, #17503

    SGD

    🎉 New Features:

    • We have started initial development on a new RaySGD v2! We will be rolling it out in a future version of Ray. See the documentation here. (#17536, #17623, #17357, #17330, #17532, #17440, #17447, #17300, #17253)

    💫Enhancements:

    • Placement Group support for TorchTrainer (#17037)

    Serve

    🎉 New Features:

    • Add Ray API stability annotations to Serve, marking many serve.* APIs as Stable (#17295)
    • Support runtime_env's working_dir for Ray Serve (#16480)

    🔨Fixes:

    • Fix FastAPI's response_model not added to class based view routes (#17376)
    • Replace backend with deployment in metrics & logging (#17434)

    🏗Stability Enhancements:

    • Run Ray Serve with multi & single deployment large scale (1K+ cores) test running nightly (#17310, #17411, #17368, #17026, #17277)

    Thanks

    Many thanks to all who contributed to this release:

    @suquark, @xwjiang2010, @clarkzinzow, @kk-55, @mGalarnyk, @pdames, @Souphis, @edoakes, @sasha-s, @iycheng, @stephanie-wang, @antoine-galataud, @scv119, @ericl, @amogkam, @ckw017, @wuisawesome, @krfricke, @vakker, @qingyun-wu, @Yard1, @juliusfrost, @DmitriGekhtman, @clay4444, @mwtian, @corentinmarek, @matthewdeng, @simon-mo, @pcmoritz, @qicosmos, @architkulkarni, @rkooo567, @navneet066, @dependabot[bot], @jovany-wang, @kombuchafox, @thomasjpfan, @kimikuri, @Ivorforce, @franklsf95, @MissiontoMars, @lantian-xu, @duburcqa, @ddworak94, @ijrsvt, @sven1977, @kira-lin, @SongGuyang, @kfstorm, @Rohan138, @jamesmishra, @amavilla, @fyrestone, @lixin-wei, @stefanbschneider, @jiaodong, @richardliaw, @WangTaoTheTonic, @chenk008, @Catch-Bull, @Bam4d

    Source code(tar.gz)
    Source code(zip)
  • ray-1.5.2(Aug 12, 2021)

  • ray-1.5.1(Jul 31, 2021)

  • ray-1.5.0(Jul 26, 2021)

    Ray 1.5.0 Release Note

    Highlights

    • Ray Datasets is now in alpha (https://docs.ray.io/en/master/data/dataset.html)
    • LightGBM on Ray is now in beta (https://github.com/ray-project/lightgbm_ray).
      • enables multi-node and multi-GPU training
      • integrates seamlessly with distributed hyperparameter optimization library Ray Tune
      • comes with fault tolerance handling mechanisms, and
      • supports distributed dataframes and distributed data loading

    Ray Autoscaler

    🎉 New Features:

    • Aliyun support (#15712)

    💫 Enhancements:

    • [Kubernetes] Operator refactored to use Kopf package (#15787)
    • Flag to control config bootstrap for rsync (#16667)
    • Prometheus metrics for Autoscaler (#16066, #16198)
    • Allow launching in subnets where public IP assignment is off by default (#16816)

    🔨 Fixes:

    • [Kubernetes] Fix GPU=0 resource handling (#16887)
    • [Kubernetes] Release docs updated with K8s test instructions (#16662)
    • [Kubernetes] Documentation update (#16570)
    • [Kubernetes] All official images set to rayproject/ray:latest (#15988 #16205)
    • [Local] Fix bootstrapping ray at a given static set of ips (#16202, #16281)
    • [Azure] Fix Azure Autoscaling Failures (#16640)
    • Handle node type key change / deletion (#16691)
    • [GCP] Retry GCP BrokenPipeError (#16952)

    Ray Client

    🎉 New Features:

    • Client integrations with major Ray Libraries (#15932, #15996, #16103, #16034, #16029, #16111, #16301)
    • Client Connect now returns a context that has a disconnect method and can be used as a context manager (#16021)

    💫 Enhancements:

    • Better support for multi-threaded client-side applications (#16731, #16732)
    • Improved error messages and warnings when misusing Ray Client (#16454, #16508, #16588, #16163)
    • Made Client Object & Actor refs a subclass of their non-client counterparts (#16110)

    🔨 Fixes:

    • dir() Works for client-side Actor Handles (#16157)
    • Avoid server-side time-outs (#16554)
    • Various fixes to the client-server proxy (#16040, #16038, #16057, #16180)

    Ray Core

    🎉 New Features:

    • Ray dataset alpha is available!

    🔨 Fixes:

    • Fix various Ray IO layer issues that caused hanging & high memory usage (#16408, #16422, #16620, #16824, #16791, #16487, #16407, #16334, #16167, #16153, #16314, #15955, #15775)
    • Namespace now properly isolates placement groups (#16000)
    • More efficient object transfer for spilled objects (#16364, #16352)

    🏗 Architecture refactoring:

    • From Ray 1.5.0, the liveness of Ray jobs is guaranteed as long as there is enough disk space on each machine, thanks to the “fallback allocator” mechanism, which allocates plasma objects directly on disk when they cannot be created in memory or spilled to disk.

    RLlib

    🎉 New Features:

    • Support for adding/deleting Policies to a Trainer on-the-fly (#16359, #16569, #16927).
    • Added new “input API” for customizing offline datasets (shoutout to Julius F.). (#16957)
    • Allow for external env PolicyServer to listen on n different ports (given n rollout workers); No longer require creating an env on the server side to get env’s spaces. (#16583).

    🔨 Fixes:

    • CQL: Bug fixes and clean-ups (fixed iteration count). (#16531, #16332)
    • D4RL: #16721
    • ensure curiosity exploration actions are passed in as tf tensors (shoutout to Manny V.). (#15704)
    • Other bug fixes and cleanups: #16162 and #16309 (shoutout to Chris B.), #15634, #16133, #16860, #16813, #16428, #16867, #16354, #16218, #16118, #16429, #16427, #16774, #16734, #16019, #16171, #16830, #16722

    📖 Documentation and testing:

    • #16311, #15908, #16271, #16080, #16740, #16843

    🏗 Architecture refactoring:

    • All RLlib algos operating on Box action spaces now operate on normalized actions by default (ranging from -1.0 to 1.0). This enables PG-style algos to learn in skewed action spaces. (#16531)

    Tune

    🎉 New Features:

    • New integration with LightGBM via Tune callbacks (#16713).
    • New cost-efficient HPO searchers (BlendSearch and CFO) available from the FLAML library (https://github.com/microsoft/FLAML). (#16329)

    💫 Enhancements:

    • Pass in configurations that have already been evaluated separately to Searchers. This is useful for warm-starting or for meta-searchers, for example (#16485)
    • Sort trials in reporter table by metric (#16576)
    • Add option to keep random values constant over grid search (#16501)
    • Read trial results from json file (#15915)

    🔨 Fixes:

    • Fix infinite loop when using Searcher that limits concurrency internally in conjunction with a ConcurrencyLimiter (#16416)
    • Allow custom sync configuration with DurableTrainable (#16739)
    • Logger fixes. W&B: #16806, #16674, #16839. MLflow: #16840
    • Various bug fixes: #16844, #16017, #16575, #16675, #16504, #15811, #15899, #16128, #16396, #16695, #16611

    📖 Documentation and testing:

    • Use BayesOpt for quick start example (#16997)
    • #16793, #16029, #15932, #16980, #16450, #16709, #15913, #16754, #16619

    SGD

    🎉 New Features:

    • Torch native mixed precision is now supported! (#16382)

    🔨 Fixes:

    • Use target label count for training batch size (#16400)

    📖 Documentation and testing:

    • #15999, #16111, #16301, #16046

    Serve

    💫 Enhancements:

    • UX improvements (#16227, #15909)
    • Improved logging (#16468)

    🔨 Fixes:

    • Fix shutdown logic (#16524)
    • Assorted bug fixes (#16647, #16760, #16783)

    📖 Documentation and testing:

    • #16042, #16631, #16759, #16786

    Thanks

    Many thanks to all who contributed to this release:

    @Tonyhao96, @simon-mo, @scv119, @Yard1, @llan-ml, @xcharleslin, @jovany-wang, @ijrsvt, @max0x7ba, @annaluo676, @rajagurunath, @zuston, @amogkam, @yorickvanzweeden, @mxz96102, @chenk008, @Bam4d, @mGalarnyk, @kfstorm, @crdnb, @suquark, @ericl, @marload, @jiaodong, @thexiang, @ellimac54, @qicosmos, @mwtian, @jkterry1, @sven1977, @howardlau1999, @mvindiola1, @stefanbschneider, @juliusfrost, @krfricke, @matthewdeng, @zhuangzhuang131419, @brandonJY, @Eleven1Liu, @nikitavemuri, @richardliaw, @iycheng, @stephanie-wang, @HuangLED, @clarkzinzow, @fyrestone, @asm582, @qingyun-wu, @ckw017, @yncxcw, @DmitriGekhtman, @benjamindkilleen, @Chong-Li, @kathryn-zhou, @pcmoritz, @rodrigodelazcano, @edoakes, @dependabot[bot], @pdames, @frenkowski, @loicsacre, @gabrieleoliaro, @achals, @thomasjpfan, @rkooo567, @dibgerge, @clay4444, @architkulkarni, @lixin-wei, @ConeyLiu, @WangTaoTheTonic, @AnnaKosiorek, @wuisawesome, @gramhagen, @zhisbug, @franklsf95, @vakker, @jenhaoyang, @liuyang-my, @chaokunyang, @SongGuyang, @tgaddair

  • ray-1.4.1(Jun 30, 2021)

    Release 1.4.1 Notes

    Ray Python Wheels

    Python 3.9 wheels (Linux / MacOS / Windows) are available (#16347 #16586)

    Ray Autoscaler

    🔨 Fixes: On-prem bug resolved (#16281)

    Ray Client

    💫Enhancements:

    • Add warnings when many tasks scheduled (#16454)
    • Better error messages (#16163)

    🔨 Fixes:

    • Fix gRPC Timeout Options (#16554)
    • Disconnect on dataclient error (#16588)

    Ray Core

    🔨 Fixes:

    • Runtime Environments: Fix race condition leading to failed imports (#16278)
    • Don't broadcast empty resources data (#16104)
    • Fix async actor lost object bug (#16414)
    • Always report job timestamps in milliseconds (#16455, #16545, #16548)
    • Multi-node placement group and job config bug fixes (#16345)
    • Fix bug in task dependency management for duplicate args (#16365)
    • Unify Python and core worker ids (#16712)

    Dask

    💫Enhancements: Dask 2021.06.1 support (#16547)

    Tune

    💫Enhancements: Support object refs in with_params (#16753)

    Serve

    🔨Fixes: Ray serve shutdown goes through Serve controller (#16524)

    Java

    🔨Fixes: Upgrade dependencies to fix CVEs (#16650, #16657)

    Documentation

    • Runtime Environments (#16290)
    • Feature contribution [Tune] (#16477)
    • Ray design patterns and anti-patterns (#16478)
    • PyTorch Lightning (#16484)
    • Ray Client (#16497)
    • Ray Deployment (#16538)
    • Dask version compatibility (#16595)

    CI

    Move wheel and Docker image upload from Travis to Buildkite (#16138 #16241)

    Thanks

    Many thanks to all those who contributed to this release!

    @rkooo567, @clarkzinzow, @WangTaoTheTonic, @ckw017, @stephanie-wang, @Yard1, @mwtian, @jovany-wang, @jiaodong, @wuisawesome, @krfricke, @architkulkarni, @ijrsvt, @simon-mo, @DmitriGekhtman, @amogkam, @richardliaw

  • ray-1.4.0(Jun 7, 2021)

    Release 1.4.0 Notes

    Ray Autoscaler

    🎉 New Features:

    • Support Helm Chart for deploying Ray on Kubernetes
    • Key Autoscaler metrics are now exported via Prometheus!

    💫Enhancements

    • Better error messages when a node fails to come online

    🔨 Fixes:

    • Stability and interface fixes for Kubernetes deployments.
    • Fixes to Azure NodeProvider

    Ray Client

    🎉 New Features:

    • Complete API parity with non-client mode
    • Experimental ClientBuilder API (docs here); see the connection sketch after this list
    • Full Asyncio support
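
    A minimal connection sketch for the experimental ClientBuilder API mentioned above; the address and port are placeholders, and the exact builder methods are an assumption about this experimental interface:

    import ray

    # Connect this local script to a remote cluster via Ray Client.
    ray.client("127.0.0.1:10001").connect()

    @ray.remote
    def where_am_i():
        import socket
        return socket.gethostname()

    # The task runs on the remote cluster rather than on the local machine.
    print(ray.get(where_am_i.remote()))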

    💫Enhancements

    • Keep-alive messages for long-lived connections
    • Improved pickling error messages

    🔨 Fixes:

    • Client Disconnect can be called multiple times
    • Client Reference Equality Check
    • Many bug fixes and tests for the complete ray API!

    Ray Core

    🎉 New Features:

    • Namespaces (check out the docs)! Note: this may be a breaking change if you’re using detached actors (set ray.init(namespace="") for backwards-compatible behavior). A minimal usage sketch follows below.
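
    A minimal usage sketch; the namespace and actor names below are illustrative:

    import ray

    # Everything created by this driver lives in the "app_1" namespace.
    ray.init(namespace="app_1")

    @ray.remote
    class Registry:
        def __init__(self):
            self.items = []

        def add(self, item):
            self.items.append(item)

    # A named, detached actor is visible only to jobs in the same namespace;
    # drivers that connect with a different namespace will not see it.
    Registry.options(name="registry", lifetime="detached").remote()

    # Look the actor up by name from within the same namespace.
    handle = ray.get_actor("registry")
    ray.get(handle.add.remote("first item"))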

    🔨 Fixes:

    • Support increment by arbitrary number with ray.util.metrics.Counter (see the sketch after this list)
    • Various bug fixes for the placement group APIs including the GPU assignment bug (#15049).
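
    For the Counter change, a small sketch using ray.util.metrics; the metric name, description, and tag below are made up for illustration:

    import ray
    from ray.util import metrics

    ray.init()

    # A custom application-level counter exported alongside Ray's own metrics.
    bytes_processed = metrics.Counter(
        "bytes_processed",
        description="Total number of bytes processed",
        tag_keys=("stage",),
    )

    # Increment by an arbitrary amount rather than only by one.
    bytes_processed.inc(4096, tags={"stage": "ingest"})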

    🏗 Architecture refactoring:

    • Increase the efficiency and robustness of resource reporting

    Ray Data Processing

    🔨 Fixes:

    • Various bug fixes for better stability (#16063, #14821, #15669, #15757, #15431, #15426, #15034, #15071, #15070, #15008, #15955)
    • Fixed a critical bug where the driver used excessive memory when there were many objects in the cluster (#14322).
    • Dask on Ray and Modin can now be run with Ray client

    🏗 Architecture refactoring:

    • Ray 100TB shuffle results: https://github.com/ray-project/ray/issues/15770
    • More robust memory management subsystem is in progress (#15157, #15027)

    RLlib

    🎉 New Features:

    • PyTorch multi-GPU support (#14709, #15492, #15421).
    • CQL TensorFlow support (#15841).
    • Task-settable Env/Curriculum Learning API (#15740).
    • Support for native tf.keras Models (no ModelV2 required) (#14684, #15273).
    • Trainer.train() and Trainer.evaluate() can run in parallel (optional) (#15040, #15345).

    💫Enhancements and documentation:

    • CQL: Bug fixes and confirmed MuJoCo benchmarks (#15814, #15603, #15761).
    • Example for differentiable neural computer (DNC) network (#14844, 15939).
    • Added support for int-Box action spaces. (#15012)
    • DDPG/TD3/A[23]C/MARWIL/BC: Code cleanup and type annotations. (#14707).
    • Example script for restoring 1 agent out of n
    • Examples for fractional GPU usage. (15334)
    • Enhanced documentation page describing example scripts and blog posts (15763).
    • Various enhancements/test coverage improvements: 15499, 15454, 15335, 14865, 15525, 15290, 15611, 14801, 14903, 15735, 15631

    🔨 Fixes:

    • Memory Leak in multi-agent environment (#15815). Shoutout to Bam4d!
    • DDPG PyTorch GPU bug. (#16133)
    • Simple optimizer should not be used by default for tf+MA (#15365)
    • Various bug fixes: #15762, 14843, 15042, 15427, 15871, 15132, 14840, 14386, 15014, 14737, 15015, 15733, 15737, 15736, 15898, 16118, 15020, 15218, 15451, 15538, 15610, 15326, 15295, 15762, 15436, 15558, 15937

    🏗 Architecture refactoring:

    • Remove atari dependency (#15292).
    • Trainer._evaluate() renamed to Trainer.evaluate() (backward compatible); Trainer.evaluate() can be called even w/o evaluation worker set, if create_env_on_driver=True (#15591).

    Tune

    🎉 New Features:

    • ASHA scheduler now supports save/restore. (#15438)
    • Add HEBO to search algorithm shim function (#15468)
    • Add SkoptSearcher/Bayesopt Searcher restore functionality (#15075)

    💫Enhancements:

    • We now document scalability best practices (k8s, scalability thresholds). You can find this here (#14566)
    • You can now set the result buffer_length via tune.run - this helps with trials that report too frequently. (#15810)
    • Support numpy types in TBXlogger (#15760)
    • Add max_concurrent option to BasicVariantGenerator (#15680)
    • Add seed parameter to OptunaSearch (#15248)
    • Improve BOHB/ConfigSpace dependency check (#15064)

    🔨Fixes:

    • Reduce default number of maximum pending trials to max(16, cluster_cpus) (#15628)
    • Return normalized checkpoint path (#15296)
    • Escape paths before globbing in TrainableUtil.get_checkpoints_paths (#15368)
    • Optuna Searcher: Set correct Optuna TrialState on trial complete (#15283)
    • Fix type annotation in tune.choice (#15038)
    • Avoid system exit error by using del when cleaning up actors (#15687)

    Serve

    🎉 New Features:

    • As of Ray 1.4, Serve has a new API centered around the concept of “Deployments.” Deployments offer a more streamlined API and can be declaratively updated, which should improve both development and production workflows. The existing APIs are unchanged in Ray 1.4 and will continue to work until Ray 1.5, at which point they will be removed (see the package reference if you’re not sure about a specific API). Please see the migration guide for details on how to update your existing Serve application to use this new API. A minimal sketch of the new API appears after this list.
    • New serve.deployment API: @serve.deployment, serve.get_deployments, serve.list_deployments (#14935, #15172, #15124, #15121, #14953, #15152, #15821)
    • New serve.ingress(fastapi_app) API (#15445, 15441, 14858)
    • New @serve.batch decorator in favor of legacy max_batch_size in backend config (#15065)
    • serve.start() is now idempotent (#15148)
    • Added support for handle.method_name.remote() (#14831)
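
    The sketch below illustrates the deployment-centric workflow described above; the handler and the options shown (num_replicas, route_prefix) are illustrative assumptions rather than a definitive example:

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment(num_replicas=2, route_prefix="/hello")
    class Hello:
        def __call__(self, request):
            return "Hello from Ray Serve!"

    # Declaratively create the deployment (or update it on a later redeploy).
    Hello.deploy()

    print(serve.list_deployments())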

    🔨Fixes:

    • Rolling updates for redeployments (#14803)
    • Latency improvement by using pickle (#15945)
    • Controller and HTTP proxy uses num_cpus=0 by default (#15000)
    • Health checking in the controller instead of using max_restarts (#15047)
    • Use longest prefix matching for path routing (#15041)

    Dashboard

    🔨Fixes:

    • Add object store memory column (#15697)
    • Add object store stats to dashboard API. (#15677)
    • Remove disk data from the dashboard when running on K8s. (#14676)
    • Fix reported dashboard ip when using 0.0.0.0 (#15506)

    Thanks

    Many thanks to all those who contributed to this release!

    @clay4444, @Fabien-Couthouis, @mGalarnyk, @smorad, @ckw017, @ericl, @antoine-galataud, @pleiadesian, @DmitriGekhtman, @robertnishihara, @Bam4d, @fyrestone, @stephanie-wang, @kfstorm, @wuisawesome, @rkooo567, @franklsf95, @micahtyong, @WangTaoTheTonic, @krfricke, @hegdeashwin, @devin-petersohn, @qicosmos, @edoakes, @llan-ml, @ijrsvt, @richardliaw, @Sertingolix, @ffbin, @simjay, @AmeerHajAli, @simon-mo, @tom-doerr, @sven1977, @clarkzinzow, @mxz96102, @SebastianBo1995, @amogkam, @iycheng, @sumanthratna, @Catch-Bull, @pcmoritz, @architkulkarni, @stefanbschneider, @tgaddair, @xcharleslin, @cthoyt, @fcardoso75, @Jeffwan, @mvindiola1, @michaelzhiluo, @rlan, @mwtian, @SongGuyang, @YeahNew, @kathryn-zhou, @rfali, @jennakwon06, @Yeachan-Heo

  • ray-1.3.0(Apr 22, 2021)

    Release v1.3.0 Notes

    Highlights

    • We are now testing and publishing Ray's scalability limits with each release, see: https://github.com/ray-project/ray/tree/releases/1.3.0/benchmarks
    • Ray Client is now usable by default with any Ray cluster started by the Ray Cluster Launcher.

    Ray Cluster Launcher

    💫Enhancements:

    • Observability improvements (#14816, #14608)
    • Worker nodes no longer killed on autoscaler failure (#14424)
    • Better validation for min_workers and max_workers (#13779)
    • Auto detect memory resource for AWS and K8s (#14567)
    • On autoscaler failure, propagate error message to drivers (#14219)
    • Avoid launching GPU nodes when the workload only has CPU tasks (#13776)
    • Autoscaler/GCS compatibility (#13970, #14046, #14050)
    • Testing (#14488, #14713)
    • Migration of configs to multi-node-type format (#13814, #14239)
    • Better config validation (#14244, #13779)
    • Node-type max workers defaults to infinity (#14201)

    🔨 Fixes:

    • AWS configuration (#14868, #13558, #14083, #13808)
    • GCP configuration (#14364, #14417)
    • Azure configuration (#14787, #14750, #14721)
    • Kubernetes (#14712, #13920, #13720, #14773, #13756, #14567, #13705, #14024, #14499, #14593, #14655)
    • Other (#14112, #14579, #14002, #13836, #14261, #14286, #14424, #13727, #13966, #14293, #14293, #14718, #14380, #14234, #14484)

    Ray Client

    💫Enhancements:

    • Version checks for Python and client protocol (#13722, #13846, #13886, #13926, #14295)
    • Validate server port number (#14815)
    • Enable Ray client server by default (#13350, #13429, #13442)
    • Disconnect ray upon client deactivation (#13919)
    • Convert Ray objects to Ray client objects (#13639)
    • Testing (#14617, #14813, #13016, #13961, #14163, #14248, #14630, #14756, #14786)
    • Documentation (#14422, #14265)

    🔨 Fixes:

    • Hook runtime context (#13750)
    • Fix mutual recursion (#14122)
    • Set gRPC max message size (#14063)
    • Monitor stream errors (#13386)
    • Fix dependencies (#14654)
    • Fix ray.get ctrl-c (#14425)
    • Report error deserialization errors (#13749)
    • Named actor refcounting fix (#14753)
    • RayTaskError serialization (#14698)
    • Multithreading fixes (#14701)

    Ray Core

    🎉 New Features:

    • We are now testing and publishing Ray's scalability limits with each release. Check out https://github.com/ray-project/ray/tree/releases/1.3.0/benchmarks.
    • [alpha] Ray-native Python-based collective communication primitives for Ray clusters with distributed CPUs or GPUs.

    🔨 Fixes:

    • Ray now builds with C++14.
    • Fixed high CPU usage breaking raylets with missed-heartbeat errors (#13963, #14301)
    • Fixed high CPU issues from raylet during object transfer (#13724)
    • Improvement in placement group APIs including better Java support (#13821, #13858, #13582, #15049, #13821)

    Ray Data Processing

    🎉 New Features:

    • Object spilling is turned on by default. Check out the documentation.
    • Dask-on-Ray and Spark-on-Ray are fully ready to use. Please try them out and give us feedback!
    • Dask-on-Ray is now compatible with Dask 2021.4.0.
    • Dask-on-Ray now works natively with dask.persist().

    🔨 Fixes:

    • Various improvements in object spilling and memory management layer to support large scale data processing (#13649, #14149, #13853, #13729, #14222, #13781, #13737, #14288, #14578, #15027)
    • The lru_evict flag is now deprecated; the recommended solution is to use object spilling.

    🏗 Architecture refactoring:

    • Various architectural improvements in object spilling and memory management. For more details, check out the whitepaper.
    • Locality-aware scheduling is turned on by default.
    • Moved from centralized GCS-based object directory protocol to decentralized owner-to-owner protocol, yielding better cluster scalability.

    RLlib

    🎉 New Features:

    • R2D2 implementation for torch and tf. (#13933)
    • PlacementGroup support (all RLlib algos now return PlacementGroupFactory from Trainer.default_resource_request). (#14289)
    • Multi-GPU support for tf-DQN/PG/A2C. (#13393)

    💫Enhancements:

    • Documentation: Update documentation for Curiosity's support of continuous actions (#13784); CQL documentation (#14531)
    • Attention-wrapper works with images and supports prev-n-actions/rewards options. (#14569)
    • rllib rollout runs in parallel by default via Trainer’s evaluation worker set. (#14208)
    • Add env rendering (customizable) and video recording options (for non-local mode; >0 workers; +evaluation-workers) and episode media logging. (#14767, #14796)
    • Allow SAC to use custom models as Q- or policy nets and deprecate "state-preprocessor" for image spaces. (#13522)
    • Example Scripts: Add coin game env + matrix social dilemma env + tests and examples (shoutout to Maxime Riché!). (#14208); Attention net (#14864); Serve + RLlib. (#14416); Env seed (#14471); Trajectory view API (enhancements and tf2 support). (#13786); Tune trial + checkpoint selection. (#14209)
    • DDPG: Add support for simplex action space. (#14011)
    • Others: on_learn_on_batch callback allows custom metrics. (#13584); Add TorchPolicy.export_model(). (#13989)

    🔨 Fixes:

    • Trajectory View API bugs (#13646, #14765, #14037, #14036, #14031, #13555)
    • Test cases (#14620, #14450, #14384, #13835, #14357, #14243)
    • Others (#13013, #14569, #13733, #13556, #13988, #14737, #14838, #15272, #13681, #13764, #13519, #14038, #14033, #14034, #14308, #14243)

    🏗 Architecture refactoring:

    • Remove all non-trajectory view API code. (#14860)
    • Obsolete UsageTrackingDict in favor of SampleBatch. (#13065)

    Tune

    🎉 New Features:

    • We added a new searcher HEBOSearcher (#14504, #14246, #13863, #14427)
    • Tune is now natively compatible with the Ray Client (#13778, #14115, #14280)
    • Tune now uses Ray’s Placement Groups underneath the hood. This will enable much faster autoscaling and training (for distributed trials) (#13906, #15011, #14313)

    💫Enhancements:

    • Checkpointing improvements (#13376, #13767)
    • Optuna Search Algorithm improvements (#14731, #14387)
    • tune.with_parameters now works with Class API (#14532)

    🔨Fixes:

    • BOHB & Hyperband fixes (#14487, #14171)
    • Nested metrics improvements (#14189, #14375, #14379)
    • Fix non-deterministic category sampling (#13710)
    • Type hints (#13684)
    • Documentation (#14468, #13880, #13740)
    • Various issues and bug fixes (#14176, #13939, #14392, #13812, #14781, #14150, #14850, #14118, #14388, #14152, #13825, #13936)

    SGD

    • Add fault tolerance during worker startup (#14724)

    Serve

    🎉 New Features:

    • Added metadata to default logger in backend replicas (#14251)
    • Added more metrics for ServeHandle stats (#13640)
    • Deprecated system-level batching in favor of @serve.batch (#14610, #14648)
    • Beta support for Serve with Ray client (#14163)
    • Use placement groups to bypass autoscaler throttling (#13844)
    • Deprecate client-based API in favor of process-wide singleton (#14696)
    • Add initial support for FastAPI ingress (#14754)

    🔨 Fixes:

    • Fix ServeHandle serialization (#13695)

    🏗 Architecture refactoring:

    • Refactor BackendState to support backend versioning and add more unit testing (#13870, #14658, #14740, #14748)
    • Optimize long polling to be per-key (#14335)

    Dashboard

    🎉 New Features:

    • Dashboard now supports being served behind a reverse proxy. (#14012)
    • Disk and network metrics are added to prometheus. (#14144)

    💫Enhancements:

    • Better CPU & memory information on K8s. (#14593, #14499)
    • Progress towards a new scalable dashboard. (#13790, #11667, #13763,#14333)

    Thanks

    Many thanks to all those who contributed to this release: @geraint0923, @iycheng, @yurirocha15, @brian-yu, @harryge00, @ijrsvt, @wumuzi520, @suquark, @simon-mo, @clarkzinzow, @RaphaelCS, @FarzanT, @ob, @ashione, @ffbin, @robertnishihara, @SongGuyang, @zhe-thoughts, @rkooo567, @Ezra-H, @acxz, @clay4444, @QuantumMecha, @jirkafajfr, @wuisawesome, @Qstar, @guykhazma, @devin-petersohn, @jeroenboeye, @ConeyLiu, @dependabot[bot], @fyrestone, @micahtyong, @javi-redondo, @Manuscrit, @mxz96102, @EscapeReality846089495, @WangTaoTheTonic, @stanislav-chekmenev, @architkulkarni, @Yard1, @tchordia, @zhisbug, @Bam4d, @niole, @yiranwang52, @thomasjpfan, @DmitriGekhtman, @gabrieleoliaro, @jparkerholder, @kfstorm, @andrew-rosenfeld-ts, @erikerlandson, @Crissman, @raulchen, @sumanthratna, @Catch-Bull, @chaokunyang, @krfricke, @raoul-khour-ts, @sven1977, @kathryn-zhou, @AmeerHajAli, @jovany-wang, @amogkam, @antoine-galataud, @tgaddair, @randxie, @ChaceAshcraft, @ericl, @cassidylaidlaw, @TanjaBayer, @lixin-wei, @lena-kashtelyan, @cathrinS, @qicosmos, @richardliaw, @rmsander, @jCrompton, @mjschock, @pdames, @barakmich, @michaelzhiluo, @stephanie-wang, @edoakes

  • ray-1.2.0(Feb 13, 2021)

    Release v1.2.0 Notes

    Highlights

    • Ray client is now in beta! Check out more details here: https://docs.ray.io/en/master/ray-client.html
    • XGBoost-Ray is now in beta! Check out more details about this project at https://github.com/ray-project/xgboost_ray.
    • Check out the Serve migration guide: https://docs.google.com/document/d/1CG4y5WTTc4G_MRQGyjnb_eZ7GK3G9dUX6TNLKLnKRAc/edit
    • Ray’s C++ support is now in beta: https://docs.ray.io/en/master/#getting-started-with-ray
    • An alpha version of object spilling is now available: https://docs.ray.io/en/master/memory-management.html#object-spilling

    Ray Autoscaler

    🎉 New Features:

    • A new autoscaler output format in monitor.log (#12772, #13561)
    • Piping autoscaler events to driver logs (#13434)

    💫Enhancements

    • Full support of ray.autoscaler.sdk.request_resources() API (https://docs.ray.io/en/master/cluster/autoscaling.html?highlight=request_resources#ray.autoscaler.sdk.request_resources) .
    • Make placement groups bypass max launch limit (#13089)
    • [K8s] Retry getting home directory in command runner. (#12925)
    • [docker] Pull if image is not present (#13136)
    • [Autoscaler] Ensure ubuntu is owner of docker host mount folder (#13579)

    🔨 Fixes:

    • Many autoscaler bug fixes (#12952, #12689, #13058, #13671, #13637, #13588, #13505, #13154, #13151, #13138, #13008, #12980, #12918, #12829, #12714, #12661, #13567, #13663, #13623, #13437, #13498, #13472, #13392, #12514, #13325, #13161, #13129, #12987, #13410, #12942, #12868, #12866, #12865, #12098, #12609)

    RLlib

    🎉 New Features:

    • Fast Attention Nets (using the trajectory view API) (#12753).
    • Attention Nets: Full PyTorch support (#12029).
    • Attention Nets: Support auto-wrapping around default or custom models by specifying “use_attention=True” in the model’s config. This now works completely analogously to “use_lstm=True”. (#11698)
    • New Offline RL Algorithm: CQL (based on SAC) (#13118).
    • MAML: Discrete actions support (added CartPole mass test case).
    • Support Atari framestacking via the trajectory view API (#13315).
    • Support for D4RL environments/benchmarks (#13550).
    • Preliminary work on JAX support (#13077, #13091).

    💫 Enhancements:

    • Rollout lengths: Allow unit to be configured as “agent_steps” in multi-agent settings (default: “env_steps”) (#12420).
    • TFModelV2: Soft-deprecate register_variables and unify var names wrt TorchModelV2 (#13339, #13363).

    📖 Documentation:

    • Added documentation on Model building API (#13260, #13261).
    • Added documentation for the trajectory view API. (#12718)
    • Added documentation for SlateQ (#13266).
    • Readme.md documentation for almost all algorithms in rllib/agents (#12943, #13035).
    • Type annotations for the “rllib/execution” folder (#12760, #13036).

    🔨 Fixes:

    • MARWIL and BC: Add grad-clipping config option to stabilize learning (#13455).
    • A3C: Solve PyTorch- and TF-eager async race condition between calling model and its value function (#13467).
    • Various issue and bug fixes (#12619, #12682, #12704, #12706, #12708, #12765, #12786, #12787, #12793, #12832, #12844, #12846, #12915, #12941, #13039, #13040, #13064, #13083, #13121, #13126, #13237, #13238, #13308, #13332, #13397, #13459, #13553).

    🏗 Architecture refactoring:

    • The env directory has been cleaned up and is now divided into a core part (rllib/env) with all basic env classes, and rllib/env/wrappers containing third-party wrapper classes (Atari, Unity3D, etc.) (#13082).

    Tune

    💫 Enhancements

    • Ray Tune now uses ray.cloudpickle underneath the hood, allowing you to checkpoint large models (>4GB) (#12958).
    • Using the 'reuse_actors' flag can now speed up training for general Trainable API usage. (#13549)
    • Ray Tune will now automatically buffer results from trainables, allowing you to use an arbitrary reporting frequency on your training functions. (#13236)
    • Ray Tune now has a variety of experiment stoppers (#12750)
    • Ray Tune now supports an integer loguniform search space distribution (#12994)
    • Ray Tune now has an initial support for the Ray placement group API. (#13370)
    • The Weights and Bias integration (WandbLogger) now also accepts wandb.data_types.Video (#13169)
    • The Hyperopt integration (HyperoptSearch) can now directly accept category variables instead of indices (#12715)
    • Ray Tune now supports experiment checkpointing when using grid search (#13357)

    🔨Fixes and Updates

    • The Optuna integration was updated to support the 2.4.0 API while maintaining backwards compatibility (#13631)
    • All search algorithms now support points_to_evaluate (#12790, #12916)
    • PBT Transformers example was updated and improved (#13174, #13131)
    • The scikit-optimize integration was improved (#12970)
    • Various bug fixes (#13423, #12785, #13171, #12877, #13255, #13355)

    SGD

    🔨Fixes and Updates

    • Fix Docstring for as_trainable (#13173)
    • Fix process group timeout units (#12477)
    • Disable Elastic Training by default when using with Tune (#12927)

    Serve

    🎉 New Features:

    • Ray Serve backends now accept a Starlette request object instead of a Flask request object (#12852). This is a breaking change, so please read the migration guide.
    • Ray Serve backends now have the option of returning a Starlette Response object (#12811, #13328). This allows for more customizable responses, including responses with custom status codes.
    • [Experimental] The new Ray Serve MLflow plugin makes it easy to deploy your MLflow models on Ray Serve. It comes with a Python API and a command-line interface.
    • Using “ImportedBackend” you can now specify a backend based on a class that is installed in the Python environment that the workers will run in, even if the Python environment of the driver script (the one making the Serve API calls) doesn’t have it installed (#12923).

    💫 Enhancements:

    • Dependency management using conda no longer requires the driver script to be running in an activated conda environment (#13269).
    • Ray ObjectRef can now be used as argument to serve_handle.remote(...). (#12592)
    • Backends are now shut down gracefully. You can set the graceful timeout in BackendConfig. (#13028)

    📖 Documentation:

    • A tutorial page has been added for integrating Ray Serve with your existing FastAPI web server or with your existing AIOHTTP web server (#13127).
    • Documentation has been added for Ray Serve metrics (#13096).
  • ray-1.1.0(Dec 24, 2020)

    Ray 1.1.0

    Ray Core

    🎉 New Features:

    • Progress towards supporting a Ray client
    • Descendant tasks are cancelled when the calling task is cancelled

    🔨 Fixes:

    • Improved object broadcast robustness
    • Improved placement group support

    🏗 Architecture refactoring:

    • Progress towards the new scheduler backend

    RLlib

    🎉 New Features:

    • SUMO simulator integration (rllib/examples/simulators/sumo/). Huge thanks to Lara Codeca! (#11710)
    • SlateQ Algorithm added for PyTorch. Huge thanks to Henry Chen! (#11450)
    • MAML extension for all Models, except recurrent ones. (#11337)
    • Curiosity Exploration Module for tf1.x/2.x/eager. (#11945)
    • Minimal JAXModelV2 example. (#12502)

    🔨 Fixes:

    • Fix RNN learning for tf2.x/eager. (#11720)
    • LSTM prev-action/prev-reward settable separately and prev-actions are now one-hot’d. (#12397)
    • PyTorch LR schedule not working. (#12396)
    • Various PyTorch GPU bug fixes. (#11609)
    • SAC loss not using prio. replay weights in critic’s loss term. (#12394)
    • Fix epsilon-greedy Exploration for nested action spaces. (#11453)

    🏗 Architecture refactoring:

    • Trajectory View API on by default (faster PG-type algos by ~20% (e.g. PPO on Atari)). (#11717, #11826, #11747, and #11827)

    Tune

    🎉 New Features:

    • Loggers can now be passed as objects to tune.run. The new ExperimentLogger abstraction was introduced for all loggers, making it much easier to configure logging behavior. (#11984, #11746, #11748, #11749)
    • The tune verbosity was refactored into four levels: 0: Silent, 1: Only experiment-level logs, 2: General trial-level logs, 3: Detailed trial-level logs (default) (#11767, #12132, #12571)
    • Docker and Kubernetes autoscaling environments are detected automatically, automatically utilizing the correct checkpoint/log syncing tools (#12108)
    • Trainables can now easily leverage Tensorflow DistributedStrategy! (#11876)

    💫 Enhancements

    • Introduced a new serialization debugging utility (#12142)
    • Added a new lightweight Pytorch-lightning example (#11497, #11585)
    • The BOHB search algorithm can be seeded with a random state (#12160)
    • The default anonymous metrics can be used automatically if a mode is set in tune.run (#12159).
    • Added HDFS as Cloud Sync Client (#11524)
    • Added xgboost_ray integration (#12572)
    • Tune search spaces can now be passed to search algorithms on initialization, not only via tune.run (#11503)
    • Refactored and added examples (#11931)
    • Callable accepted for register_env (#12618)
    • Tune search algorithms can handle/ignore infinite and NaN numbers (#11835)
    • Improved scalability for experiment checkpointing (#12064)
    • Nevergrad now supports points_to_evaluate (#12207)
    • Placement group support for distributed training (#11934)

    🔨 Fixes:

    • Fixed with_parameters behavior to avoid serializing large data in scope (#12522)
    • TBX logger supports None (#12262)
    • Better error when metric or mode unset in search algorithms (#11646)
    • Better warnings/exceptions for fail_fast='raise' (#11842)
    • Removed some bottlenecks in trialrunner (#12476)
    • Fix file descriptor leak by syncer and Tensorboard (#12590, #12425)
    • Fixed validation for search metrics (#11583)
    • Fixed hyperopt randint limits (#11946)

    Serve

    🎉 New Features:

    • You can start backends in different conda environments! See more in the dependency management doc. (#11743)
    • You can add an optional reconfigure method to your Servable to allow reconfiguring backend replicas at runtime. (#11709)

    🔨Fixes:

    • Set serve.start(http_host=None) to disable HTTP servers. If you are only using ServeHandle, this option lowers resource usage. (#11627)
    • Flask requests will no longer create reference cycles. This means peak memory usage should be lower for high traffic scenarios. (#12560)

    🏗 Architecture refactoring:

    • Progress towards a goal state driven Serve controller. (#12369,#11792,#12211,#12275,#11533,#11822,#11579,#12281)
    • Progress towards faster and more efficient ServeHandles. (#11905, #12019, #12093)

    Ray Cluster Launcher (Autoscaler)

    🎉 New Features:

    • A new Kubernetes operator: https://docs.ray.io/en/master/cluster/k8s-operator.html

    💫 Enhancements

    • Containers do not run with root user as the default (#11407)
    • SHM-Size is auto-populated when using the containers (#11953)

    🔨 Fixes:

    • Many autoscaler bug fixes (#11677, #12222, #11458, #11896, #12123, #11820, #12513, #11714, #12512, #11758, #11615, #12106, #11961, #11674, #12028, #12020, #12316, #11802, #12131, #11543, #11517, #11777, #11810, #11751, #12465, #11422)

    SGD

    🎉 New Features:

    • Easily customize your torch.DistributedDataParallel configurations by passing in a ddp_args field into TrainingOperator.register (#11771).

    🔨 Fixes:

    • TorchTrainer now properly scales up to more workers if more resources become available (#12562)

    📖 Documentation:

    • The new callback API for using Ray SGD with Tune is now documented (#11479)
    • Pytorch Lightning + Ray SGD integration is now documented (#12440)

    Dashboard

    🔨 Fixes:

    • Fixed bug that prevented viewing the logs for cluster workers
    • Fixed bug that caused "Logical View" page to crash when opening a list of actors for a given class.

    🏗 Architecture refactoring:

    • Dashboard runs on a new backend architecture that is more scalable and well-tested. The dashboard should work on ~100 node clusters now, and we're working on lifting scalability constraints to support even larger clusters.

    Thanks

    Many thanks to all those who contributed to this release: @bartbroere, @SongGuyang, @gramhagen, @richardliaw, @ConeyLiu, @weepingwillowben, @zhongchun, @ericl, @dHannasch, @timurlenk07, @kaushikb11, @krfricke, @desktable, @bcahlit, @rkooo567, @amogkam, @micahtyong, @edoakes, @stephanie-wang, @clay4444, @ffbin, @mfitton, @barakmich, @pcmoritz, @AmeerHajAli, @DmitriGekhtman, @iamhatesz, @raulchen, @ingambe, @allenyin55, @sven1977, @huyz-git, @yutaizhou, @suquark, @ashione, @simon-mo, @raoul-khour-ts, @Leemoonsoo, @maximsmol, @alanwguo, @kishansagathiya, @wuisawesome, @acxz, @gabrieleoliaro, @clarkzinzow, @jparkerholder, @kingsleykuan, @InnovativeInventor, @ijrsvt, @lasagnaphil, @lcodeca, @jiajiexiao, @heng2j, @wumuzi520, @mvindiola1, @aaronhmiller, @robertnishihara, @WangTaoTheTonic, @chaokunyang, @nikitavemuri, @kfstorm, @roireshef, @fyrestone, @viotemp1, @yncxcw, @karstenddwx, @hartikainen, @sumanthratna, @architkulkarni, @michaelzhiluo, @UWFrankGu, @oliverhu, @danuo, @lixin-wei

  • ray-1.0.1.post1(Nov 19, 2020)

    Patch release containing the following changes:

    • https://github.com/ray-project/ray/commit/bcc92f59fdcd837ccc5a560fe37bdf0619075505 Fix dashboard crashing on multi-node clusters.
    • https://github.com/ray-project/ray/pull/11600 Add the cluster_name to docker file mounts directory prefix.
  • ray-1.0.1(Nov 10, 2020)

    Ray 1.0.1

    Ray 1.0.1 is now officially released!

    Highlights

    • If you're migrating from Ray < 1.0.0, be sure to check out the 1.0 Migration Guide.
    • The autoscaler now uses Docker by default.
    • RLlib features multiple new environments.
    • Tune supports population based bandits, checkpointing in Docker, and multiple usability improvements.
    • SGD supports PyTorch Lightning
    • All of Ray's components and libraries have improved performance, scalability, and stability.

    Core

    • 1.0 Migration Guide.
    • Many bug fixes and optimizations in GCS.
    • Polishing of the Placement Group API.
    • Improved Java language support

    RLlib

    • Added documentation for Curiosity exploration module (#11066).
    • Added RecSim environment wrapper (#11205).
    • Added Kaggle’s football environment (multi-agent) wrapper (#11249).
    • Multiple bug fixes: GPU related fixes for SAC (#11298), MARWIL, all example scripts run on GPU (#11105), lifted limitation on 2^31 timesteps (#11301), fixed eval workers for ES and ARS (#11308), fixed broken no-eager-no-workers mode (#10745).
    • Support custom MultiAction distributions (#11311).
    • No environment is created on driver (local worker) if not necessary (#11307).
    • Added simple SampleCollector class for Trajectory View API (#11056).
    • Code cleanup: Docstrings and type annotations for Exploration classes (#11251), DQN (#10710), MB-MPO algorithm, SAC algorithm (#10825).

    Serve

    • API: Serve will error when serve_client is serialized. (#11181)
    • Performance: serve_client.get_handle("endpoint") will now get a handle to nearest node, increasing scalability in distributed mode. (#11477)
    • Doc: Added FAQ page and updated architecture page (#10754, #11258)
    • Testing: New distributed tests and benchmarks are added (#11386)
    • Testing: Serve now runs on Windows (#10682)

    SGD

    • Pytorch Lightning integration is now supported (#11042)
    • Support num_steps continue training (#11142)
    • Callback API for SGD+Tune (#11316)

    Tune

    • New Algorithm: Population-based Bandits (#11466)
    • tune.with_parameters(), a wrapper function to pass arbitrary objects through the object store to trainables (#11504); see the sketch after this list
    • Strict metric checking - by default, Tune will now error if a result dict does not include the optimization metric as a key. You can disable this with TUNE_DISABLE_STRICT_METRIC_CHECKING (#10972)
    • Syncing checkpoints between multiple Docker containers on a cluster is now supported with the DockerSyncer (#11035)
    • Added type hints (#10806)
    • Trials are now dynamically created (instead of created up front) (#10802)
    • Use tune.is_session_enabled() in the Function API to toggle between Tune and non-tune code (#10840)
    • Support hierarchical search spaces for hyperopt (#11431)
    • Tune function API now also supports yield and return statements (#10857)
    • Tune now supports callbacks with tune.run(callbacks=... (#11001)
    • By default, the experiment directory will be dated (#11104)
    • Tune now supports reuse_actors for function API, which can largely accelerate tuning jobs.
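
    A minimal sketch of tune.with_parameters; the dataset and training function below are made up for illustration:

    import numpy as np
    from ray import tune

    def train_fn(config, data=None):
        # `data` is shipped through the object store once instead of being
        # copied into every trial's config.
        score = float(np.mean(data)) * config["scale"]
        tune.report(score=score)

    data = np.random.rand(1_000_000)
    tune.run(
        tune.with_parameters(train_fn, data=data),
        config={"scale": tune.grid_search([0.5, 1.0, 2.0])},
    )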

    Thanks

    We thank all the contributors for their contribution to this release!

    @acxz, @Gekho457, @allenyin55, @AnesBenmerzoug, @michaelzhiluo, @SongGuyang, @maximsmol, @WangTaoTheTonic, @Basasuya, @sumanthratna, @juliusfrost, @maxco2, @Xuxue1, @jparkerholder, @AmeerHajAli, @raulchen, @justinkterry, @herve-alanaai, @richardliaw, @raoul-khour-ts, @C-K-Loan, @mattearllongshot, @robertnishihara, @internetcoffeephone, @Servon-Lee, @clay4444, @fangyeqing, @krfricke, @ffbin, @akotlar, @rkooo567, @chaokunyang, @PidgeyBE, @kfstorm, @barakmich, @amogkam, @edoakes, @ashione, @jseppanen, @ttumiel, @desktable, @pcmoritz, @ingambe, @ConeyLiu, @wuisawesome, @fyrestone, @oliverhu, @ericl, @weepingwillowben, @rkube, @alanwguo, @architkulkarni, @lasagnaphil, @rohitrawat, @ThomasLecat, @stephanie-wang, @suquark, @ijrsvt, @VishDev12, @Leemoonsoo, @scottwedge, @sven1977, @yiranwang52, @carlos-aguayo, @mvindiola1, @zhongchun, @mfitton, @simon-mo

  • ray-1.0.0(Sep 30, 2020)

    Ray 1.0

    We're happy to announce the release of Ray 1.0, an important step towards the goal of providing a universal API for distributed computing.

    To learn more about Ray 1.0, check out our blog post and whitepaper.

    Ray Core

    • The ray.init() and ray start commands have been cleaned up to remove deprecated arguments
    • The Ray Java API is now stable
    • Improved detection of Docker CPU limits
    • Add support and documentation for Dask-on-Ray and MARS-on-Ray: https://docs.ray.io/en/master/ray-libraries.html
    • Placement groups for fine-grained control over scheduling decisions: https://docs.ray.io/en/latest/placement-group.html.
    • New architecture whitepaper: https://docs.ray.io/en/master/whitepaper.html

    Autoscaler

    • Support for multiple instance types in the same cluster: https://docs.ray.io/en/master/cluster/autoscaling.html
    • Support for specifying GPU/accelerator type in @ray.remote

    Dashboard & Metrics

    • Improvements to the memory usage tab and machine view
    • The dashboard now supports visualization of actor states
    • Support for Prometheus metrics reporting: https://docs.ray.io/en/latest/ray-metrics.html

    RLlib

    • Two Model-based RL algorithms were added: MB-MPO (“Model-based meta-policy optimization”) and “Dreamer”. Both algos were benchmarked and are performing comparably to the respective papers’ reported results.
    • A “Curiosity” (intrinsic motivation) module was added via RLlib’s Exploration API and benchmarked on a sparse-reward Unity3D environment (Pyramids).
    • Added documentation for the Distributed Execution API.
    • Removed (already soft-deprecated) APIs: Model(V1) class, Trainer config keys, some methods/functions. Where you would see a warning previously when using these, there will be an error thrown now.
    • Added DeepMind Control Suite examples.

    Tune

    Breaking changes:

    • Multiple tune.run parameters have been deprecated: ray_auto_init, run_errored_only, global_checkpoint_period, with_server (#10518)
    • The tune.run arguments upload_dir, sync_to_cloud, sync_to_driver, and sync_on_checkpoint have been moved to tune.SyncConfig [docs] (#10518)

    New APIs:

    • mode, metric, time_budget parameters for tune.run (#10627, #10642)
    • Search Algorithms now share a uniform API: (#10621, #10444). You can also use the new create_scheduler/create_searcher shim layer to create search algorithms/schedulers via string, reducing boilerplate code (#10456).
    • Native callbacks for: MXNet, Horovod, Keras, XGBoost, PytorchLightning (#10533, #10304, #10509, #10502, #10220)
    • PBT runs can be replayed with PopulationBasedTrainingReplay scheduler (#9953)
    • Search Algorithms are saved/resumed automatically (#9972)
    • New Optuna Search Algorithm docs (#10044)
    • Tune now can sync checkpoints across Kubernetes pods (#10097)
    • Failed trials can be rerun with tune.run(resume="run_errored_only") (#10060)

    Other Changes:

    • Trial outputs can be saved to file via tune.run(log_to_file=...) (#9817)
    • Trial directories can be customized, and default trial directory now includes trial name (#10608, #10214)
    • Improved Experiment Analysis API (#10645)
    • Support for Multi-objective search via SigOpt Wrapper (#10457, #10446)
    • BOHB Fixes (#10531, #10320)
    • Wandb improvements + RLlib compatibility (#10950, #10799, #10680, #10654, #10614, #10441, #10252, #8521)
    • Updated documentation for FAQ, Tune+serve, search space API, lifecycle (#10813, #10925, #10662, #10576, #9713, #10222, #10126, #9908)

    RaySGD:

    • Creator functions are subsumed by the TrainingOperator API (#10321)
    • Training happens on actors by default (#10539)

    Serve

    • serve.client API makes it easy to appropriately manage lifetime for multiple Serve clusters. (#10460)
    • Serve APIs are fully typed. (#10205, #10288)
    • Backend configs are now typed and validated via Pydantic. (#10559, #10389)
    • Progress towards application level backend autoscaler. (#9955, #9845, #9828)
    • New architecture page in documentation. (#10204)

    Thanks

    We thank all the contributors for their contribution to this release!

    @MissiontoMars, @ijrsvt, @desktable, @kfstorm, @lixin-wei, @Yard1, @chaokunyang, @justinkterry, @pxc, @ericl, @WangTaoTheTonic, @carlos-aguayo, @sven1977, @gabrieleoliaro, @alanwguo, @aryairani, @kishansagathiya, @barakmich, @rkube, @SongGuyang, @qicosmos, @ffbin, @PidgeyBE, @sumanthratna, @yushan111, @juliusfrost, @edoakes, @mehrdadn, @Basasuya, @icaropires, @michaelzhiluo, @fyrestone, @robertnishihara, @yncxcw, @oliverhu, @yiranwang52, @ChuaCheowHuan, @raphaelavalos, @suquark, @krfricke, @pcmoritz, @stephanie-wang, @hekaisheng, @zhijunfu, @Vysybyl, @wuisawesome, @sanderland, @richardliaw, @simon-mo, @janblumenkamp, @zhuohan123, @AmeerHajAli, @iamhatesz, @mfitton, @noahshpak, @maximsmol, @weepingwillowben, @raulchen, @09wakharet, @ashione, @henktillman, @architkulkarni, @rkooo567, @zhe-thoughts, @amogkam, @kisuke95, @clarkzinzow, @holli, @raoul-khour-ts

  • ray-0.8.7(Aug 13, 2020)

    Highlight

    • Ray is moving towards 1.0! It has had several important naming changes.
      • ObjectIDs are now called ObjectRefs because they are not just IDs.
      • The Ray Autoscaler is now called the Ray Cluster Launcher. The autoscaler will be a module of the Ray Cluster Launcher.
    • The Ray Cluster Launcher now has a much cleaner and concise output style. Try it out with ray up --log-new-style. The new output style will be enabled by default (with opt-out) in a later release.
    • Windows is now officially supported by RLlib. Multi node support for Windows is still in progress.

    Cluster Launcher/CLI (formerly autoscaler)

    • Highlight: This release contains a new colorful, concise output style for ray up and ray down, available with the --log-new-style flag. It will be enabled by default (with opt-out) in a later release. Full output style coverage for Cluster Launcher commands will also be available in a later release. (#9322, #9943, #9960, #9690)
    • Documentation improvements (with guides and new sections) (#9687)
    • Improved Cluster launcher docker support (#9001, #9105, #8840)
    • Ray now has Docker images available on Docker hub. Please check out the ray image (#9732, #9556, #9458, #9281)
    • Azure improvements (#8938)
    • Improved on-prem cluster autoscaler (#9663)
    • Add option for continuous sync of file mounts (#9544)
    • Add ray status debug tool and ray --version (#9091, #8886).
    • ray memory now also supports redis_password (#9492)
    • Bug fixes for the Kubernetes cluster launcher mode (#9968)
    • Various improvements: disabling the cluster config cache (#8117), Python API requires keyword arguments (#9256), removed fingerprint checking for SSH (#9133), Initial support for multiple worker types (#9096), various changes to the internal node provider interface (#9340, #9443)

    Core

    • Support Python type checking for Ray tasks (#9574)
    • Rename ObjectID => ObjectRef (#9353)
    • New GCS Actor manager on by default (#8845, #9883, #9715, #9473, #9275)
    • Work towards placement groups (#9039)
    • Plasma store process is merged with raylet (#8939, #8897)
    • Option to automatically reconstruct objects stored in plasma after a failure. See the documentation for more information. (#9394, #9557, #9488)
    • Many bug fixes.

    RLlib

    • New algorithm: “Model-Agnostic Meta-Learning” (MAML). An algo that learns and generalizes well across a distribution of environments.
    • New algorithm: “Model-Based Meta-Policy-Optimization” (MB-MPO). Our first model-based RL algo.
    • Windows is now officially supported by RLlib.
    • Native TensorFlow 2.x support. Use framework=”tf2” in your config to tap into TF2’s full potential. Also: SAC, DDPG, DQN Rainbow, ES, and ARS now run in TF1.x Eager mode.
    • DQN PyTorch support for full Rainbow setup (including distributional DQN).
    • Python type hints for Policy, Model, Offline, Evaluation, and Env classes.
    • Deprecated “Policy Optimizer” package (in favor of new distributed execution API).
    • Enhanced test coverage and stability.
    • Flexible multi-agent replay modes and replay_sequence_length. We now allow storing sequences (over time) in replay buffers and retrieving “lock-stepped” multi-agent samples.
    • Environments: Unity3D soccer game (tuned example/benchmark) and DM Control Suite wrapper and examples.
    • Various Bug fixes: QMIX not learning, DDPG torch bugs, IMPALA learning rate updates, PyTorch custom loss, PPO not learning MuJoCo due to action clipping bug, DQN w/o dueling layer error.

    Tune

    • API Changes:
      • The Tune Function API now supports checkpointing and is now usable with all search and scheduling algorithms! (#8471, #9853, #9517)
      • The Trainable class API has renamed many of its methods to be public (#9184)
    • You can now stop experiments upon convergence with Bayesian Optimization (#8808)
    • DistributedTrainableCreator, a simple wrapper for distributed parameter tuning with multi-node DistributedDataParallel models (#9550, #9739)
    • New integration and tutorial for using Ray Tune with Weights and Biases (Logger and native API) (#9725)
    • Tune now provides a Scikit-learn compatible wrapper for hyperparameter tuning (#9129)
    • New tutorials for integrations like XGBoost (#9060), multi GPU PyTorch (#9338), PyTorch Lightning (#9151, #9451), and Huggingface-Transformers (#9789)
    • CLI Progress reporting improvements (#8802, #9537, #9525)
    • Various bug fixes: handling of NaN values (#9381), Tensorboard logging improvements (#9297, #9691, #8918), enhanced cross-platform compatibility (#9141), re-structured testing (#9609), documentation reorganization and versioning (#9600, #9427, #9448)

    RaySGD

    • Variable worker CPU requirements (#8963)
    • Simplified cuda visible device setting (#8775)

    Serve

    • Horizontal scalability: Serve will now start one HTTP server per Ray node. (#9523)
    • Various performance improvements matching Serve to FastAPI (#9490, #8709, #9531, #9479, #9225, #9216, #9485)
    • API changes
      • serve.shadow_traffic(endpoint, backend, fraction) duplicates and sends a fraction of the incoming traffic to a specific backend. (#9106)
      • serve.shutdown() cleanup the current Serve instance in Ray cluster. (#8766)
      • An exception will be raised if num_replicas exceeds the maximum resources in the cluster (#9005)
    • Added doc examples for how to perform metric monitoring and model composition.

    Dashboard

    • Configurable Dashboard Port: The port on which the dashboard runs is now configurable using the --dashboard-port command-line argument and the dashboard_port argument to ray.init (see the sketch after this list)
    • GPU monitoring improvements
      • For machines with more than one GPU, the GPU and GRAM utilization is now broken out on a per-GPU basis.
      • Assignments to physical GPUs are now shown at the worker level.
    • Sortable Machine View: It is now possible to sort the machine view by almost any of its columns by clicking next to the title. In addition, whereas the workers are normally grouped by node, you can now ungroup them if you only want to see details about workers.
    • Actor Search Bar: You can now search for actors by their title (the actor's Python class name plus the arguments it received).
    • Logical View UI Updates: This includes things like color-coded names for each of the actor states, a more grid-like layout, and tooltips for the various data.
    • Sortable Memory View: Like the machine view, the memory view now has sortable columns and can be grouped / ungrouped by node.
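
    A minimal sketch of the configurable dashboard port; the port number is arbitrary, and the CLI equivalent mentioned above would be ray start --head --dashboard-port 8266:

    import ray

    # Serve the dashboard on a non-default port.
    ray.init(dashboard_port=8266)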

    Windows Support

    • Improve GPU detection (#9300)
    • Work around msgpack issue on PowerPC64LE (#9140)

    Others

    • Ray Streaming Library Improvements (#9240, #8910, #8780)
    • Java Support Improvements (#9371, #9033, #9037, #9032, #8858, #9777, #9836, #9377)
    • Parallel Iterator Improvements (#8964, #8978)

    Thanks

    We thank the following contributors for their work on this release: @jsuarez5341, @amitsadaphule, @krfricke, @williamFalcon, @richardliaw, @heyitsmui, @mehrdadn, @robertnishihara, @gabrieleoliaro, @amogkam, @fyrestone, @mimoralea, @edoakes, @andrijazz, @ElektroChan89, @kisuke95, @justinkterry, @SongGuyang, @barakmich, @bloodymeli, @simon-mo, @TomVeniat, @lixin-wei, @alanwguo, @zhuohan123, @michaelzhiluo, @ijrsvt, @pcmoritz, @LecJackS, @sven1977, @ashione, @JerryLeeCS, @raphaelavalos, @stephanie-wang, @ruifangChen, @vnlitvinov, @yncxcw, @weepingwillowben, @goulou, @acmore, @wuisawesome, @gramhagen, @anabranch, @internetcoffeephone, @Alisahhh, @henktillman, @deanwampler, @p-christ, @Nicolaus93, @WangTaoTheTonic, @allenyin55, @kfstorm, @rkooo567, @ConeyLiu, @09wakharet, @piojanu, @mfitton, @KristianHolsheimer, @AmeerHajAli, @pdames, @ericl, @VishDev12, @suquark, @stefanbschneider, @raulchen, @dcfidalgo, @chappers, @aaarne, @chaokunyang, @sumanthratna, @clarkzinzow, @BalaBalaYi, @maximsmol, @zhongchun, @wumuzi520, @ffbin

  • ray-0.8.6(Jun 24, 2020)

    Highlight

    • Experimental support for Windows is now available for single node Ray usage. Check out the Windows section below for known issues and other details.
    • Have you had troubles monitoring GPU or memory usage while you used Ray? The Ray dashboard now supports the GPU monitoring and a memory view.
    • Want to use RLlib with Unity? RLlib officially supports the Unity3D adapter! Please check out the documentation.
    • Ray Serve is ready for feedback! We've gotten feedback from many users, and Ray Serve is already being used in production. Please reach out to us with your use cases, ideas, documentation improvements, and feedback. We'd love to hear from you. Please do so on the Ray Slack and join #serve! Please see the Serve section below for more details.

    Core

    • We’ve introduced a new feature to automatically retry failed actor tasks after an actor has been restarted by Ray (by specifying max_restarts in @ray.remote). Try it out with max_task_retries=-1 where -1 indicates that the system can retry the task until it succeeds.
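
    A minimal sketch of the retry behavior described above; the actor and values are illustrative:

    import ray

    ray.init()

    # Restart the actor indefinitely if its process dies, and transparently
    # retry actor tasks that failed because of the crash.
    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class Doubler:
        def double(self, x):
            return 2 * x

    d = Doubler.remote()
    print(ray.get(d.double.remote(21)))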

    API Change

    • To enable automatic restarts of a failed actor, you must now use max_restarts in the @ray.remote decorator instead of max_reconstructions. You can use -1 to indicate infinity, i.e., the system should always restart the actor if it fails unexpectedly.
    • We’ve merged the named and detached actor APIs. To create an actor that will survive past the duration of its job (a “detached” actor), specify name=<str> in its remote constructor (Actor.options(name='<str>').remote()). To delete the actor, you can use ray.kill.
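
    A small sketch of the merged API; the actor name and class are illustrative:

    import ray

    ray.init()

    @ray.remote
    class EventLog:
        def __init__(self):
            self.events = []

        def record(self, event):
            self.events.append(event)

    # A named actor survives past the job that created it.
    log = EventLog.options(name="event_log").remote()
    ray.get(log.record.remote("job started"))

    # Explicitly delete the actor once it is no longer needed.
    ray.kill(log)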

    RLlib

    • PyTorch: IMPALA PyTorch version and all rllib/examples scripts now work for either TensorFlow or PyTorch (--torch command line option).
    • Switched to using distributed execution API by default (replaces Policy Optimizers) for all algorithms.
    • Unity3D adapter (supports all Env types: multi-agent, external env, vectorized) with example scripts for running locally or in the cloud.
    • Added support for variable length observation Spaces ("Repeated").
    • Added support for arbitrarily nested action spaces.
    • Added experimental GTrXL (Transformer/Attention net) support to RLlib + learning tests for PPO and IMPALA.
    • QMIX now supports complex observation spaces.

    API Change

    • Retire use_pytorch and eager flags in configs and replace these with framework=[tf|tfe|torch].
    • Deprecate PolicyOptimizers in favor of the new distributed execution API.
    • Retired support for Model(V1) class. Custom Models should now only use the ModelV2 API. There is still a warning when using ModelV1, which will be changed into an error message in the next release.
    • Retired TupleActions (in favor of arbitrarily nested action Spaces).

    Ray Tune / RaySGD

    • There is now a Dataset API for handling large datasets with RaySGD. (#7839)
    • You can now filter by an average of the last results using the ExperimentAnalysis tool (#8445).
    • BayesOptSearch received numerous contributions, enabling preliminary random search and warm starting. (#8541, #8486, #8488)

    API Changes

    • tune.report is now the right way to use the Tune function API. tune.track is deprecated (#8388)

    Serve

    • New APIs to inspect and manage Serve objects:
      • serve.list_backends and serve.list_endpoints (#8737)
      • serve.delete_backend and serve.delete_endpoint (#8252, #8256)
    • serve.create_endpoint now requires specifying the backend directly. You can remove serve.set_traffic if there's only one backend per endpoint. (#8764) See the sketch after this list.
    • serve.init API cleanup, the following options were removed:
      • blocking, ray_init_kwargs, start_server (#8747, #8447, #8620)
    • serve.init now supports namespacing with name. You can run multiple serve clusters with different names on the same ray cluster. (#8449)
    • You can specify session affinity when splitting traffic with backends using X-SERVE-SHARD-KEY HTTP header. (#8449)
    • Various documentation improvements. Highlights:
      • A new section on how to perform A/B testing and incremental rollout (#8741)
      • Tutorial for batch inference (#8490)
      • Instructions for specifying GPUs and resources (#8495)
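
    A rough sketch of the endpoint/backend workflow referenced in this list; the argument names and ordering are assumptions about this legacy interface:

    from ray import serve

    serve.init()

    def hello(flask_request):
        return "hello"

    # Register a backend and attach an endpoint to it directly
    # (no separate serve.set_traffic call needed for a single backend).
    serve.create_backend("hello:v1", hello)
    serve.create_endpoint("hello_endpoint", backend="hello:v1", route="/hello")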

    Dashboard / Metrics

    • The Machine View of the dashboard now shows information about GPU utilization such as:
      • Average GPU/GRAM utilization at a node and cluster level
      • Worker-level information about how many GPUs each worker is assigned as well as its GRAM use.
    • The dashboard has a new Memory View tab that should be very useful for debugging memory issues. It has:
      • Information about objects in the Ray object store, including size and call-site
      • Information about reference counts and what is keeping an object pinned in the Ray object store.

    Small changes

    • IDLE workers get automatically sorted to the end of the worker list in the Machine View

    Autoscaler

    • Improved logging output. Errors are more clearly propagated and excess output has been reduced. (#7198, #8751, #8753)
    • Added support for k8s services.

    API Changes

    • ray up accepts remote URLs that point to the desired cluster YAML. (#8279)

    Windows support

    • Windows wheels are now available for basic experimental usage (via ray.init()).
    • Windows support is currently unstable. Unusual, unattended, or production usage is not recommended.
    • Various functionality may still lack support, including Ray Serve, Ray SGD, the autoscaler, the dashboard, non-ASCII file paths, etc.
    • Please check the latest nightly wheels & known issues (#9114), and let us know if any issue you encounter has not yet been addressed.
    • Wheels are available for Python 3.6, 3.7, and 3.8. (#8369)
    • redis-py has been patched for Windows sockets. (#8386)

    Others

    • Moving towards highly available Ray (#8650, #8639, #8606, #8601, #8591, #8442)
    • Java Support (#8730, #8640, #8637)
    • Ray streaming improvements (#8612, #8594, #7464)
    • Parallel iterator improvements (#8140, #7931, #8712)

    Thanks

    We thank the following contributors for their work on this release: @pcmoritz, @akharitonov, @devanderhoff, @ffbin, @anabranch, @jasonjmcghee, @kfstorm, @mfitton, @alecbrick, @simon-mo, @konichuvak, @aniryou, @wuisawesome, @robertnishihara, @ramanNarasimhan77, @09wakharet, @richardliaw, @istoica, @ThomasLecat, @sven1977, @ceteri, @acxz, @iamhatesz, @JarnoRFB, @rkooo567, @mehrdadn, @thomasdesr, @janblumenkamp, @ujvl, @edoakes, @maximsmol, @krfricke, @amogkam, @gehring, @ijrsvt, @internetcoffeephone, @LucaCappelletti94, @chaokunyang, @WangTaoTheTonic, @fyrestone, @raulchen, @ConeyLiu, @stephanie-wang, @suquark, @ashione, @Coac, @JosephTLucas, @ericl, @AmeerHajAli, @pdames

    Source code(tar.gz)
    Source code(zip)
  • ray-0.8.5(May 7, 2020)

    Highlight

    Core

    • Task cancellation is now available for locally submitted tasks (see the sketch after this list). (#7699)
    • Experimental support for recovering objects that were lost from the Ray distributed memory store. You can try this out by setting lineage_pinning_enabled: 1 in the internal config. (#7733)
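
    A minimal sketch of cancelling a locally submitted task with ray.cancel; the exact exception raised by ray.get on a cancelled task varies between Ray versions, so the example catches a generic Exception.

    import time
    import ray

    ray.init()

    @ray.remote
    def sleeper():
        time.sleep(10000)  # a long-running task we will cancel
        return "done"

    ref = sleeper.remote()  # submitted from this (local) driver
    ray.cancel(ref)         # request cancellation
    try:
        ray.get(ref)
    except Exception as exc:  # the concrete exception type is version-dependent
        print("task was cancelled:", type(exc).__name__)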

    RLlib

    • PyTorch support has now reached parity with TensorFlow. (#7926, #8188, #8120, #8101, #8106, #8104, #8082, #7953, #7984, #7836, #7597, #7797)
    • Improved callbacks API. (#6972)
    • Enable Ray distributed reference counting. (#8037)
    • Work towards customizable distributed training workflows. (#7958, #8077)

    Tune

    • Documentation has improved with a new format. (#8083, #8201, #7716)
    • Search algorithms are refactored to make them easier to extend, deprecating the max_concurrent argument. (#7037, #8258, #8285)
    • TensorboardX errors are now handled safely. (#8174)
    • Bug fix in PBT checkpointing. (#7794)
    • New ZOOpt search algorithm added. (#7960)

    Serve

    • Improved APIs.
      • Add delete_endpoint and delete_backend. (#8252, #8256)
      • Use dictionary to update backend config. (#8202)
    • Added overview section to the documentation.
    • Added tutorials for serving models in Tensorflow/Keras, PyTorch, and Scikit-Learn.
    • Made Serve clusters tolerant to process failures. (#8116, #8008, #7970, #7936)

    SGD

    • New Semantic Segmentation and HuggingFace GLUE Fine-tuning Examples. (#7792, #7825)
    • Fix GPU Reservations in SLURM usage. (#8157)
    • Update learning rate scheduler stepping parameter. (#8107)
    • Make serialization of data creation optional. (#8027)
    • Automatic DDP wrapping is now optional. (#7875)

    Other Projects

    • Progress towards the highly available and fault tolerant control plane. (#8144, #8119, #8145, #7909, #7949, #7771, #7557, #7675)
    • Progress towards the Ray streaming library. (#8044, #7827, #7955, #7961, #7348)
    • Autoscaler improvement. (#8178, #8168, #7986, #7844, #7717)
    • Progress towards Java support. (#8014)
    • Progress towards Windows compatibility. (#8237, #8186)
    • Progress towards cross language support. (#7711)

    Thanks

    We thank the following contributors for their work on this release:

    @simon-mo, @robertnishihara, @BalaBalaYi, @ericl, @kfstorm, @tirkarthi, @nflu, @ffbin, @chaokunyang, @ijrsvt, @pcmoritz, @mehrdadn, @sven1977, @iamhatesz, @nmatthews-asapp, @mitchellstern, @edoakes, @anabranch, @billowkiller, @eisber, @ujvl, @allenyin55, @yncxcw, @deanwampler, @DavidMChan, @ConeyLiu, @micafan, @rkooo567, @datayjz, @wizardfishball, @sumanthratna, @ashione, @marload, @stephanie-wang, @richardliaw, @jovany-wang, @MissiontoMars, @aannadi, @fyrestone, @JarnoRFB, @wumuzi520, @roireshef, @acxz, @gramhagen, @Servon-Lee, @ClarkZinzow, @mfitton, @maximsmol, @janblumenkamp, @istoica

    Source code(tar.gz)
    Source code(zip)
  • ray-0.8.4(Apr 2, 2020)

    Highlight

    • Add Python 3.8 support. (#7754)

    Core

    • Fix asyncio actor deserialization. (#7806)
    • Fix a symbol collision segfault when importing PyArrow. (#7568)
    • ray memory will collect statistics from all nodes. (#7721)
    • Pin lineage of plasma objects that are still in scope. (#7690)

    RLlib

    • Add contextual bandit algorithms. (#7642)
    • Add parameter noise exploration API. (#7772)
    • Add scaling guide. (#7780)
    • Enable restoring Keras models from h5 files. (#7482)
    • Store the TF graph by default when doing Policy.export_model(). (#7759)
    • Fix the default policy overriding the torch policy. (#7756, #7769)

    RaySGD

    • BREAKING: Add new API for tuning TorchTrainer using Tune. (#7547)
    • BREAKING: Convert the head worker to a local model. (#7746)
    • Added a new API for save/restore. (#7547)
    • Add tqdm support to TorchTrainer. (#7588)

    Tune

    • Add sorted columns and TensorBoard to Tune tab. (#7140)
    • Tune experiments can now be cancelled via the REST client. (#7719)
    • fail_fast enables experiments to fail quickly (see the sketch after this list). (#7528)
    • Allow overriding the IP retrieval process if needed. (#7705)
    • TensorBoardX nested dictionary support. (#7705)
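
    A minimal sketch of fail_fast with a trivial function trainable (using tune.report, which is available in newer Tune versions); with fail_fast=True, the whole experiment is aborted as soon as any trial errors.

    from ray import tune

    def trainable(config):
        if config["x"] < 0:
            raise ValueError("bad hyperparameter")  # any trial error aborts the run
        tune.report(score=config["x"])

    # fail_fast=True stops the experiment on the first trial failure
    tune.run(
        trainable,
        config={"x": tune.grid_search([-1, 1, 2])},
        fail_fast=True,
    )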

    Serve

    • Performance improvements:
      • Push route table updates to HTTP proxy. (#7774)
      • Improve serialization. (#7688)
    • Add async methods support for serve actors. (#7682)
    • Add multiple method support for serve actors. (#7709)
      • You can specify HTTP methods in serve.create_backend(..., methods=["GET", "POST"]).
      • You can specify which actor method to execute over HTTP via the X-SERVE-CALL-METHOD header, or in RayServeHandle via handle.options("method").remote(...).

    Others

    • Progress towards highly available control plane. (#7822, #7742)
    • Progress towards Windows compatibility. (#7740, #7739, #7657)
    • Progress towards Ray Streaming library. (#7813)
    • Progress towards metrics export service. (#7809)
    • Basic C++ worker implementation. (#6125)

    Thanks

    We thank the following contributors for their work on this release:

    @carlbalmer, @BalaBalaYi, @saurabh3949, @maximsmol, @SongGuyang, @istoica, @pcmoritz, @aannadi, @kfstorm, @ijrsvt, @richardliaw, @mehrdadn, @wumuzi520, @cloudhan, @edoakes, @mitchellstern, @robertnishihara, @hhoke, @simon-mo, @ConeyLiu, @stephanie-wang, @rkooo567, @ffbin, @ericl, @hubcity, @sven1977

    Source code(tar.gz)
    Source code(zip)
  • ray-0.8.3(Mar 25, 2020)

    Highlights

    • Autoscaler has added Azure Support. (#7080, #7515, #7558, #7494)
      • Ray autoscaler helps you launch a distributed ray cluster using a single command line call!
      • It works on Azure, AWS, GCP, Kubernetes, Yarn, Slurm and local nodes.
    • Distributed reference counting is turned on by default. (#7628, #7337)
      • This means all ray objects are tracked and garbage collected only when all references go out of scope. It can be turned off with: ray.init(_internal_config=json.dumps({"distributed_ref_counting_enabled": 0})).
      • When the object store is full with objects that are still in scope, you can turn on least-recently-used eviction to force remove objects using ray.init(lru_evict=True).
    • A new command ray memory is added to help debug memory usage: (#7589)
      • It shows all object IDs that are in scope, their reference types, sizes and creation site.
        • Read more in the docs: https://ray.readthedocs.io/en/latest/memory-management.html.
    > ray memory
    -----------------------------------------------------------------------------------------------------
     Object ID                                Reference Type       Object Size   Reference Creation Site
    =====================================================================================================
    ; worker pid=51230
    ffffffffffffffffffffffff0100008801000000  PINNED_IN_MEMORY            8231   (deserialize task arg) __main__..sum_task
    ; driver pid=51174
    45b95b1c8bd3a9c4ffffffff010000c801000000  USED_BY_PENDING_TASK           ?   (task call) memory_demo.py:<module>:13
    ffffffffffffffffffffffff0100008801000000  USED_BY_PENDING_TASK        8231   (put object) memory_demo.py:<module>:6
    ef0a6c221819881cffffffff010000c801000000  LOCAL_REFERENCE                ?   (task call) memory_demo.py:<module>:14
    -----------------------------------------------------------------------------------------------------
    

    API change

    • Change actor.__ray_kill__() to ray.kill(actor) (see the sketch after this list). (#7360)
    • Deprecate use_pickle flag for serialization. (#7474)
    • Remove experimental.NoReturn. (#7475)
    • Remove experimental.signal API. (#7477)
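
    A minimal sketch of the renamed termination call; ray.kill(actor) replaces the old actor.__ray_kill__().

    import ray

    ray.init()

    @ray.remote
    class Worker:
        def ping(self):
            return "pong"

    a = Worker.remote()
    print(ray.get(a.ping.remote()))
    ray.kill(a)  # immediately terminates the actor (previously actor.__ray_kill__())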

    Core

    • Add Apache 2 license header to C++ files. (#7520)
    • Reduce per worker memory usage to 50MB. (#7573)
    • Option to fallback to LRU on OutOfMemory. (#7410)
    • Reference counting for actor handles. (#7434)
    • Reference counting for returning object IDs created by a different process. (#7221)
    • Use prctl(PR_SET_PDEATHSIG) on Linux instead of reaper. (#7150)
    • Route asyncio plasma through raylet instead of direct plasma connection. (#7234)
    • Remove static concurrency limit from gRPC server. (#7544)
    • Remove get_global_worker(), RuntimeContext. (#7638)
    • Fix known issues from 0.8.2 release:
      • Fix passing duplicate by-reference arguments. (#7306)
      • Raise the gRPC message size limit to 100MB. (#7269)

    RLlib

    • New features:
      • Exploration API improvements. (#7373, #7314, #7380)
      • SAC: add discrete action support. (#7320, #7272)
      • Add high-performance external application connector. (#7641)
    • Bug fix highlights:
      • PPO torch memory leak and unnecessary torch.Tensor creation and gc'ing. (#7238)
      • Rename sample_batch_size => rollout_fragment_length. (#7503)
      • Fix bugs and speed up SegmentTree.

    Tune

    • Integrate Dragonfly optimizer. (#5955)
    • Fix HyperBand errors. (#7563)
    • Access Trial Name, Trial ID inside trainable. (#7378)
    • Add a new repeater class for high variance trials. (#7366)
    • Prevent deletion of checkpoint from user-initiated restoration. (#7501)

    Libraries

    • [Parallel Iterators] Allow for operator chaining after repartition. (#7268)
    • [Parallel Iterators] Repartition functionality. (#7163)
    • [Serve] @serve.route returns a handle, add handle.scale, handle.set_max_batch_size. (#7569)
    • [RaySGD] PyTorchTrainer --> TorchTrainer. (#7425)
    • [RaySGD] Custom training API. (#7211)
    • [RaySGD] Breaking user-facing API changes (see the sketch after this list): (#7384)
      • data_creator fed to TorchTrainer now must return a dataloader rather than datasets.
      • TorchTrainer automatically sets "DistributedSampler" if a DataLoader is returned.
      • data_loader_config and batch_size are no longer parameters for TorchTrainer.
      • TorchTrainer parallelism is now set by num_workers.
      • All TorchTrainer args now must be named parameters.
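
    A hedged sketch of a TorchTrainer call after these changes, assuming the 0.8.3-era creator-function keywords (model_creator, data_creator, optimizer_creator, loss_creator) and the ray.util.sgd import path; exact keyword names may differ slightly between releases.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from ray.util.sgd import TorchTrainer  # renamed from PyTorchTrainer (#7425)

    def model_creator(config):
        return nn.Linear(1, 1)

    def data_creator(config):
        # must return DataLoader(s) rather than raw datasets;
        # TorchTrainer adds a DistributedSampler automatically
        dataset = TensorDataset(torch.randn(64, 1), torch.randn(64, 1))
        return DataLoader(dataset, batch_size=config.get("batch_size", 8))

    def optimizer_creator(model, config):
        return torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.01))

    # all arguments are named; parallelism is set via num_workers
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=lambda config: nn.MSELoss(),
        num_workers=2,
        config={"lr": 0.01, "batch_size": 8},
    )
    trainer.train()
    trainer.shutdown()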

    Java

    • New Java actor API (#7414)
      • @RayRemote annotation is removed.
      • Instead of Ray.call(ActorClass::method, actor), the new API is actor.call(ActorClass::method).
    • Allow passing internal config from raylet to Java worker. (#7532)
    • Enable direct call by default. (#7408)
    • Pass large object by reference. (#7595)

    Others

    • Progress towards Ray Streaming, including a Python API. (#7070, #6755, #7152, #7582)
    • Progress towards GCS Service for GCS fault tolerance. (#7292, #7592, #7601, #7166)
    • Progress towards cross language call between Java and Python. (#7614, #7634)
    • Progress towards Windows compatibility. (#7529, #7509, #7658, #7315)
    • Improvement in K8s Operator. (#7521, #7621, #7498, #7459, #7622)
    • New documentation for Ray Dashboard. (#7304)

    Known issues

    • Ray currently doesn't work on Python 3.5.0, but works on 3.5.3 and above.

    Thanks

    We thank the following contributors for their work on this release: @rkooo567, @maximsmol, @suquark, @mitchellstern, @micafan, @ClarkZinzow, @Jimpachnet, @mwbrulhardt, @ujvl, @chaokunyang, @robertnishihara, @jovany-wang, @hyeonjames, @zhijunfu, @datayjz, @fyrestone, @eisber, @stephanie-wang, @allenyin55, @BalaBalaYi, @simon-mo, @thedrow, @ffbin, @amogkam, @TisonKun, @richardliaw, @ijrsvt, @wumuzi520, @mehrdadn, @raulchen, @landcold7, @ericl, @edoakes, @sven1977, @ashione, @jorenretel, @gramhagen, @kfstorm, @anthonyhsyu, @pcmoritz

    Source code(tar.gz)
    Source code(zip)
  • ray-0.8.2(Feb 24, 2020)

    Highlights

    • PyArrow is no longer vendored. Ray directly uses the C++ Arrow API, and you can use any version of pyarrow with Ray. (#7233)
    • The dashboard is turned on by default. It shows node and process information, actor information, and Ray Tune trial information. You can also use ray.show_in_webui to display custom messages for actors (see the sketch after this list). Please try it out and send us feedback! (#6705, #6820, #6822, #6911, #6932, #6955, #7028, #7034)
    • We have made progress on distributed reference counting (behind a feature flag). You can try it out with ray.init(_internal_config=json.dumps({"distributed_ref_counting_enabled": 1})). It is designed to help manage memory using precise distributed garbage collection. (#6945, #6946, #7029, #7075, #7218, #7220, #7222, #7235, #7249)
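
    A minimal sketch of annotating an actor in the dashboard, assuming ray.show_in_webui accepts a single message string as described above.

    import ray

    ray.init()  # the dashboard now starts by default

    @ray.remote
    class Worker:
        def work(self, batch_id):
            # display a custom status message for this actor in the dashboard
            ray.show_in_webui("processing batch {}".format(batch_id))
            return batch_id

    w = Worker.remote()
    print(ray.get(w.work.remote(1)))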

    Breaking changes

    • Many experimental Ray libraries have moved to the util namespace (see the import sketch after this list). (#7100)
      • ray.experimental.multiprocessing => ray.util.multiprocessing
      • ray.experimental.joblib => ray.util.joblib
      • ray.experimental.iter => ray.util.iter
      • ray.experimental.serve => ray.serve
      • ray.experimental.sgd => ray.util.sgd
    • Tasks and actors are cleaned up if their owner process dies. (#6818)
    • The OMP_NUM_THREADS environment variable defaults to 1 if unset. This improves training performance and reduces resource contention. (#6998)
    • We now vendor psutil and setproctitle to support turning the dashboard on by default. Running import psutil after import ray will use the version of psutil that ships with Ray. (#7031)
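
    A minimal sketch of the new import paths (the heavier optional modules are left as comments); the APIs behind these modules are unchanged, only their locations move.

    # From 0.8.2 on (old paths shown in the comments):
    from ray.util import iter as parallel_it       # was: from ray.experimental import iter
    from ray.util.multiprocessing import Pool      # was: ray.experimental.multiprocessing
    from ray.util.joblib import register_ray       # was: ray.experimental.joblib (needs joblib installed)
    # ray.experimental.serve -> ray.serve
    # ray.experimental.sgd   -> ray.util.sgd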

    Core

    • The Python raylet client is removed. All raylet communication now goes through the core worker. (#6018)
    • Calling delete() will not delete objects in the in-memory store. (#7117)
    • Removed vanilla pickle serialization for task arguments. (#6948)
    • Fix bug passing empty bytes into Python tasks. (#7045)
    • Progress toward next generation ray scheduler. (#6913)
    • Progress toward service based global control store (GCS). (#6686, #7041)

    RLlib

    • Improved PyTorch support, including a PyTorch version of PPO. (#6826, #6770)
    • Added distributed SGD for PPO. (#6918, #7084)
    • Added an exploration API for controlling epsilon-greedy and stochastic exploration (see the sketch after this list). (#6974, #7155)
    • Fixed schedule values going negative past the end of the schedule. (#6971, #6973)
    • Added support for histogram outputs in TensorBoard. (#6942)
    • Added support for parallel and customizable evaluation step. (#6981)
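
    A hedged sketch of configuring the exploration API for DQN; the exploration_config keys below follow later RLlib releases (EpsilonGreedy with initial_epsilon, final_epsilon, epsilon_timesteps) and may differ slightly in 0.8.2.

    from ray import tune

    tune.run(
        "DQN",
        stop={"training_iteration": 5},
        config={
            "env": "CartPole-v1",
            # assumed schema from later RLlib versions
            "exploration_config": {
                "type": "EpsilonGreedy",
                "initial_epsilon": 1.0,
                "final_epsilon": 0.02,
                "epsilon_timesteps": 10000,
            },
        },
    )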

    Tune

    • Improved Ax Example. (#7012)
    • Process saves asynchronously. (#6912)
    • Default to TensorBoardX and include it in requirements. (#6836)
    • Added an experiment stopping API. (#6886)
    • Expose progress reporter to users. (#6915)
    • Fix directory naming regression. (#6839)
    • Handle the NaN case for async HyperBand. (#6916)
    • Prevent memory checkpoints from breaking trial fault tolerance. (#6691)
    • Remove keras dependency. (#6827)
    • Remove unused tf loggers. (#7090)
    • Set correct path when deleting checkpoint folder. (#6758)
    • Support callable objects in variant generation. (#6849)

    Autoscaler

    • Ray nodes now respect docker limits. (#7039)
    • Add --all-nodes option to rsync-up. (#7065)
    • Add port-forwarding support for attach. (#7145)
    • For AWS, default to latest deep learning AMI. (#6922)
    • Added a 'ray dashboard' command to proxy the Ray dashboard on a remote machine. (#6959)

    Utility libraries

    • Support for scikit-learn via the Ray joblib backend (see the sketch after this list). (#6925)
    • Parallel iterators support local shuffle. (#6921)
    • [Serve] Support no-HTTP (headless) services. (#7010)
    • [Serve] Refactor the router to use Ray asyncio support. (#6873)
    • [Serve] Support composing arbitrary DAGs. (#7015)
    • [RaySGD] Support fp16 via PyTorch Apex. (#7061)
    • [RaySGD] Refactor the PyTorch SGD documentation. (#6910)
    • Improvements in Ray Streaming. (#7043, #6666, #7071)
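
    A minimal sketch of the Ray joblib backend with scikit-learn (assumes scikit-learn and joblib are installed); register_ray() makes "ray" available as a joblib backend so scikit-learn's internal parallelism runs on Ray workers.

    import joblib
    from sklearn.datasets import load_digits
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC
    from ray.util.joblib import register_ray

    register_ray()  # register "ray" as a joblib backend

    digits = load_digits()
    param_space = {"C": [1, 10, 100], "gamma": [0.001, 0.01]}
    search = RandomizedSearchCV(SVC(), param_space, n_iter=4, cv=3)

    with joblib.parallel_backend("ray"):
        search.fit(digits.data, digits.target)  # the CV fits run on Ray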

    Other improvements

    • Progress toward Windows compatibility. (#6882, #6823)
    • Ray Kubernetes operator improvements. (#6852, #6851, #7091)
    • Java support for concurrent actor calls API. (#7022)
    • Java support for direct call for normal tasks. (#7193)
    • Java support for cross language Python invocation. (#6709)
    • Java support for cross language serialization for actor handles. (#7134)

    Known issues

    • Passing the same ObjectID multiple times as an argument currently doesn't work. (#7296)
    • Tasks can exceed gRPC max message size. (#7263)

    Thanks

    We thank the following contributors for their work on this release: @mitchellstern, @hugwi, @deanwampler, @alindkhare, @ericl, @ashione, @fyrestone, @robertnishihara, @pcmoritz, @richardliaw, @yutaizhou, @istoica, @edoakes, @ls-daniel, @BalaBalaYi, @raulchen, @justinkterry, @roireshef, @elpollouk, @kfstorm, @Bassstring, @hhbyyh, @Qstar, @mehrdadn, @chaokunyang, @flying-mojo, @ujvl, @AnanthHari, @rkooo567, @simon-mo, @jovany-wang, @ijrsvt, @ffbin, @AmeerHajAli, @gaocegege, @suquark, @MissiontoMars, @zzyunzhi, @sven1977, @stephanie-wang, @amogkam, @wuisawesome, @aannadi, @maximsmol

    Source code(tar.gz)
    Source code(zip)
  • ray-0.8.1(Jan 27, 2020)

    Ray 0.8.1 Release Notes

    Highlights

    • ObjectIDs corresponding to ray.put() objects and task returns are now reference counted locally in Python and when passed into a remote task as an argument. ObjectIDs that have a nonzero reference count will not be evicted from the object store. Note that references for ObjectIDs passed into remote tasks inside of other objects (e.g., f.remote((ObjectID,)) or f.remote([ObjectID])) are not currently accounted for. (#6554)
    • asyncio actor support: actors can now define async def methods, and Ray will run multiple method invocations in the same event loop. The maximum concurrency level can be adjusted with ActorClass.options(max_concurrency=2000).remote().
    • asyncio ObjectID support: Ray ObjectIDs can now be directly awaited using the Python API. await my_object_id is similar to ray.get(my_object_id), but allows context switching to make the operation non-blocking. You can also convert an ObjectID to an asyncio.Future using ObjectID.as_future(). (See the sketch after this list.)
    • Added experimental parallel iterators API (#6644, #6726): parallel iterators can be used to more conveniently load and process data into Ray actors. See the documentation for details.
    • Added multiprocessing.Pool API (#6194): Ray now supports the multiprocessing.Pool API out of the box, so you can scale existing programs up from a single node to a cluster by changing only the import statement. See the documentation for details.
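
    A minimal sketch of the asyncio features described above, based on the 0.8.1 API; it assumes async methods are detected automatically and that ObjectIDs can be awaited from any async function in the driver.

    import asyncio
    import ray

    ray.init()

    @ray.remote
    class AsyncCounter:
        def __init__(self):
            self.n = 0

        async def slow_increment(self):
            await asyncio.sleep(0.1)  # concurrent invocations interleave on one event loop
            self.n += 1
            return self.n

    # cap the number of concurrent method invocations for this actor
    counter = AsyncCounter.options(max_concurrency=2000).remote()

    async def main():
        refs = [counter.slow_increment.remote() for _ in range(5)]
        # each ObjectID can be awaited like a non-blocking ray.get
        print([await ref for ref in refs])

    asyncio.get_event_loop().run_until_complete(main())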

    Core

    • Deprecated Python 2 (#6581, #6601, #6624, #6665)
    • Fixed bug when failing to import remote functions or actors with args and kwargs (#6577)
    • Many improvements to the dashboard (#6493, #6516, #6521, #6574, #6590, #6652, #6671, #6683, #6810)
    • Progress towards Windows compatibility (#6446, #6548, #6653, #6706)
    • Redis now binds to localhost and has a password set by default (#6481)
    • Added actor.__ray_kill__() to terminate actors immediately (#6523)
    • Added 'ray stat' command for debugging (#6622)
    • Added documentation for fault tolerance behavior (#6698)
    • Treat static methods as class methods instead of instance methods in actors (#6756)

    RLlib

    • DQN distributional model: Replace all legacy tf.contrib imports with tf.keras.layers.xyz or tf.initializers.xyz (#6772)
    • SAC site changes (#6759)
    • PG unify/cleanup tf vs torch and PG functionality test cases (tf + torch) (#6650)
    • SAC for Mujoco Environments (#6642)
    • Fixed tuple action dist tensors not being reduced properly in eager mode (#6615)
    • Changed foreach_policy to foreach_trainable_policy (#6564)
    • Wrapper for the dm_env interface (#6468)

    Tune

    • Get checkpoints paths for a trial after tuning (#6643)
    • Async restores and S3/GCP-capable trial fault tolerance (#6376)
    • Usability error improvements for PBT (#5972)
    • Demo exporting trained models in PBT examples (#6533)
    • Avoid duplication in TrialRunner execution (#6598)
    • Update params for optimizer in reset_config (#6522)
    • Support type hinting for Python 3 (#6571)

    Other Libraries

    • [serve] Pluggable Queueing Policy (#6492)
    • [serve] Added BackendConfig (#6541)
    • [sgd] Fault tolerance support for pytorch + revamp documentation (#6465)

    Thanks

    We thank the following contributors for their work on this release:

    @chaokunyang, @Qstar, @simon-mo, @wlx65003, @stephanie-wang, @alindkhare, @ashione, @harrisonfeng, @JingGe, @pcmoritz, @zhijunfu, @BalaBalaYi, @kfstorm, @richardliaw, @mitchellstern, @michaelzhiluo, @ziyadedher, @istoica, @EyalSel, @ffbin, @raulchen, @edoakes, @chenk008, @frthjf, @mslapek, @gehring, @hhbyyh, @zzyunzhi, @zhu-eric, @MissiontoMars, @sven1977, @walterddr, @micafan, @inventormc, @robertnishihara, @ericl, @ZhongxiaYan, @mehrdadn, @jovany-wang, @ujvl, @bharatpn

    Source code(tar.gz)
    Source code(zip)