An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Last update: Dec 31, 2022

Related tags

Machine Learning python java data-science machine-learning reinforcement-learning deep-learning deployment optimization parallel distributed model-selection hyperparameter-optimization ray automl hyperparameter-search serving rllib

Overview

https://readthedocs.org/projects/ray/badge/?version=master

https://img.shields.io/badge/Ray-Join%20Slack-blue

https://img.shields.io/badge/Discuss-Ask%20Questions-blue

Ray provides a simple, universal API for building distributed applications.

Ray is packaged with the following libraries for accelerating machine learning workloads:

Tune: Scalable Hyperparameter Tuning
RLlib: Scalable Reinforcement Learning
RaySGD: Distributed Training Wrappers
Ray Serve: Scalable and Programmable Serving

There are also many community integrations with Ray, including Dask, MARS, Modin, Horovod, Hugging Face, Scikit-learn, and others. Check out the full list of Ray distributed libraries here.

Install Ray with: pip install ray. For nightly wheels, see the Installation page.

Quick Start

Execute Python functions in parallel.

import ray
ray.init()

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

To use Ray's actor model:

import ray
ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters = [Counter.remote() for i in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures))

Ray programs can run on a single machine, and can also seamlessly scale to large clusters. To execute the above Ray script in the cloud, just download this configuration file, and run:

ray submit [CLUSTER.YAML] example.py --start

Tune Quick Start

Tune is a library for hyperparameter tuning at any scale.

Launch a multi-node distributed hyperparameter sweep in less than 10 lines of code.
Supports any deep learning framework, including PyTorch, PyTorch Lightning, TensorFlow, and Keras.
Visualize results with TensorBoard.
Choose among scalable SOTA algorithms such as Population Based Training (PBT), Vizier's Median Stopping Rule, HyperBand/ASHA.
Tune integrates with many optimization libraries such as Facebook Ax, HyperOpt, and Bayesian Optimization and enables you to scale them transparently.

To run this example, you will need to install the following:

$ pip install "ray[tune]"

This example runs a parallel grid search to optimize an example objective function.

from ray import tune


def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back back to Tune.
        tune.report(mean_loss=intermediate_score)


analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3])
    })

print("Best config: ", analysis.get_best_config(metric="mean_loss", mode="min"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df

If TensorBoard is installed, automatically visualize all trial results:

tensorboard --logdir ~/ray_results

RLlib Quick Start

RLlib is an open-source library for reinforcement learning built on top of Ray that offers both high scalability and a unified API for a variety of applications.

pip install tensorflow  # or tensorflow-gpu
pip install "ray[rllib]"

import gym
from gym.spaces import Discrete, Box
from ray import tune

class SimpleCorridor(gym.Env):
    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(0.0, self.end_pos, shape=(1, ))

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        return [self.cur_pos], 1 if done else 0, done, {}

tune.run(
    "PPO",
    config={
        "env": SimpleCorridor,
        "num_workers": 4,
        "env_config": {"corridor_length": 5}})

Ray Serve Quick Start

Ray Serve is a scalable model-serving library built on Ray. It is:

Framework Agnostic: Use the same toolkit to serve everything from deep learning models built with frameworks like PyTorch or Tensorflow & Keras to Scikit-Learn models or arbitrary business logic.
Python First: Configure your model serving with pure Python code - no more YAMLs or JSON configs.
Performance Oriented: Turn on batching, pipelining, and GPU acceleration to increase the throughput of your model.
Composition Native: Allow you to create "model pipelines" by composing multiple models together to drive a single prediction.
Horizontally Scalable: Serve can linearly scale as you add more machines. Enable your ML-powered service to handle growing traffic.

To run this example, you will need to install the following:

$ pip install scikit-learn
$ pip install "ray[serve]"

This example runs serves a scikit-learn gradient boosting classifier.

from ray import serve
import pickle
import requests
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Train model
iris_dataset = load_iris()
model = GradientBoostingClassifier()
model.fit(iris_dataset["data"], iris_dataset["target"])

# Define Ray Serve model,
class BoostingModel:
    def __init__(self):
        self.model = model
        self.label_list = iris_dataset["target_names"].tolist()

    def __call__(self, flask_request):
        payload = flask_request.json["vector"]
        print("Worker: received flask request with data", payload)

        prediction = self.model.predict([payload])[0]
        human_name = self.label_list[prediction]
        return {"result": human_name}


# Deploy model
client = serve.start()
client.create_backend("iris:v1", BoostingModel)
client.create_endpoint("iris_classifier", backend="iris:v1", route="/iris")

# Query it!
sample_request_input = {"vector": [1.2, 1.0, 1.1, 0.9]}
response = requests.get("http://localhost:8000/iris", json=sample_request_input)
print(response.text)
# Result:
# {
#  "result": "versicolor"
# }

More Information

Documentation
Tutorial
Blog
Ray 1.0 Architecture whitepaper (new)
Ray Design Patterns (new)
RLlib paper
RLlib flow paper
Tune paper

Older documents:

Getting Involved

Forum: For discussions about development, questions about usage, and feature requests.
GitHub Issues: For reporting bugs.
Twitter: Follow updates on Twitter.
Meetup Group: Join our meetup group.
StackOverflow: For questions about how to use Ray.

Comments

[WIP] Implement Ape-X distributed prioritization
What do these changes do?

This implements https://openreview.net/forum?id=H1Dy---0Z for testing. The main ideas from Ape-X are:

Worker-side prioritization: rather than take new samples as max priority, prioritize them in workers. This scales experience gathering.

Per-worker exploration: Rather than choosing a single exploration schedule, assign each worker a different exploration value ranging from 0.4 to ~0.0.

WIP: evaluation on pong. This implementation probably doesn't scale to very high sample throughputs, but we should probably be able to see some gains on a couple dozen workers.
opened by ericl 199
[ray-core] Initial addition of performance integration testing files
A Dockerfile specific for this test

This is needed because we eventually will upload these numbers to S3

Addition of simple performance test for time it takes for a number of variable number of tasks with a variable number of CPUs

A couple of bash scripts to setup the Docker environment and run the tests

What do these changes do?

Related issue number
opened by devin-petersohn 134
Make Bazel the default build system

What do these changes do?

This switches the build system from CMake to Bazel for developers.

The wheels, valgrind tests and Jenkins are currently still run with CMake and will be switched in follow up PRs.

Related issue number

opened by pcmoritz 130
Streaming data transfer and python integration
Why are these changes needed?

This is the minimal implementation of streaming data transfer mentioned in doc, consisting of three parts:

writer/reader, implemented with C++ to transfer data between streaming workers

streaming queue, the transport layer based on Ray’s direct actor call and C++ Core Worker APIs

adaption layer for python, implemented with cython to adapt writer/reader for python

To integrate python with streaming c++ data transfer, following changes are made:

We moved python code from python/ray/experimental/streaming/ to streaming/python/ray/streaming, and soft link to python/ray/streaming just like rllib.

We removed batched_queue and added cython based streaming queue implementation.

We moved execution graph related logic from Environment into ExecutionGraph.

We refactored operator_instance into a processor, and added a JobWorker actor to execute processors.

The java part will be submitted in following PRs.

Related issue number

#6184

Checks

[x] I've run scripts/format.sh to lint the changes in this PR.

[ ] I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.

[ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
opened by chaokunyang 125
[core worker] Python core worker object interface
What do these changes do?

This change adds a new Cython extension for the core worker and calls into it for all object-store-related operations.

To support this, it also adds two new methods to the core worker ObjectInterface that resemble the plasma store interface (Create and Seal). These allow us to directly write from Python memory into the object store via Arrow's SerializedPyObject without having to compile the Ray and Arrow Cython code together or add a dependency to Arrow in the core worker.

Related issue number

Linter

[x] I've run scripts/format.sh to lint the changes in this PR.
opened by edoakes 117
Discussion on batch Garbage Collection.

Hi @robertnishihara @pcmoritz , we are planning to add a batch Garbage Collection to Ray.

We have a concept called batchId (int64_t) used to do the Garbage Collection. For example, one job will use this batchId to generate all the objectIds and taskIds, and all these objectIds and taskIds will be stored under the Garbage Collection Table under the batchId in GCS. When the job is finished, we can simply pass a batchId to the garbage collector and the garbage collector will look up the Garbage Collection table in GCS and do the garbage collection to all the related tasks and objects.

In current id.h implementation, the lowest 32 bits in ObjectId is used for Object Index. We can use the higher 64 bits next to the Object Index as the batchId and add a new GC Table in GCS.

This GC mechanism will help release the memory resources in GCS and plasma. How do you think of this code change?

opened by guoyuhong 112
[tune] Cluster Fault Tolerance
What do these changes do?

A redo of #3165 with extraneous cleanup changes removed.

This currently does not use the same restoring code-path as #3238, but this can change later when component FT is implemented... (i.e., this doesn't notify components that some trials go RUNNING -> PENDING).

This adds the following functionality:

pickleable trials and TrialRunner.

checkpointing/restoring functionality for Trial runner

user endpoints for experiment checkpointing

Example:

In [6]: import time ...: import ray ...: from ray import tune ...: ...: ray.init() ...: ...: kwargs = dict( ...: run="__fake", ...: stop=dict(training_iteration=5), ...: checkpoint_freq=1, ...: max_failures=1) ...: ...: # This will save the experiment state to disk on each step ...: tune.run_experiments( ...: dict(experiment1=kwargs), ...: raise_on_failed_trial=False) ...:

TODO:

[x] User endpoints implemented.

[x] NODE FT: Add test for scheduler notification when nodes die and trials running -> pending

NOTE: this should be a lot easier to review after #3414 is merged.
opened by richardliaw 110
GCS-Based actor management implementation
Why are these changes needed?

Pls see the <Design Document> first.

This PR implements the creation and reconstruction of actors based on gcs server.

Changes on gcs server side

Several important classes are added: GcsActor, GcsActorManager, GcsActorScheduler.

GcsActor: An abstraction of actor at GcsServer side, which wrapper the ActorTableData and provides some simple interface to access the field inside ActorTableData.

GcsActorManager: It is responsible for managing the lifecycle of all registered actors.

GcsActorScheduler: It is responsible for scheduling actors registered to GcsActorManager, it also contains a inner class called GcsLeasedWorker which is an abstraction of remote leased worker in raylet.

In addition, this PR has also made some changes to GcsNodeManager, it is responsible for monitoring and manage nodes.

Changes on raylet side

In the old actor management scheme, raylet will be responsible for updating ActorTableData, while in the new GCS-Based actor management scheme, we expect that GCS will be responsible for updating all ActorTableData. So, you will see that all logic about updating ActorTableData will be get ride off.

Besides, the raylet should cache the relationship of actor and leased worker because that the raylet should fast reply gcs server without lease anything when gcs server rebuild actors after restart. Pls see the <Design Document>.

Chages on worker side

invoke the gcs_rpc_client.CreateActor on the callback of ResolveDependencies.

Fast reply the gcs server without create anyting if it is already bound with an actor when gcs server rebuild actors rebuild actors after restart.

Related issue number

Checks

[ ] I've run scripts/format.sh to lint the changes in this PR.

[ ] I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.

[ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
opened by wumuzi520 107
[carla] [rllib] Add support for carla nav planner and scenarios from paper

What do these changes do?

This adds navigation input from the carla planner, and also the ability to run all the scenarios from the CoRL 2017 paper.

Train scripts are updated to use a custom model that supports the planner input and nav metrics.

opened by ericl 107
[RLlib] Move all jenkins RLlib-tests into bazel (rllib/BUILD).
Why are these changes needed?

Related issue number

Checks

[x] I've run scripts/format.sh to lint the changes in this PR.

[x] I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.

[x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
opened by sven1977 102
Experimental asyncio support
What do these changes do?

This is a prototype implementation for https://github.com/ray-project/ray/issues/493 which provides awaitable interface for ray.wait & ray's ObjectID

As a prototype, these codes are meant to be modified later.

How do these changes work?

AsyncPlasmaClient is implemeted to override original pyarrow.plasma.PlasmaClient. pyarrow.plasma.PlasmaClient is created by pyarrow.plasma.connect and is attached to ray.worker.global_worker to handle basic ray functions. It also create an interface for wrapping ray's ObjectID.

AsyncPlasmaSocket is created for async socket messaging with PlasmaStore & PlasmaManager. It is the core of async. pyarrow.plasma.PlasmaClient does not make use of event loops and only create a single socket connection, it is why original ray does not support much of async functions. AsyncPlasmaSocket uses asyncio event loop and is capable of creating multiple socket connections with PlasmaManager.

plasma.fbs under format directory needs to be compiled with flatbuffer ahead of time.

Related issue number

https://github.com/ray-project/ray/issues/493

cc @mitar
opened by suquark 101
[] @ray.remote function runs on head node instead of worker node
What happened + What you expected to happen

I run a python function on ray cluster with @ray.remote, but the function runs on head node instead of worker node

Versions / Dependencies

Ray version: 2.0.0

Reproduction script

import ray @ray.remote def test_cpu3(): import time time.sleep(150) return "done" my_task = test_cpu3.remote() ray.get(my_task)

Issue Severity

None
bug triage
opened by Yifan122 0
[Core][deprecate run_on_all_workers 1/n] set worker's sys.path through JobConfig.code_search_path
Why are these changes needed?

Today we use run_on_all_workers to set worker's system path, where the run_on_all_workers suffers from weak ordering guarantees and will be deprecated.

Instead, we should use JobConfig.code_search_path to set worker's system path, which is already handled on worker startup: https://github.com/ray-project/ray/blob/master/python/ray/_private/workers/default_worker.py#L225

Related issue number

Checks

[ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.

[ ] I've run scripts/format.sh to lint the changes in this PR.

[ ] I've included any doc changes needed for https://docs.ray.io/en/master/.

[ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/

Testing Strategy

[ ] Unit tests

[ ] Release tests

[ ] This PR is not tested :(
opened by scv119 0
[data](deps): Bump aioboto3 from 8.3.0 to 10.2.0 in /python/requirements/data_processing
Bumps aioboto3 from 8.3.0 to 10.2.0.

Changelog

Sourced from aioboto3's changelog.

10.2.0 (2022-12-03)

Updated S3 streaming example

Bumped aiobotocore to 2.4.1

10.1.0 (2022-09-21)

Bumped aiobotocore to 2.4.0 - thanks @abivolmv

10.0.0 (2022-08-10)

Bumped aiobotocore to 2.3.4 - thanks @dacevedo12

Fixed async pytest fixtures which now work in pytest-asyncio strict mode

Fixed edge case in dynamodb batch writer loosing uncommitted writes - see #270, thanks @JamesVerrill

9.6.0 (2022-05-06)

Bumped aiobotocore to 2.3.0

9.5.0 (2022-03-29)

Bumped aiobotocore to 2.2.0 - thanks @dacevedo12

Updated formatting in various places to match the boto3 equivalent

9.4.0 (2022-03-13)

Bumped aiobotocore to 2.1.2

Updated asyncio.wait usage to be compatible with py3.11 - thanks @noblepayne

Fixed resource aexit not being used properly - thanks @chrisBLIT

Added S3 CopyFrom test coverage

Bumped Moto to 3.1.0

9.3.1 (2022-01-10)

Bumped aiobotocore to 2.1.0 - thanks @abivolmv

9.3.0 (2021-12-13)

Bumped aiobotocore to 2.0.1 - thanks @mmaslowskicc

9.2.2 (2021-10-06)

... (truncated)

Commits

ca796f0 Updated dependencies

f6b8ddb Update s3 streaming example

6e9cd0a Updated patch tests

524973e Fixed CI

e89f63c bump to aiobotocore 2.4.0

4d85e5f Bumped version to v10.0.0

07d6cea Updated async fixtures

07917bd Updated aiobotocore

2deccf0 fix the batch writer data loss bug which occurs with concurrent tasks using s...

7f23e5b Updated to aiobotocore 2.3.0

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies python
opened by dependabot[bot] 0

core - ray logs CLI doesn't work for kubernetes raycluster

What happened + What you expected to happen

I have port forwarded the GCS in a ray cluster in kubernetes to 127.0.0.1:6379 and the dashboard to 127.0.0.1:8265

$ ray logs cluster --address 127.0.0.1:6379
Traceback (most recent call last):
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa3143e7070>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.42.0.16', port=8265): Max retries exceeded with url: /api/v0/logs?node_ip=10.97.36.20&glob=%2A&timeout=30 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa3143e7070>: Failed to establish a new connection: [Errno 110] Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/compute/code/ray-demo/.venv/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2386, in main
    return cli()
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/ray/experimental/state/state_cli.py", line 930, in log_cluster
    logs = list_logs(
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/ray/experimental/state/api.py", line 1250, in list_logs
    r = requests.get(
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/compute/code/ray-demo/.venv/lib/python3.9/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.42.0.16', port=8265): Max retries exceeded with url: /api/v0/logs?node_ip=10.97.36.20&glob=%2A&timeout=30 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa3143e7070>: Failed to establish a new connection: [Errno 110] Connection timed out'))

It's trying to connect to the dashboard port on 10.42.0.16, ie: the head node's IP in the cluster, rather than 127.0.0.1.

ray status works fine:

ray status --address 127.0.0.1:6379 -v
======== Autoscaler status: 2022-12-31 03:58:49.312919 ========
GCS request time: 0.001129s
Node Provider non_terminated_nodes time: 0.038014s

Node status
---------------------------------------------------------------
Healthy:
 1 workergroup
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/2.0 CPU
 0.00/3.725 GiB memory
 0.00/1.030 GiB object_store_memory

Total Demands:
 (no resource demands)

Node: 10.42.0.16
 Usage:
  0.00/1.863 GiB memory
  0.00/0.495 GiB object_store_memory

Node: 10.42.1.19
 Usage:
  0.0/2.0 CPU
  0.00/1.863 GiB memory
  0.00/0.535 GiB object_store_memory

Versions / Dependencies

ray 2.2.0

Reproduction script

Create a raycluster in kubernetes.
Use the ray cli to connect to the exposed address of the GCS server, eg: ray logs cluster --address 127.0.0.1:6379

Issue Severity

Low: It annoys or frustrates me.

bug triage

opened by tekumara 0

[Core] ray submit --stop fails on aws
What happened + What you expected to happen

I would like to stop my remote computations by setting the --stop flag when using ray submit. However, this flag seems to be broken, at least with aws, as it results in the following error:

Failed to fetch IAM instance profile data for ray-autoscaler-v1 from AWS. Error code: AccessDenied !!! Boto3 error: An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:sts::961518004113:assumed-role/ray-autoscaler-v1/i-003a939b25fb7e013 is not authorized to perform: iam:GetInstanceProfile on resource: instance profile ray-autoscaler-v1 because no identity-based policy allows the iam:GetInstanceProfile action !!!

It works fine without --stop and manually shutting down with ray down, but that requires me to be around when the computation finishes.

Versions / Dependencies

I've tried this with ray 2.2 on both ubuntu with python 3.9 and 3.10.

Reproduction script

Here is a simple reproduction, although it takes a while due to ray's docker image being very large.

ray exec example-full.yaml --start --stop 'echo "hello world"'

Issue Severity

Low: It annoys or frustrates me.
bug P1 infra clusters
opened by vladfi1 1
[Serve] Return 400 response code for malformed Serve configs sent via REST API
Signed-off-by: Shreyas Krishnaswamy [email protected]

Why are these changes needed?

When users submit a malformed Serve config (e.g. with invalid fields) via PUT request, they receive a 500 error. This change makes the REST API return a 400 (unretryable) error instead.

Related issue number

Closes #31370

Checks

[x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.

[x] I've run scripts/format.sh to lint the changes in this PR.

[ ] I've included any doc changes needed for https://docs.ray.io/en/master/.

[ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/

Testing Strategy

[x] Unit tests

New unit tests are added to test_serve_agent.py.
opened by shrekris-anyscale 0

Releases(ray-2.2.0)

ray-2.2.0(Dec 13, 2022)
Release Highlights

Ray 2.2 is a stability-focused release, featuring stability improvements across many Ray components.

Ray Jobs API is now GA. The Ray Jobs API allows you to submit locally developed applications to a remote Ray Cluster for execution. It simplifies the experience of packaging, deploying, and managing a Ray application.

Ray Dashboard has received a number of improvements, such as the ability to see cpu flame graphs of your Ray workers and new metrics for memory usage.

The Out-Of-Memory (OOM) Monitor is now enabled by default. This will increase the stability of memory-intensive applications on top of Ray.

[Ray Data] we’ve heard numerous users report that when files are too large, Ray Data can have out-of-memory or performance issues. In this release, we’re enabling dynamic block splitting by default, which will address the above issues by avoiding holding too much data in memory.

Ray Libraries

Ray AIR

🎉 New Features:

Add a NumPy first path for Torch and TensorFlow Predictors (#28917)

💫Enhancements:

Suppress "NumPy array is not writable" error in torch conversion (#29808)

Add node rank and local world size info to session (#29919)

🔨 Fixes:

Fix MLflow database integrity error (#29794)

Fix ResourceChangingScheduler dropping PlacementGroupFactory args (#30304)

Fix bug passing 'raise' to FailureConfig (#30814)

Fix reserved CPU warning if no CPUs are used (#30598)

📖Documentation:

Fix examples and docs to specify batch_format in BatchMapper (#30438)

🏗 Architecture refactoring:

Deprecate Wandb mixin (#29828)

Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#30365)

Ray Data Processing

🎉 New Features:

Support all PyArrow versions released by Apache Arrow (#29993, #29999)

Add select_columns() to select a subset of columns (#29081)

Add write_tfrecords() to write TFRecord files (#29448)

Support MongoDB data source (#28550)

Enable dynamic block splitting by default (#30284)

Add from_torch() to create dataset from Torch dataset (#29588)

Add from_tf() to create dataset from TensorFlow dataset (#29591)

Allow to set batch_size in BatchMapper (#29193)

Support read/write from/to local node file system (#29565)

💫Enhancements:

Add include_paths in read_images() to return image file path (#30007)

Print out Dataset statistics automatically after execution (#29876)

Cast tensor extension type to opaque object dtype in to_pandas() and to_dask() (#29417)

Encode number of dimensions in variable-shaped tensor extension type (#29281)

Fuse AllToAllStage and OneToOneStage with compatible remote args (#29561)

Change read_tfrecords() output from Pandas to Arrow format (#30390)

Handle all Ray errors in task compute strategy (#30696)

Allow nested Chain preprocessors (#29706)

Warn user if missing columns and support str exclude in Concatenator (#29443)

Raise ValueError if preprocessor column doesn't exist (#29643)

🔨 Fixes:

Support custom resource with remote args for random_shuffle() (#29276)

Support custom resource with remote args for random_shuffle_each_window() (#29482)

Add PublicAPI annotation to preprocessors (#29434)

Tensor extension column concatenation fixes (#29479)

Fix iter_batches() to not return empty batch (#29638)

Change map_batches() to fetch input blocks on-demand (#29289)

Change take_all() to not accept limit argument (#29746)

Convert between block and batch correctly for map_groups() (#30172)

Fix stats() call causing Dataset schema to be unset (#29635)

Raise error when batch_format is not specified for BatchMapper (#30366)

Fix ndarray representation of single-element ragged tensor slices (#30514)

📖Documentation:

Improve map_batches() documentation about execution model and UDF pickle-ability requirement (#29233)

Improve to_tf() docstring (#29464)

Ray Train

🎉 New Features:

Added MosaicTrainer (#29237, #29620, #29919)

💫Enhancements:

Fast fail upon single worker failure (#29927)

Optimize checkpoint conversion logic (#29785)

🔨 Fixes:

Propagate DatasetContext to training workers (#29192)

Show correct error message on training failure (#29908)

Fix prepare_data_loader with enable_reproducibility (#30266)

Fix usage of NCCL_BLOCKING_WAIT (#29562)

📖Documentation:

Deduplicate Train examples (#29667)

🏗 Architecture refactoring:

Hard deprecate train.report (#29613)

Remove deprecated Train modules (#29960)

Deprecate old prepare_model DDP args #30364

Ray Tune

🎉 New Features:

Make Tuner.restore work with relative experiment paths (#30363)

Tuner.restore from a local directory that has moved (#29920)

💫Enhancements:

with_resources takes in a ScalingConfig (#30259)

Keep resource specifications when nesting with_resources in with_parameters (#29740)

Add trial_name_creator and trial_dirname_creator to TuneConfig (#30123)

Add option to not override the working directory (#29258)

Only convert a BaseTrainer to Trainable once in the Tuner (#30355)

Dynamically identify PyTorch Lightning Callback hooks (#30045)

Make remote_checkpoint_dir work with query strings (#30125)

Make cloud checkpointing retry configurable (#30111)

Sync experiment-checkpoints more often (#30187)

Update generate_id algorithm (#29900)

🔨 Fixes:

Catch SyncerCallback failure with dead node (#29438)

Do not warn in BayesOpt w/ Uniform sampler (#30350)

Fix ResourceChangingScheduler dropping PGF args (#30304)

Fix Jupyter output with Ray Client and Tuner (#29956)

Fix tests related to TUNE_ORIG_WORKING_DIR env variable (#30134)

📖Documentation:

Add user guide for analyzing results (using ResultGrid and Result) (#29072)

Tune checkpointing and Tuner restore docfix (#29411)

Fix and clean up PBT examples (#29060)

Fix TrialTerminationReporter in docs (#29254)

🏗 Architecture refactoring:

Remove hard deprecated SyncClient/Syncer (#30253)

Deprecate Wandb mixin, move to setup_wandb() function (#29828)

Ray Serve

🎉 New Features:

Guard for high latency requests (#29534)

Java API Support (blog)

💫Enhancements:

Serve K8s HA benchmarking (#30278)

Add method info for http metrics (#29918)

🔨 Fixes:

Fix log format error (#28760)

Inherit previous deployment num_replicas (29686)

Polish serve run deploy message (#29897)

Remove calling of get_event_loop from python 3.10

RLlib

🎉 New Features:

Fault tolerant, elastic WorkerSets: An asynchronous Ray Actor manager class is now used inside all of RLlib’s Algorithms, adding fully flexible fault tolerance to rollout workers and workers used for evaluation. If one or more workers (which are Ray actors) fails - e.g. due to a SPOT instance going down - the RLlib Algorithm will now flexibly wait it out and periodically try to recreate the failed workers. In the meantime, only the remaining healthy workers are used for sampling and evaluation. (#29938, #30118, #30334, #30252, #29703, #30183, #30327, #29953)

💫Enhancements:

RLlib CLI: A new and enhanced RLlib command line interface (CLI) has been added, allowing for automatically downloading example configuration files, python-based config files (defining an AlgorithmConfig object to use), better interoperability between training and evaluation runs, and many more. For a detailed overview of what has changed, check out the new CLI documentation. (#29204, #29459, #30526, #29661, #29972)

Checkpoint overhaul: Algorithm checkpoints and Policy checkpoints are now more cohesive and transparent. All checkpoints are now characterized by a directory (with files and maybe sub-directories), rather than a single pickle file; Both Algorithm and Policy classes now have a utility static method (from_checkpoint()) for directly instantiating instances from a checkpoint directory w/o knowing the original configuration used or any other information (having the checkpoint is sufficient). For a detailed overview, see here. (#28812, #29772, #29370, #29520, #29328)

A new metric for APPO/IMPALA/PPO has been added that measures off-policy’ness: The difference in number of grad-updates the sampler policy has received thus far vs the trained policy’s number of grad-updates thus far. (#29983)

🏗 Architecture refactoring:

AlgorithmConfig classes: All of RLlib’s Algorithms, RolloutWorkers, and other important classes now use AlgorithmConfig objects under the hood, instead of python config dicts. It is no longer recommended (however, still supported) to create a new algorithm (or a Tune+RLlib experiment) using a python dict as configuration. For more details on how to convert your scripts to the new AlgorithmConfig design, see here. (#29796, #30020, #29700, #29799, #30096, #29395, #29755, #30053, #29974, #29854, #29546, #30042, #29544, #30079, #30486, #30361)

Major progress was made on the new Connector API and making sure it can be used (tentatively) with the “config.rollouts(enable_connectors=True)” flag. Will be fully supported, across all of RLlib’s algorithms, in Ray 2.3. (#30307, #30434, #30459, #30308, #30332, #30320, #30383, #30457, #30446, #30024, #29064, #30398, #29385, #30481, #30241, #30285, #30423, #30288, #30313, #30220, #30159)

Progress was made on the upcoming RLModule/RLTrainer/RLOptimizer APIs. (#30135, #29600, #29599, #29449, #29642)

🔨 Fixes:

Various bug fixes: #25925, #30279, #30478, #30461, #29867, #30099, #30185, #29222, #29227, #29494, #30257, #29798, #30176, #29648, #30331

📖Documentation:

RLlib CLI, Checkpoint overhaul, AlgorithmConfigs

Minor fixes: #29261, #29752

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Out-of-memory monitor is now Beta and is enabled by default.

💫Enhancements:

The Ray Jobs API has graduated from Beta to GA. This means Ray Jobs will maintain API backward compatibility.

Run Ray job entrypoint commands (“driver scripts”) on worker nodes by specifying entrypoint_num_cpus, entrypoint_num_gpus, or entrypoint_resources. (#28564, #28203)

(Beta) OpenAPI spec for Ray Jobs REST API (#30417)

Improved Ray health checking mechanism. The fix will reduce the frequency of GCS marking raylets fail mistakenly when it is overloaded. (#29346, #29442, #29389, #29924)

🔨 Fixes:

Various fixes for hanging / deadlocking (#29491, #29763, #30371, #30425)

Set OMP_NUM_THREADS to num_cpus required by task/actors by default (#30496)

set worker non recyclable if gpu is envolved by default (#30061)

📖Documentation:

General improvements of Ray Core docs, including design patterns and tasks.

Ray Clusters

💫Enhancements:

Stability improvements for Ray Autoscaler / KubeRay Operator integration. (#29933 , #30281, #30502)

Dashboard

🎉 New Features:

Additional improvements from the default metrics dashboard. We now have actor, placement group, and per component memory usage breakdown. You can see details from the doc.

New profiling feature using py-spy under the hood. You can click buttons to see stack trace or cpu flame graphs of your workers.

Autoscaler and job events are available from the dashboard. You can also access the same data using ray list cluster-events.

🔨 Fixes:

Stability improvements from the dashboard

Dashboard now works at large scale cluster! It is tested with 250 nodes and 10K+ actors (which matches the Ray scalability envelope).

Smarter api fetching logic. We now wait for the previous API to finish before sending a new API request when polling for new data.

Fix agent memory leak and high CPU usage.

💫Enhancements:

General improvements to the progress bar. You can now see progress bars for each task name if you drill into the job details.

More metadata is available in the jobs and actors tables.

There is now a feedback button embedded into the dashboard. Please submit any bug reports or suggestions!

Many thanks to all those who contributed to this release!

@shrekris-anyscale, @rickyyx, @scottjlee, @shogohida, @liuyang-my, @matthewdeng, @wjrforcyber, @linusbiostat, @clarkzinzow, @justinvyu, @zygi, @christy, @amogkam, @cool-RR, @jiaodong, @EvgeniiTitov, @jjyao, @ilee300a, @jianoaix, @rkooo567, @mattip, @maxpumperla, @ericl, @cadedaniel, @bveeramani, @rueian, @stephanie-wang, @lcipolina, @bparaj, @JoonHong-Kim, @avnishn, @tomsunelite, @larrylian, @alanwguo, @VishDev12, @c21, @dmatrix, @xwjiang2010, @thomasdesr, @tiangolo, @sokratisvas, @heyitsmui, @scv119, @pcmoritz, @bhavika, @yzs981130, @andraxin, @Chong-Li, @clarng, @acxz, @ckw017, @krfricke, @kouroshHakha, @sijieamoy, @iycheng, @gjoliver, @peytondmurray, @xcharleslin, @DmitriGekhtman, @andreichalapco, @vitrioil, @architkulkarni, @simon-mo, @ArturNiederfahrenhorst, @sihanwang41, @pabloem, @sven1977, @avivhaber, @wuisawesome, @jovany-wang, @Yard1
Source code(tar.gz)
Source code(zip)
ray-2.1.0(Nov 8, 2022)
Release Highlights

Ray AI Runtime (AIR)

Better support for Image-based workloads.

Ray Datasets read_images() API for loading data.

Numpy-based API for user-defined functions in Preprocessor.

Ability to read TFRecord input.

Ray Datasets read_tfrecords() API to read TFRecord files.

Ray Serve:

Add support for gRPC endpoint (alpha release). Instead of using an HTTP server, Ray Serve supports gRPC protocol and users can bring their own schema for their use case.

RLlib:

Introduce decision transformer (DT) algorithm.

New hook for callbacks with on_episode_created().

Learning rate schedule to SimpleQ and PG.

Ray Core:

Ray OOM prevention (alpha release).

Support dynamic generators as task return values.

Dashboard:

Time series metrics support.

Export configuration files can be used in Prometheus or Grafana instances.

New progress bar in job detail view.

Ray Libraries

Ray AIR

💫Enhancements:

Improve readability of training failure output (#27946, #28333, #29143)

Auto-enable GPU for Predictors (#26549)

Add ability to create TorchCheckpoint from state dict (#27970)

Add ability to create TensorflowCheckpoint from saved model/h5 format (#28474)

Add attribute to retrieve URI from Checkpoint (#28731)

Add all allowable types to WandB Callback (#28888)

🔨 Fixes:

Handle nested metrics properly as scoring attribute (#27715)

Fix serializability of Checkpoints (#28387, #28895, #28935)

📖Documentation:

Miscellaneous updates to documentation and examples (#28067, #28002, #28189, #28306, #28361, #28364, #28631, #28800)

🏗 Architecture refactoring:

Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#28318)

Deprecate legacy train/tune functions in favor of Session (#28856)

Ray Data Processing

🎉 New Features:

Add read_images (#29177)

Add read_tfrecords (#28430)

Add NumPy batch format to Preprocessor and BatchMapper (#28418)

Ragged tensor extension type (#27625)

Add KBinsDiscretizer Preprocessor (#28389)

💫Enhancements:

Simplify to_tf interface (#29028)

Add metadata override and inference in Dataset.to_dask() (#28625)

Prune unused columns before aggregate (#28556)

Add Dataset.default_batch_format (#28434)

Add partitioning parameter to read_ functions (#28413)

Deprecate "native" batch format in favor of "default" (#28489)

Support None partition field name (#28417)

Re-enable Parquet sampling and add progress bar (#28021)

Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964)

Customized serializer for Arrow JSON ParseOptions in read_json (#27911)

Optimize groupby/mapgroups performance (#27805)

Improve size estimation of image folder data source (#27219)

Use detached lifetime for stats actor (#25271)

Pin _StatsActor to the driver node (#27765)

Better error message for partition filtering if no file found (#27353)

Make Concatenator deterministic (#27575)

Change FeatureHasher input schema to expect token counts (#27523)

Avoid unnecessary reads when truncating a dataset with ds.limit() (#27343)

Hide tensor extension from UDFs (#27019)

Add repr to AIR classes (#27006)

🔨 Fixes:

Add upper bound to pyarrow version check (#29674) (#29744)

Fix map_groups to work with different output type (#29184)

read_csv not filter out files by default (#29032)

Check columns when adding rows to TableBlockBuilder (#29020)

Fix the peak memory usage calculation (#28419)

Change sampling to use same API as read Parquet (#28258)

Fix column assignment in Concatenator for Pandas 1.2. (#27531)

Doing partition filtering in reader constructor (#27156)

Fix split ownership (#27149)

📖Documentation:

Clarify dataset transformation. (#28482)

Update map_batches documentation (#28435)

Improve docstring and doctest for read_parquet (#28488)

Activate dataset doctests (#28395)

Document using a different separator for read_csv (#27850)

Convert custom datetime column when reading a CSV file (#27854)

Improve preprocessor documentation (#27215)

Improve limit() and take() docstrings (#27367)

Reorganize the tensor data support docs (#26952)

Fix nyc_taxi_basic_processing notebook (#26983)

Ray Train

🎉 New Features:

Add FullyShardedDataParallel support to TorchTrainer (#28096)

💫Enhancements:

Add rich notebook repr for DataParallelTrainer (#26335)

Fast fail if training loop raises an error on any worker (#28314)

Use torch.encode_data with HorovodTrainer when torch is imported (#28440)

Automatically set NCCL_SOCKET_IFNAME to use ethernet (#28633)

Don't add Trainer resources when running on Colab (#28822)

Support large checkpoints and other arguments (#28826)

🔨 Fixes:

Fix and improve HuggingFaceTrainer (#27875, #28154, #28170, #28308, #28052)

Maintain dtype info in LightGBMPredictor (#28673)

Fix prepare_model (#29104)

Fix train.torch.get_device() (#28659)

📖Documentation:

Clarify LGBM/XGB Trainer documentation (#28122)

Improve Hugging Face notebook example (#28121)

Update Train API reference and docs (#28192)

Mention FSDP in HuggingFaceTrainer docs (#28217)

🏗 Architecture refactoring:

Improve Trainer modularity for extensibility (#28650)

Ray Tune

🎉 New Features:

Add Tuner.get_results() to retrieve results after restore (#29083)

💫Enhancements:

Exclude files in sync_dir_between_nodes, exclude temporary checkpoints (#27174)

Add rich notebook output for Tune progress updates (#26263)

Add logdir to W&B run config (#28454)

Improve readability for long column names in table output (#28764)

Add functionality to recover from latest available checkpoint (#29099)

Add retry logic for restoring trials (#29086)

🔨 Fixes:

Re-enable progress metric detection (#28130)

Add timeout to retry_fn to catch hanging syncs (#28155)

Correct PB2’s beta_t parameter implementation (#28342)

Ignore directory exists errors to tackle race conditions (#28401)

Correctly overwrite files on restore (#28404)

Disable pytorch-lightning multiprocessing per default (#28335)

Raise error if scheduling an empty PlacementGroupFactory#28445

Fix trial cleanup after x seconds, set default to 600 (#28449)

Fix trial checkpoint syncing after recovery from other node (#28470)

Catch empty hyperopt search space, raise better Tuner error message (#28503)

Fix and optimize sample search algorithm quantization logic (#28187)

Support tune.with_resources for class methods (#28596)

Maintain consistent Trial/TrialRunner state when pausing and resuming trial with PBT (#28511)

Raise better error for incompatible gcsfs version (#28772)

Ensure that exploited in-memory checkpoint is used by trial with PBT (#28509)

Fix Tune checkpoint tracking for minimizing metrics (#29145)

📖Documentation:

Miscelleanous documentation fixes (#27117, #28131, #28210, #28400, #28068, #28809)

Add documentation around trial/experiment checkpoint (#28303)

Add basic parallel execution guide for Tune (#28677)

Add example PBT notebook (#28519)

🏗 Architecture refactoring:

Store SyncConfig and CheckpointConfig in Experiment and Trial (#29019)

Ray Serve

🎉 New Features:

Added gRPC direct ingress support [alpha version] (#28175)

Serve cli can provide kubernetes formatted output (#28918)

Serve cli can provide user config output without default value (#28313)

💫Enhancements:

Enrich more benchmarks

image objection with resnet50 mode with image preprocessing (#29096)

gRPC vs HTTP inference performance (#28175)

Add health check metrics to reflect the replica health status (#29154)

🔨 Fixes:

Fix memory leak issues during inference (#29187)

Fix unexpected http options omit warning when using serve cli to start the ray serve (#28257)

Fix unexpected long poll exceptions (#28612)

📖Documentation:

Add e2e fault tolerance instructions (#28721)

Add Direct Ingress instructions (#29149)

Bunch of doc improvements on “dev workflow”, “custom resources”, “serve cli” etc (#29147, #28708, #28529, #28527)

RLlib

🎉 New Features:

Decision Transformer (DT) Algorithm added (#27890, #27889, #27872, #27829).

Callbacks now have a new hook on_episode_created(). (#28600)

Added learning rate schedule to SimpleQ and PG. (#28381)

💫Enhancements:

Soft target network update is now supported by all off-policy algorithms (e.g DQN, DDPG, etc.) (#28135)

Stop RLlib from "silently" selecting atari preprocessors. (#29011)

Improved offline RL and off-policy evaluation performance (#28837, #28834, #28593, #28420, #28136, #28013, #27356, #27161, #27451).

Escalated old deprecation warnings to errors (#28807, #28795, #28733, #28697).

Others: #27619, #27087.

🔨 Fixes:

Various bug fixes: #29077, #28811, #28637, #27785, #28703, #28422, #28405, #28358, #27540, #28325, #28357, #28334, #27090, #28133, #27981, #27980, #26666, #27390, #27791, #27741, #27424, #27544, #27459, #27572, #27255, #27304, #26629, #28166, #27864, #28938, #28845, #28588, #28202, #28201, #27806

📖Documentation:

Connectors. (#27528)

Training step API. (#27344)

Others: #28299, #27460

Ray Workflows

🔨 Fixes:

Fixed the object loss due to driver exit (#29092)

Change the name in step to task_id (#28151)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Ray OOM prevention feature alpha release! If your Ray jobs suffer from OOM issues, please give it a try.

Support dynamic generators as task return values. (#29082 #28864 #28291)

💫Enhancements:

Fix spread scheduling imbalance issues (#28804 #28551 #28551)

Widening range of grpcio versions allowed (#28623)

Support encrypted redis connection. (#29109)

Upgrade redis from 6.x to 7.0.5. (#28936)

Batch ScheduleAndDispatchTasks calls (#28740)

🔨 Fixes:

More robust spilled object deletion (#29014)

Fix the initialization/destruction order between reference_counter_ and node change subscription (#29108)

Suppress the logging error when python exits and actor not deleted (#27300)

Mark run_function_on_all_workers as deprecated until we get rid of this (#29062)

Remove unused args for default_worker.py (#28177)

Don't include script directory in sys.path if it's started via python -m (#28140)

Handling edge cases of max_cpu_fraction argument (#27035)

Fix out-of-band deserialization of actor handle (#27700)

Allow reuse of cluster address if Ray is not running (#27666)

Fix a uncaught exception upon deallocation for actors (#27637)

Support placement_group=None in PlacementGroupSchedulingStrategy (#27370)

📖Documentation:

Ray 2.0 white paper is published.

Revamp ray core docs (#29124 #29046 #28953 #28840 #28784 #28644 #28345 #28113 #27323 #27303)

Fix cluster docs (#28056 #27062)

CLI Reference Documentation Revamp (#27862)

Ray Clusters

💫Enhancements:

Distinguish Kubernetes deployment stacks (#28490)

📖Documentation:

State intent to remove legacy Ray Operator (#29178)

Improve KubeRay migration notes (#28672)

Add FAQ for cluster multi-tenancy support (#29279)

Dashboard

🎉 New Features:

Time series metrics are now built into the dashboard

Ray now exports some default configuration files which can be used for your Prometheus or Grafana instances. This includes default metrics which show common information important to your Ray application.

New progress bar is shown in the job detail view. You can see how far along your ray job is.

🔨 Fixes:

Fix to prometheus exporter producing a slightly incorrect format.

Fix several performance issues and memory leaks

📖Documentation:

Added additional documentation on the new time series and the metrics page

Many thanks to all those who contributed to this release!

@sihanwang41, @simon-mo, @avnishn, @MyeongKim, @markrogersjr, @christy, @xwjiang2010, @kouroshHakha, @zoltan-fedor, @wumuzi520, @alanwguo, @Yard1, @liuyang-my, @charlesjsun, @DevJake, @matteobettini, @jonathan-conder-sm, @mgerstgrasser, @guidj, @JiahaoYao, @Zyiqin-Miranda, @jvanheugten, @aallahyar, @SongGuyang, @clarng, @architkulkarni, @Rohan138, @heyitsmui, @mattip, @ArturNiederfahrenhorst, @maxpumperla, @vale981, @krfricke, @DmitriGekhtman, @amogkam, @richardliaw, @maldil, @zcin, @jianoaix, @cool-RR, @kira-lin, @gramhagen, @c21, @jiaodong, @sijieamoy, @tupui, @ericl, @anabranch, @se4ml, @suquark, @dmatrix, @jjyao, @clarkzinzow, @smorad, @rkooo567, @jovany-wang, @edoakes, @XiaodongLv, @klieret, @rozsasarpi, @scottsun94, @ijrsvt, @bveeramani, @chengscott, @jbedorf, @kevin85421, @nikitavemuri, @sven1977, @acxz, @stephanie-wang, @PaulFenton, @WangTaoTheTonic, @cadedaniel, @nthai, @wuisawesome, @rickyyx, @artemisart, @peytondmurray, @pingsutw, @olipinski, @davidxia, @stestagg, @yaxife, @scv119, @mwtian, @yuanchi2807, @ntlm1686, @shrekris-anyscale, @cassidylaidlaw, @gjoliver, @ckw017, @hakeemta, @ilee300a, @avivhaber, @matthewdeng, @afarid, @pcmoritz, @Chong-Li, @Catch-Bull, @justinvyu, @iycheng
Source code(tar.gz)
Source code(zip)
ray-2.0.1(Oct 22, 2022)
The Ray 2.0.1 patch release contains dependency upgrades and fixes for multiple components:

Upgrade grpcio version to 1.32 (#28025)

Upgrade redis version to 7.0.5 (#28936)

Fix segfault when using runtime environments (#28409)

Increase RPC timeout for dashboard (#28330)

Set correct path when using python -m (#28140)

[Autoscaler] Fix autoscaling for 0 CPU head node (#26813)

[Serve] Allow code in private remote Git URIs to be imported (#28250)

[Serve] Allow host and port in Serve config (#27026)

[RLlib] Evaluation supports asynchronous rollout (single slow eval worker will not block the overall evaluation progress). (#27390)

[Tune] Fix hang during checkpoint synchronization (#28155)

[Tune] Fix trial restoration from different IP (#28470)

[Tune] Fix custom synchronizer serialization (#28699)

[Workflows] Replace deprecated name option with task_id (#28151)

Source code(tar.gz)
Source code(zip)
ray-2.0.0(Aug 23, 2022)
Release Highlights

Ray 2.0 is an exciting release with enhancements to all libraries in the Ray ecosystem. With this major release, we take strides towards our goal of making distributed computing scalable, unified, and open.

Towards these goals, Ray 2.0 features new capabilities for unifying the machine learning (ML) ecosystem, improving Ray's production support, and making it easier than ever for ML practitioners to use Ray's libraries.

Highlights:

Ray AIR, a scalable and unified toolkit for ML applications, is now in Beta.

Ray now supports natively shuffling 100TB or more of data with the Ray Datasets library.

KubeRay, a toolkit for running Ray on Kubernetes, is now in Beta. This replaces the legacy Python-based Ray operator.

Ray Serve’s Deployment Graph APIis a new and easier way to build, test, and deploy an inference graph of deployments. This is released as Beta in 2.0.

A migration guide for all the different libraries can be found here: Ray 2.0 Migration Guide.

Ray Libraries

Ray AIR

Ray AIR is now in beta. Ray AIR builds upon Ray’s libraries to enable end-to-end machine learning workflows and applications on Ray. You can install all dependencies needed for Ray AIR via pip install -u "ray[air]".

🎉 New Features:

Predictors:

BatchPredictors now have support for scalable inference on GPUs.

All Predictors can now be constructed from pre-trained models, allowing you to easily scale batch inference with trained models from common ML frameworks.

ray.ml.predictors has been moved to the Ray Train namespace (ray.train).

Preprocessing: New preprocessors and API changes on Ray Datasets now make feature processing easier to do on AIR. See the Ray Data release notes for more details.

New features for Datasets/Train/Tune/Serve can be found in the corresponding library release notes for more details.

💫 Enhancements:

Major package refactoring is included in this release.

ray.ml is renamed to ray.air.

ray.ml.preprocessors have been moved to ray.data.

train_test_split is now a new method of ray.data.Dataset (#27065)

ray.ml.trainers have been moved to ray.train (#25570)

ray.ml.predictors has been moved to ray.train.

ray.ml.config has been moved to ray.air.config (#25712).

Checkpoints are now framework-specific -- meaning that each Trainer generates its own Framework-specific Checkpoint class. See Ray Train for more details.

ModelWrappers have been renamed to PredictorDeployments.

API stability annotations have been added (#25485)

Train/Tune now have the same reporting and checkpointing API -- see the Train notes for more details (#26303)

ScalingConfigs are now Dataclasses not Dict types

Many AIR examples, benchmarks, and documentation pages were added in this release. The Ray AIR documentation will cover breadth of usage (end to end workflows across different libraries) while library-specific documentation will cover depth (specific features of a specific library).

🔨 Fixes:

Many documentation examples were previously untested. This release fixes those examples and adds them to the CI.

Predictors:

Torch/Tensorflow Predictors have correctness fixes (#25199, #25190, #25138, #25136)

Update KerasCallback to work with TensorflowPredictor (#26089)

Add streaming BatchPredictor support (#25693)

Add predict_pandas implementation (#25534)

Add _predict_arrow interface for Predictor (#25579)

Allow creating Predictor directly from a UDF (#26603)

Execute GPU inference in a separate stage in BatchPredictor (#26616, #27232, #27398)

Accessors for preprocessor in Predictor class (#26600)

[AIR] Predictor call_model API for unsupported output types (#26845)

Ray Data Processing

🎉 New Features:

Add ImageFolderDatasource (#24641)

Add the NumPy batch format for batch mapping and batch consumption (#24870)

Add iter_torch_batches() and iter_tf_batches() APIs (#26689)

Add local shuffling API to iterators (#26094)

Add drop_columns() API (#26200)

Add randomize_block_order() API (#25568)

Add random_sample() API (#24492)

Add support for len(Dataset) (#25152)

Add UDF passthrough args to map_batches() (#25613)

Add Concatenator preprocessor (#26526)

Change range_arrow() API to range_table() (#24704)

💫 Enhancements:

Autodetect dataset parallelism based on available resources and data size (#25883)

Use polars for sorting (#25454)

Support tensor columns in to_tf() and to_torch() (#24752)

Add explicit resource allocation option via a top-level scheduling strategy (#24438)

Spread actor pool actors evenly across the cluster by default (#25705)

Add ray_remote_args to read_text() (#23764)

Add max_epoch argument to iter_epochs() (#25263)

Add Pandas-native groupby and sorting (#26313)

Support push-based shuffle in groupby operations (#25910)

More aggressive memory releasing for Dataset and DatasetPipeline (#25461, #25820, #26902, #26650)

Automatically cast tensor columns on Pandas UDF outputs (#26924)

Better error messages when reading from S3 (#26619, #26669, #26789)

Make dataset splitting more efficient and stable (#26641, #26768, #26778)

Use sampling to estimate in-memory data size for Parquet data source (#26868)

De-experimentalized lazy execution mode (#26934)

🔨 Fixes:

Fix pipeline pre-repeat caching (#25265)

Fix stats construction for from_*() APIs (#25601)

Fixes label tensor squeezing in to_tf() (#25553)

Fix stage fusion between equivalent resource args (fixes BatchPredictor) (#25706)

Fix tensor extension string formatting (repr) (#25768)

Workaround for unserializable Arrow JSON ReadOptions (#25821)

Make ActorPoolStrategy kill pool of actors if exception is raised (#25803)

Fix max number of actors for default actor pool strategy (#26266)

Fix byte size calculation for non-trivial tensors (#25264)

Ray Train

Ray Train has received a major expansion of scope with Ray 2.0.

In particular, the Ray Train module now contains:

Trainers

Predictors

Checkpoints

for common different ML frameworks including Pytorch, Tensorflow, XGBoost, LightGBM, HuggingFace, and Scikit-Learn. These API help provide end-to-end usage of Ray libraries in Ray AIR workflows.

🎉 New Features:

The Trainer API is now deprecated for the new Ray AIR Trainers API. Trainers for Pytorch, Tensorflow, Horovod, XGBoost, and LightGBM are now in Beta. (#25570)

ML framework-specific Predictors have been moved into the ray.train namespace. This provides streamlined API for offline and online inference of Pytorch, Tensorflow, XGBoost models and more. (#25769 #26215, #26251, #26451, #26531, #26600, #26603, #26616, #26845)

ML framework-specific checkpoints are introduced. Checkpoints are consumed by Predictors to load model weights and information. (#26777, #25940, #26532, #26534)

💫 Enhancements:

Train and Tune now use the same reporting and checkpointing API (#24772, #25558)

Add tunable ScalingConfig dataclass (#25712)

Randomize block order by default to avoid hotspots (#25870)

Improve checkpoint configurability and extend results (#25943)

Improve prepare_data_loader to support multiple batch data types (#26386)

Discard returns of train loops in Trainers (#26448)

Clean up logs, reprs, warning s(#26259, #26906, #26988, #27228, #27519)

📖 Documentation:

Update documentation to use new Train API (#25735)

Update documentation to use session API (#26051, #26303)

Add Trainer user guide and update Trainer docs (#27570, #27644, #27685)

Add Predictor documentation (#25833)

Replace to_torch with iter_torch_batches (#27656)

Replace to_tf with iter_tf_batches (#27768)

Minor doc fixes (#25773, #27955)

🏗 Architecture refactoring:

Clean up ray.train package (#25566)

Mark Trainer interfaces as Deprecated (#25573)

🔨 Fixes:

An issue with GPU ID detection and assignment was fixed. (#26493)

Fix AMP for models with a custom __getstate__ method (#25335)

Fix transformers example for multi-gpu (#24832)

Fix ScalingConfig key validation (#25549)

Fix ResourceChangingScheduler integration (#26307)

Fix auto_transfer cuda device (#26819)

Fix BatchPredictor.predict_pipelined not working with GPU stage (#27398)

Remove rllib dependency from tensorflow_predictor (#27688)

Ray Tune

🎉 New Features:

The Tuner API is the new way of running Ray Tune experiments. (#26987, #26987, #26961, #26931, #26884, #26930)

Ray Tune and Ray Train now have the same API for reporting (#25558)

Introduce tune.with_resources() to specify function trainable resources (#26830)

Add Tune benchmark for AIR (#26763, #26564)

Allow Tuner().restore() from cloud URIs (#26963)

Add top-level imports for Tuner, TuneConfig, move CheckpointConfig (#26882)

Add resume experiment options to Tuner.restore() (#26826)

Add checkpoint_frequency/checkpoint_at_end arguments to CheckpointConfig (#26661)

Add more config arguments to Tuner (#26656)

Better error message for Tune nested tasks / actors (#25241)

Allow iterators in tune.grid_search (#25220)

Add get_dataframe() method to result grid, fix config flattening (#24686)

💫 Enhancements:

Expose number of errored/terminated trials in ResultGrid (#26655)

remove fully_executed from Tune. (#25750)

Exclude in remote storage upload (#25544)

Add TempFileLock (#25408)

Add annotations/set scope for Tune classes (#25077)

📖 Documentation:

Improve Tune + Datasets documentation (#25389)

Tune examples better navigation, minor fixes (#24733)

🏗 Architecture refactoring:

Consolidate checkpoint manager 3: Ray Tune (#24430)

Clean up ray.tune scope (remove stale objects in all) (#26829)

🔨 Fixes:

Fix k8s release test + node-to-node syncing (#27365)

Fix Tune custom syncer example (#27253)

Fix tune_cloud_aws_durable_upload_rllib_* release tests (#27180)

Fix test_tune (#26721)

Larger head node for tune_scalability_network_overhead weekly test (#26742)

Fix tune-sklearn notebook example (#26470)

Fix reference to dataset_tune (#25402)

Fix Tune-Pytorch-CIFAR notebook example (#26474)

Fix documentation testing (#26409)

Fix set_tune_experiment (#26298)

Fix GRPC resource exhausted test for tune trainables (#24467)

Ray Serve

🎉 New Features:

We are excited to introduce you to the 2.0 API centered around multi-model composition API, operation API, and production stability. (#26310,#26507,#26217,#25932,#26374,#26901,#27058,#24549,#24616,#27479,#27576,#27433,#24306,#25651,#26682,#26521,#27194,#27206,#26804,#25575,#26574)

Deployment Graph API is the new API for model composition. It provides a declarative layer on top of the 1.x deployment API to help you author performant inference pipeline easily. (#27417,#27420,#24754,#24435,#24630,#26573,#27349,#24404,#25424,#24418,#27815,#27844,#25453,#24629)

We introduced a new K8s native way to deploy Ray Serve. Along with a brand new REST API to perform deployment, update, and configure. (#25935,#27063,#24814,#26093,#25213,#26588,#25073,#27000,#27444,#26578,#26652,#25610,#25502,#26096,#24265,#26177,#25861,#25691,#24839,#27498,#27561,#25862,#26347)

Serve can now survive Ray GCS failure. This used to be a single-point-of-failure in Ray Serve's architecture. Now, when the GCS goes down, Serve can continue to Serve traffic. We recommend you to try out this feature and give us feedback! (#25633,#26107,#27608,#27763,#27771,#25478,#25637,#27526,#27674,#26753,#26797,#24560,#26685,#26734,#25987,#25091,#24934)

Autoscaling has been promoted to stable. Additionally, we added a scale to zero support. (#25770,#25733,#24892,#26393)

The documentation has been revamped. Check them at rayserve.org (#24414,#26211,#25786,#25936,#26029,#25830,#24760,#24871,#25243,#25390,#25646,#24657,#24713,#25270,#25808,#24693,#24736,#24524,#24690,#25494)

💫 Enhancements:

Serve natively supports deploying predictor and checkpoints from Ray AI Runtime (#26026,#25003,#25537,#25609,#25962,#26494,#25688,#24512,#24417)

Serve now supports scaling Gradio application (#27560)

Java Client API, marking the complete alpha release Java API (#22726)

Improved out-of-box performance by using uvicorn with uvloop (#25027)

RLlib

🎉 New Features:

In 2.0, RLlib is introducing an object-oriented configuration API instead of using a python dict for algorithm configuration (#24332, #24374, #24375, #24376, #24433, #24576, #24650, #24577, #24339, #24687, #24775, #24584, #24583, #24853, #25028, #25059, #25065, #25066, #25067, #25256, #25255, #25278, #25279)

RLlib is introducing a Connectors API (alpha). Connectors are a new component that handles transformations on inputs and outputs of a given RL policy. (#25311, #25007, #25923, #25922, #25954, #26253, #26510, #26645, #26836, #26803, #26998, #27016)

New improvements to off-policy estimators, including a new Doubly-Robust Off-Policy Estimator implementation (#24384, #25107, #25056, #25899, #25911, #26279, #26893)

CRR Algorithm (#25459, #25667, #25905, #26142, #26304, #26770, #27161)

Feature importance evaluation for offline RL (#26412)

RE3 exploration algorithm TF2 framework support (#25221)

Unified replay Buffer API (#24212, #24156, #24473, #24506, #24866, #24683, #25841, #25560, #26428)

💫 Enhancements:

Improvements to RolloutWorker / Env fault tolerance (#24967, #26134, #26276, #26809)

Upgrade gym to 0.23 (#24171), Bump gym dep to 0.24 (#26190)

Agents has been renamed to Algorithms (#24511, #24516, #24739, #24797, #24841, #24896, #25014, #24579, #25314, #25346, #25366, #25539, #25869)

Execution Plan API is now deprecated. Training step function API is the new way of specifying RLlib algorithms (#23454, #24488, #2450, #24212, #24165, #24545, #24507, #25076, #25624, #25924, #25856, #25851, #27344, #24423)

Policy V2 subclassing implementation migration (#24742, #24746, #24914, #25117, #25203, #25078, #25254, #25384, #25585, #25871, #25956, #26054)

Allow passing **kwargs to action distribution. (#24692)

Deprecation: Replace remaining evaluation_num_episodes with evaluation_duration. (#26000)

🔨 Fixes:

Multi-GPU learner thread key error in MA-scenarios (#24382)

Add release learning tests for SlateQ (#24429)

APEX-DQN replay buffer config validation fix. (#24588)

Automatic sequencing in function timeslice_along_seq_lens_with_overlap (#24561)

Policy Server/Client metrics reporting fix (#24783)

Re-establish dashboard performance tests. (#24728)

Bandit tf2 fix (+ add tf2 to test cases). (#24908)

Fix estimated buffer size in replay buffers. (#24848)

Fix RNNSAC example failing on CI + fixes for recurrent models for other Q Learning Algos. (#24923)

Curiosity bug fix. (#24880)

Auto-infer different agents' spaces in multi-agent env. (#24649)

Fix the bug “WorkerSet.stop() will raise error if self._local_worker is None (e.g. in evaluation worker sets)”. (#25332)

Fix Policy global timesteps being off by init sample batch size. (#25349)

Disambiguate timestep fragment storage unit in replay buffers. (#25242)

Fix the bug where on GPU, sample_batch.to_device() only converts the device and does not convert float64 to float32. (#25460)

Fix faulty usage of get_filter_config in ComplexInputNextwork(#25493)

Custom resources per worker should get added to default_resource_request (#24463)

Better default values for training_intensity and target_network_update_freq for R2D2. (#25510)

Fix multi agent environment checks for observations that contain only some agents' obs each step. (#25506)

Fixes PyTorch grad clipping logic and adds grad clipping to QMIX. (#25584)

Discussion 6432: Automatic train_batch_size calculation fix. (#25621)

Added meaningful error for multi-agent failure of SampleCollector in case no agent steps in episode. (#25596)

Replace torch.range with torch.arange. (#25640)\

Fix the bug where there is no gradient clipping in QMix. (#25656)

Fix sample batch concatination. (#25572)

Fix action_sampler_fn call in TorchPolicyV2 (obs_batch instead of input_dict arg). (#25877)

Fixes logging of all of RLlib's Algorithm names as warning messages. (#25840)

IMPALA/APPO multi-agent mix-in-buffer fixes (plus MA learningt ests). (#25848)

Move offline input into replay buffer using rollout ops in CQL. (#25629)

Include SampleBatch.T column in all collected batches. (#25926)

Add timeout to filter synchronization. (#25959)

SimpleQ PyTorch Multi GPU fix (#26109)

IMPALA and APPO metrics fixes; remove deprecated async_parallel_requests utility. (#26117)

Added 'episode.hist_data' to the 'atari_metrics' to nsure that custom metrics of the user are kept in postprocessing when using Atari environments. (#25292)

Make the dataset and json readers batchable (#26055)

Fix Issue 25696: Output writers not working w/ multiple workers. (#25722)

Fix all the erroneous on_trainer_init warning. (#26433)

In env check, step only expected agents. (#26425)

Make DQN update_target use only trainable variables. (#25226)

Fix FQE Policy call (#26671)

Make queue placement ops blocking (#26581)

Fix memory leak in APEX_DQN (#26691)

Fix MultiDiscrete not being one-hotted correctly (#26558)

Make IOContext optional for DatasetReader (#26694)

Make sure we step() after adding init_obs. (#26827)

Fix ModelCatalog for nested complex inputs (#25620)

Use compress observations where replay buffers and image obs are used in tuned examples (#26735)

Fix SampleBatch.split_by_episode to use dones if episode id is not available (#26492)

Fix torch None conversion in torch_utils.py::convert_to_torch_tensor. (#26863)

Unify gnorm mixin for tf and torch policies. (#26102)

Ray Workflows

🎉 New Features:

Support ray client (#26702)

Http event is supported (#26010)

Support retry_exceptions (#26913)

Support queuing in workflow (#24697)

Make status indexed (#24767)

🔨 Fixes:

Push logs to drivers correctly (#24490)

Make resume no side effect (#26918)

Make the max_retries aligned with ray (#26350)

🏗 Architecture refactoring:

Rewrite workflow execution engine (#25618)

Simplify the resume flow (#24594)

Deprecate step and use bind (#26232)

Deprecate virtual actor (#25394)

Refactor the exception processing (#26398)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Ray State API is now at alpha. You can access the live information of tasks, actors, objects, placement groups, and etc. through Ray CLI (summary / list / get) and Python SDK. See the Ray State API documentation for more information.

Support generators for tasks with multiple return values (#25247)

Support GCS Fault tolerance.(#24764, #24813, #24887, #25131, #25126, #24747, #25789, #25975, #25994, #26405, #26421, #26919)

💫 Enhancements:

Allow failing new tasks immediately while the actor is restarting (#22818)

Add more accurate worker exit (#24468)

Allow user to override global default for max_retries (#25189)

Export additional metrics for workers and Raylet memory (#25418)

Push message to driver when a Raylet dies (#25516)

Out of Disk prevention (#25370)

ray.init defaults to an existing Ray instance if there is one (#26678)

Reconstruct manually freed objects (#27567)

🔨 Fixes:

Fix a task cancel hanging bug (#24369)

Adjust worker OOM scores to prioritize the raylet during memory pressure (#24623)

Fix pull manager deadlock due to object reconstruction (#24791)

Fix bugs in data locality aware scheduling (#25092)

Fix node affinity strategy when resource is empty (#25344)

Fix object transfer resend protocol (#26349)

🏗 Architecture refactoring:

Raylet and GCS schedulers share the same code (#23829)

Remove multiple core workers in one process (#24147, #25159)

Ray Clusters

🎉 New Features:

The KubeRay operator is now the preferred tool to run Ray on Kubernetes.

Ray Autoscaler + KubeRay operator integration is now beta.

💫 Enhancements:

Check out the newly revamped docs!

🔨 Fixes:

Previously deprecated fields, head_node, worker_nodes, head_node_type, default_worker_node_type, autoscaling_mode, target_utilization_fraction are removed. Check out the migration guideto learn how to migrate to the new versions.

Ray Client

🎉 New Features:

Support for configuring request metadata for client gRPC (#24946)

💫 Enhancements:

Remove 2 GiB size limit on remote function arguments (#24555)

🔨 Fixes:

Fix excessive memory usage when submitting large remote arguments (#24477)

Dashboard

🎉 New Features:

The new dashboard UI is now to default dashboard. Please leave any feedback about the dashboard on Github Issues or Discourse! You can still go to the legacy dashboard UI by clicking “Back to legacy dashboard”.

New Dashboard UI now shows all ray jobs. This includes jobs submitted via the job submission API and jobs launched from python scripts via ray.init().

New Dashboard UI now shows worker nodes in the main node tab

New Dashboard UI now shows more information in the actors tab

Breaking changes:

The job submission list_jobs API endpoint, CLI command, and SDK function now returns a list of jobs instead of a dictionary from id to job.

The Tune tab is no longer in the new dashboard UI. It is still available in the legacy dashboard UI but will be removed.

The memory tab is no longer in the new dashboard UI. It is still available in the legacy dashboard UI but will be removed.

🔨 Fixes:

We reduced the memory usage of the dashboard. We are no longer caching logs and we cache a maximum of 1000 actors. As a result of this change, node level logs can no longer be accessed in the legacy dashboard.

Jobs status error message now properly truncates logs to 10 lines. We also added a max characters of 20000 to avoid passing too much data.

Many thanks to all those who contributed to this release!

@ujvl, @xwjiang2010, @EricCousineau-TRI, @ijrsvt, @waleedkadous, @captain-pool, @olipinski, @danielwen002, @amogkam, @bveeramani, @kouroshHakha, @jjyao, @larrylian, @goswamig, @hanming-lu, @edoakes, @nikitavemuri, @enori, @grechaw, @truelegion47, @alanwguo, @sychen52, @ArturNiederfahrenhorst, @pcmoritz, @mwtian, @vakker, @c21, @rberenguel, @mattip, @robertnishihara, @cool-RR, @iamhatesz, @ofey404, @raulchen, @nmatare, @peterghaddad, @n30111, @fkaleo, @Riatre, @zhe-thoughts, @lchu-ibm, @YoelShoshan, @Catch-Bull, @matthewdeng, @VishDev12, @valtab, @maxpumperla, @tomsunelite, @fwitter, @liuyang-my, @peytondmurray, @clarkzinzow, @VeronikaPolakova, @sven1977, @stephanie-wang, @emjames, @Nintorac, @suquark, @javi-redondo, @xiurobert, @smorad, @brucez-anyscale, @pdames, @jjyyxx, @dmatrix, @nakamasato, @richardliaw, @juliusfrost, @anabranch, @christy, @Rohan138, @cadedaniel, @simon-mo, @mavroudisv, @guidj, @rkooo567, @orcahmlee, @lixin-wei, @neigh80, @yuduber, @JiahaoYao, @simonsays1980, @gjoliver, @jimthompson5802, @lucasalavapena, @zcin, @clarng, @jbn, @DmitriGekhtman, @timgates42, @charlesjsun, @Yard1, @mgelbart, @wumuzi520, @sihanwang41, @ghost, @jovany-wang, @siavash119, @yuanchi2807, @tupui, @jianoaix, @sumanthratna, @code-review-doctor, @Chong-Li, @FedericoGarza, @ckw017, @Makan-Ar, @kfstorm, @flanaman, @WangTaoTheTonic, @franklsf95, @scv119, @kvaithin, @wuisawesome, @jiaodong, @mgerstgrasser, @tiangolo, @architkulkarni, @MyeongKim, @ericl, @SongGuyang, @avnishn, @chengscott, @shrekris-anyscale, @Alyetama, @iycheng, @rickyyx, @krfricke, @sijieamoy, @kimikuri, @czgdp1807, @michalsustr
Source code(tar.gz)
Source code(zip)
ray-1.13.0(Jun 9, 2022)
Highlights:

Python 3.10 support is now in alpha.

Ray usage stats collection is now on by default (guarded by an opt-out prompt).

Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)

Ray Workflow comes with a new API and is integrated with Ray DAG.

Ray Autoscaler

💫Enhancements:

CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)

Stability enhancements for KubeRay autoscaler integration (#23428)

🔨 Fixes:

Improved GPU support in KubeRay autoscaler integration (#23383)

Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)

Ray Client

💫Enhancements:

Add option to configure ray.get with >2 sec timeout (#22165)

Return None from internal KV for non-existent keys (#24058)

🔨 Fixes:

Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)

Ray Core

🎉 New Features:

Ray usage stats collection is now on by default (guarded by an opt-out prompt)

Alpha support for python 3.10 (on Linux and Mac)

Node affinity scheduling strategy (#23381)

Add metrics for disk and network I/O (#23546)

Improve exponential backoff when connecting to the redis (#24150)

Add the ability to inject a setup hook for customization of runtime_env on init (#24036)

Add a utility to check GCS / Ray cluster health (#23382)

🔨 Fixes:

Fixed internal storage S3 bugs (#24167)

Ensure "get_if_exists" takes effect in the decorator. (#24287)

Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)

Add memory buffer limit in publisher for each subscribed entity (#23707)

Use gRPC instead of socket for GCS client health check (#23939)

Trim size of Reference struct (#23853)

Enable debugging into pickle backend (#23854)

🏗 Architecture refactoring:

Gcs storage interfaces unification (#24211)

Cleanup pickle5 version check (#23885)

Simplify options handling (#23882)

Moved function and actor importer away from pubsub (#24132)

Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)

Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)

Save task spec in separate table (#22650)

Ray Datasets

🎉 New Features:

Performance improvement: the aggregation computation is vectorized (#23478)

Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)

Performance improvement: more efficient move semantics for Datasets block processing (#24127)

Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)

Supports native Tensor views in map processing for pure-tensor datasets (#24812)

Implemented push-based shuffle (#24281)

🔨 Fixes:

Documentation improvement: Getting Started page (#24860)

Documentation improvement: FAQ (#24932)

Documentation improvement: End to end examples (#24874)

Documentation improvement: Feature guide - Creating Datasets (#24831)

Documentation improvement: Feature guide - Saving Datasets (#24987)

Documentation improvement: Feature guide - Transforming Datasets (#25033)

Documentation improvement: Datasets APIs docstrings (#24949)

Performance: fixed block prefetching (#23952)

Fixed zip() for Pandas dataset (#23532)

🏗 Architecture refactoring:

Refactored LazyBlockList (#23624)

Added path-partitioning support for all content types (#23624)

Added fast metadata provider and refactored Parquet datasource (#24094)

RLlib

🎉 New Features:

Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)

🏗 Architecture refactoring:

More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX , SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.

Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)

POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)

Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)

💫Enhancements:

Add support for complex observations in CQL. (#23332)

Bandit support for tf2. (#22838)

Make actions sent by RLlib to the env immutable. (#24262)

Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)

Enable DD-PPO to run on Windows. (#23673)

🔨 Fixes:

APPO eager fix (APPOTFPolicy gets wrapped as_eager() twice by mistake). (#24268)

CQL gets stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)

SlateQ runs on GPU (torch). (#23464)

Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429

Ray Workflow

🎉 New Features:

Workflow step is deprecated (#23796, #23728, #23456, #24210)

🔨 Fixes:

Fix one bug where max_retries is not aligned with ray core’s max_retries. (#22903)

🏗 Architecture refactoring:

Integrate ray storage in workflow (#24120)

Tune

🎉 New Features:

Add RemoteTask based sync client (#23605) (rsync not required anymore!)

Chunk file transfers in cross-node checkpoint syncing (#23804)

Also interrupt training when SIGUSR1 received (#24015)

reuse_actors per default for function trainables (#24040)

Enable AsyncHyperband to continue training for last trials after max_t (#24222)

💫Enhancements:

Improve testing (#23229

Improve docstrings (#23375)

Improve documentation (#23477, #23924)

Simplify trial executor logic (#23396

Make MLflowLoggerUtil copyable (#23333)

Use new Checkpoint interface internally (#22801)

Beautify Optional typehints (#23692)

Improve missing search dependency info (#23691)

Skip tmp checkpoints in analysis and read iteration from metadata (#23859)

Treat checkpoints with nan value as worst (#23862)

Clean up base ProgressReporter API (#24010)

De-clutter log outputs in trial runner (#24257)

hyperopt searcher to support tune.choice([[1,2],[3,4]]). (#24181)

🔨Fixes:

Optuna should ignore additional results after trial termination (#23495)

Fix PTL multi GPU link (#23589)

Improve Tune cloud release tests for durable storage (#23277)

Fix tensorflow distributed trainable docstring (#23590)

Simplify experiment tag formatting, clean directory names (#23672)

Don't include nan metrics for best checkpoint (#23820)

Fix syncing between nodes in placement groups (#23864)

Fix memory resources for head bundle (#23861)

Fix empty CSV headers on trial restart (#23860)

Fix checkpoint sorting with nan values (#23909)

Make Timeout stopper work after restoring in the future (#24217)

Small fixes to tune-distributed for new restore modes (#24220)

Train

Most distributed training enhancements will be captured in the new Ray AIR category!

🔨Fixes:

Copy resources_per_worker to avoid modifying user input

Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)

Fix multi node horovod bug (#22564)

Fully deprecate Ray SGD v1 (#24038)

Improvements to fault tolerance (#22511)

MLflow start run under correct experiment (#23662)

Raise helpful error when required backend isn't installed (#23583)

Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)

📖Documentation:

add FAQ (#22757)

Ray AIR

🎉 New Features:

HuggingFaceTrainer & HuggingFacePredictor (#23615, #23876)

SklearnTrainer & SklearnPredictor (#23803, #23850)

HorovodTrainer (#23437)

RLTrainer & RLPredictor (#23465, #24172)

BatchMapper preprocessor (#23700)

Categorizer preprocessor (#24180)

BatchPredictor (#23808)

💫Enhancements:

Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)

Add config to Result, extend ResultGrid.get_best_config (#23698)

Add Scaling Config validation (#23889)

Add tuner test. (#23364)

Move storage handling to pyarrow.fs.FileSystem (#23370)

Refactor _get_unique_value_indices (#24144)

Refactor most_frequent SimpleImputer (#23706)

Set name of Trainable to match with Trainer #23697

Use checkpoint.as_directory() instead of cleaning up manually (#24113)

Improve file packing/unpacking (#23621)

Make Dataset ingest configurable (#24066)

Remove postprocess_checkpoint (#24297)

🔨Fixes:

Better exception handling (#23695)

Do not deepcopy RunConfig (#23499)

reduce unnecessary stacktrace (#23475)

Tuner should use run_config from Trainer per default (#24079)

Use custom fsspec handler for GS (#24008)

📖Documentation:

Add distributed torch_geometric example (#23580)

GNN example cleanup (#24080)

Serve

🎉 New Features:

Serve logging system was revamped! Access log is now turned on by default. (#23558)

New Gradio notebook example for Ray Serve deployments (#23494)

Serve now includes full traceback in deployment update error message (#23752)

💫Enhancements:

Serve Deployment Graph was enhanced with performance fixes and structural clean up. (#24199, #24026, #24065, #23984)

End to end tutorial for deployment graph (#23512, #22771, #23536)

input_schema is now renamed as http_adapter for usability (#24353, #24191)

Progress towards a declarative REST API (#23232, #23481)

Code cleanup and refactoring (#24067, #23578, #23934, #23759)

Protobuf based controller API for cross language client (#23004)

🔨Fixes:

Handle None in ReplicaConfig's resource_dict (#23851)

Set "memory" to None in ray_actor_options by default (#23619)

Make serve.shutdown() shutdown remote Serve applications (#23476)

Ensure replica reconfigure runs after allocation check (#24052)

Allow cloudpickle serializable objects as init args/kwargs (#24034)

Use controller namespace when getting actors (#23896)

Dashboard

🔨Fixes:

Add toggle to enable showing node disk usage on K8s (#24416, #24440)

Add job submission id as field to job snapshot (#24303)

Thanks Many thanks to all those who contributed to this release! @matthewdeng, @scv119, @xychu, @iycheng, @takeshi-yoshimura, @iasoon, @wumuzi520, @thetwotravelers, @maxpumperla, @krfricke, @jgiannuzzi, @kinalmehta, @avnishn, @dependabot[bot], @sven1977, @raulchen, @acxz, @stephanie-wang, @mgelbart, @xwjiang2010, @jon-chuang, @pdames, @ericl, @edoakes, @gjoseph92, @ddelange, @bkasper, @sriram-anyscale, @Zyiqin-Miranda, @rkooo567, @jbedorf, @architkulkarni, @osanseviero, @simonsays1980, @clarkzinzow, @DmitriGekhtman, @ashione, @smorad, @andenrx, @mattip, @bveeramani, @chaokunyang, @richardliaw, @larrylian, @Chong-Li, @fwitter, @shrekris-anyscale, @gjoliver, @simontindemans, @silky, @grypesc, @ijrsvt, @daikeshi, @kouroshHakha, @mwtian, @mesjou, @sihanwang41, @PavelCz, @czgdp1807, @jianoaix, @GuillaumeDesforges, @pcmoritz, @arsedler9, @n30111, @kira-lin, @ckw017, @max0x7ba, @Yard1, @XuehaiPan, @lchu-ibm, @HJasperson, @SongGuyang, @amogkam, @liuyang-my, @WangTaoTheTonic, @jovany-wang, @simon-mo, @dynamicwebpaige, @suquark, @ArturNiederfahrenhorst, @jjyao, @KepingYan, @jiaodong, @frosk1
Source code(tar.gz)
Source code(zip)
ray-1.12.1(May 16, 2022)
Patch release with the following fixes:

Ray now works on Google Colab again! The bug with memory limit fetching when running Ray in a container is now fixed (https://github.com/ray-project/ray/pull/23922).

ray-ml Docker images for CPU will start being built again after they were stopped in Ray 1.9 (https://github.com/ray-project/ray/pull/24266).

[Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (https://github.com/ray-project/ray/pull/23662).

[RLlib] Fix for APPO in eager mode (https://github.com/ray-project/ray/pull/24268).

[RLlib] Fix Alphastar for TF2 and tracing enabled (https://github.com/ray-project/ray/commit/c5502b2aa57376b26408bb297ff68696c16f48f1).

[Serve] Fix replica leak in anonymous namespaces (https://github.com/ray-project/ray/pull/24311).

Source code(tar.gz)
Source code(zip)
ray-1.11.1(May 10, 2022)
Patch release including fixes for the following issues:

Ray Job Submission not working with remote working_dir URLs in their runtime environment (https://github.com/ray-project/ray/pull/22018)

Ray Tune + MLflow integration failing to set MLflow experiment ID (https://github.com/ray-project/ray/pull/23662)

Dependencies for gym not pinned, leading to version incompatibility issues (https://github.com/ray-project/ray/pull/23705)

Source code(tar.gz)
Source code(zip)
ray-1.12.0(Apr 8, 2022)
Highlights

Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information on the docs or on the public RFC.

Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!

Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.

New algorithms are added to RLlib: SlateQ & Bandits (for recommender systems use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play)

Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.

Ray Autoscaler

🎉 New Features

Support cache_stopped_nodes on Azure (#21747)

AWS Cloudwatch support (#21523)

💫 Enhancements

Improved documentation and standards around built in autoscaler node providers. (#22236, 22237)

Improved KubeRay support (#22987, #22847, #22348, #22188)

Remove redis requirement (#22083)

🔨 Fixes

No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)

Default ami’s per AWS region are updated/fixed. (#22506)

GCP node termination updated (#23101)

Retry legacy k8s operator on monitor failure (#22792)

Cap min and max workers for manually managed on-prem clusters (#21710)

Fix initialization artifacts (#22570)

Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)

Ray Client

🎉 New Features:

ray.init has consistent return value in client mode and driver mode #21355

💫Enhancements:

Gets and puts are streamed to support arbitrary object sizes #22100, #22327

🔨 Fixes:

Fix ray client object ref releasing in wrong context #22025

Ray Core

🎉 New Features

RuntimeEnv:

Support setting timeout for runtime_env setup. (#23082)

Support setting pip_check and pip_version for runtime_env. (#22826, #23306)

env_vars will take effect when the pip install command is executed. (temporarily ineffective in conda) (#22730)

Support strongly-typed API ray.runtime.RuntimeEnv to define runtime env. (#22522)

Introduce virtualenv to isolate the pip type runtime env. (#21801,#22309)

Raylet shares fate with the dashboard agent. And the dashboard agent will stay alive when it catches the port conflicts. (#22382,#23024)

Enable dashboard in the minimal ray installation (#21896)

Add task and object reconstruction status to ray memory cli tools(#22317)

🔨 Fixes

Report only memory usage of pinned object copies to improve scaledown. (#22020)

Scheduler:

No spreading if a node is selected for lease request due to locality. (#22015)

Placement group scheduling: Non-STRICT_PACK PGs should be sorted by resource priority, size (#22762)

Round robin during spread scheduling (#21303)

Object store:

Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)

Cleanup handling for nondeterministic object size during transfer (#22639)

Fix bug in fusion for spilled objects (#22571)

Handle IO worker failures correctly (#20752)

Improve ray stop behavior (#22159)

Avoid warning when receiving too much logs from a different job (#22102)

Gcs resource manager bug fix and clean up. (#22462, #22459)

Release GIL when running parallel_memcopy() / memcpy() during serializations. (#22492)

Fix registering serializer before initializing Ray. (#23031)

🏗 Architecture refactoring

Ray distributed scheduler refactoring: (#21927, #21992, #22160, #22359, #22722, #22817, #22880, #22893, #22885, #22597, #22857, #23124)

Removed support for bootstrapping with Redis.

Ray Data Processing

🎉 New Features

Big Performance and Stability Improvements:

Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)

Support for random access datasets, providing efficient random access to rows via binary search (#22749)

Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the _spread_resource_prefix hack (#21303)

More Efficient Tabular Data Wrangling:

Add first-class support for Pandas blocks, removing expensive Arrow <-> Pandas conversion costs (#21894)

Expose TableRow API + minimize copies/type-conversions on row-based ops (#22305)

Groupby + Aggregations Improvements:

Support mapping over groupby groups (#22715)

Support ignoring nulls in aggregations (#20787)

Improved Dataset Windowing:

Support windowing a dataset by bytes instead of number of blocks (#22577)

Batch across windows in DatasetPipelines (#22830)

Better Text I/O:

Support streaming snappy compression for text files (#22486)

Allow for custom decoding error handling in read_text() (#21967)

Add option for dropping empty lines in read_text() (#22298)

New Operations:

Add add_column() utility for adding derived columns (#21967)

Support for metadata provider callback for read APIs (#22896)

Support configuring autoscaling actor pool size (#22574)

🔨 Fixes

Force lazy datasource materialization in order to respect DatasetPipeline stage boundaries (#21970)

Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (#22007)

Respect 0 CPU resource request when using manual resource-based load balancing (#22017)

Remove batch format ambiguity by always converting Arrow batches to Pandas when batch_format=”native” is given (#21566)

Fix leaked stats actor handle due to closure capture reference counting bug (#22156)

Fix boolean tensor column representation and slicing (#22323)

Fix unhandled empty block edge case in shuffle (#22367)

Fix unserializable Arrow Partitioning spec (#22477)

Fix incorrect iter_epochs() batch format (#22550)

Fix infinite iter_epochs() loop on unconsumed epochs (#22572)

Fix infinite hang on split() when num_shards < num_rows (#22559)

Patch Parquet file fragment serialization to prevent metadata fetching (#22665)

Don’t reuse task workers for actors or GPU tasks (#22482)

Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)

Always use non-empty blocks to determine schema (#22834)

API fix bash (#22886)

Make label_column optional for to_tf() so it can be used for inference (#22916)

Fix schema() for DatasetPipelines (#23032)

Fix equalized split when num_splits == num_blocks (#23191)

💫 Enhancements

Optimize Parquet metadata serialization via batching (#21963)

Optimize metadata read/write for Ray Client (#21939)

Add sanity checks for memory utilization (#22642)

🏗 Architecture refactoring

Use threadpool to submit DatasetPipeline stages (#22912)

RLlib

🎉 New Features

New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)

SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)

Bandit algorithms: Moved into agents folder as first-class citizens, TensorFlow-Version, unified w/ other agents’ APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)

ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)

Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)

🔨 Fixes

Fixed memory leak in SimpleReplayBuffer. (#22678)

Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (#22247)

Various bug fixes. (#22350, #22245, #22171, #21697, #21855, #22076, #22590, #22587, #22657, #22428, #23063, #22619, #22731, #22534, #22074, #22078, #22641, #22684, #22398, #21685)

🏗 Architecture refactoring

A3C: Moved into new training_iteration API (from exeution_plan API). Lead to a ~2.7x performance increase on a Atari + CNN + LSTM benchmark. (#22126, #22316)

Make multiagent->policies_to_train more flexible via callable option (alternative to providing a list of policy IDs). (#20735)

💫Enhancements:

Env pre-checking module now active by default. (#22191)

Callbacks: Added on_sub_environment_created and on_trainer_init callback options. (#21893, #22493)

RecSim environment wrappers: Ability to use google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)

MARWIL loss function enhancement (exploratory term for stddev). (#21493)

📖Documentation:

Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239)

Other doc enhancements and fixes. (#23160, #23226, #22496, #22489, #22380)

Ray Workflow

🎉 New Features:

Support skip checkpointing.

🔨 Fixes:

Fix an issue where the event loop is not set.

Tune

🎉 New Features:

Expose new checkpoint interface to users (#22741)

💫Enhancements:

Better error msg for grpc resource exhausted error. (#22806)

Add CV support for XGB/LGBM Tune callbacks (#22882)

Make sure tune.run can run inside worker thread (https://github.com/ray-project/ray/commit/b8c28d1f2beb7a141f80a5fd6053c8e8520718b9)#22566 )

Add Trainable.postprocess_checkpoint (#22973) Trainables will now know TUNE_ORIG_WORKING_DIR (#22803 )

Retry cloud sync up/down/delete on fail (#22029)

Support functools.partial names and treat as function in registry (#21518)

🔨Fixes:

Cleanup incorrectly formatted strings (Part 2: Tune) (#23129 )

fix error handling for fail_fast case. (#22982)

Remove Trainable.update_resources (#22471)

Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (#22071)

Fix analysis without registered trainable (#21475)

Update Lightning examples to support PTL 1.5 (#20562)

Fix WandbTrainableMixin config for rllib trainables (#22063)

[wandb] Use resume=False per default (#21892)

🏗 Refactoring:

Move resource updater out of trial executor (#23178 )

Preparation for deadline schedulers (#22006 )

Single wait refactor. (#21852 )

📖Documentation:

Tune docs overhaul (first part) (#22112)

Tune overhaul part II (#22656 )

Note TPESampler performance issues in docs (#22545)

hyperopt notebook (#22315)

Train

🎉 New Features

Integration with PyTorch profiler. Easily enable the pytorch profiler with Ray Train to profile training and visualize stats in Tensorboard (#22345).

Automatic pipelining of host to device transfer. While training is happening on one batch of data, the next batch of data is concurrently being moved from CPU to GPU (#22716, #22974)

Automatic Mixed Precision. Easily enable PyTorch automatic mixed precision during training (#22227).

💫 Enhancements

Add utility function to enable reproducibility for Pytorch training (#22851)

Add initial support for metrics aggregation (#22099)

Add support for trainer.best_checkpoint and Trainer.load_checkpoint_path. You can now directly access the best in memory checkpoint, or load an arbitrary checkpoint path to memory. (#22306)

🔨 Fixes

Add a utility function to turn off TF autosharding (#21887)

Fix fault tolerance for Tensorflow training (#22508)

Train utility methods (train.report(), etc.) can now be called outside of a Train session (#21969)

Fix accuracy calculation for CIFAR example (#22292)

Better error message for placement group time out (#22845)

📖 Documentation

Update docs for ray.train.torch import (#22555)

Clarify shuffle documentation in prepare_data_loader (#22876)

Denote train.torch.get_device as a Public API (#22024)

Minor fixes on Ray Train user guide doc (#22379)

Serve

🎉 New Features

Deployment Graph API is now in alpha. It provides a way to build, test and deploy complex inference graph composed of many deployments. (#23177, #23252, #23301, #22840, #22710, #22878, #23208, #23290, #23256, #23324, #23289, #23285, #22473, #23125, #23210)

New experimental REST API and CLI for creating and managing deployments. ( #22839, #22257, #23198, #23027, #22039, #22547, #22578, #22611, #22648, #22714, #22805, #22760, #22917, #23059, #23195, #23265, #23157, #22706, #23017, #23026, #23215)

New sets of HTTP adapters making it easy to build simple application, as well as Ray AI Runtime model wrappers in alpha. (#22913, #22914, #22915, #22995)

New health_check API for end to end user provided health check. (#22178, #22121, #22297)

🔨 Fixes

Autoscaling algorithm will now relingquish most idle nodes when scaling down (#22669)

Serve can now manage Java replicas (#22628)

Added a hands-on self-contained MLflow and Ray Serve deployment example (#22192)

Added root_path setting to http_options (#21090)

Remove shard_key, http_method, and http_headers in ServeHandle (#21590)

Dashboard

🔨Fixes:

Update CPU and memory reporting in kubernetes. (#21688)

Thanks

Many thanks to all those who contributed to this release! @edoakes, @pcmoritz, @jiaodong, @iycheng, @krfricke, @smorad, @kfstorm, @jjyyxx, @rodrigodelazcano, @scv119, @dmatrix, @avnishn, @fyrestone, @clarkzinzow, @wumuzi520, @gramhagen, @XuehaiPan, @iasoon, @birgerbr, @n30111, @tbabej, @Zyiqin-Miranda, @suquark, @pdames, @tupui, @ArturNiederfahrenhorst, @ashione, @ckw017, @siddgoel, @Catch-Bull, @vicyap, @spolcyn, @stephanie-wang, @mopga, @Chong-Li, @jjyao, @raulchen, @sven1977, @nikitavemuri, @jbedorf, @mattip, @bveeramani, @czgdp1807, @dependabot[bot], @Fabien-Couthouis, @willfrey, @mwtian, @SlowShip, @Yard1, @WangTaoTheTonic, @Wendi-anyscale, @kaushikb11, @kennethlien, @acxz, @DmitriGekhtman, @matthewdeng, @mraheja, @orcahmlee, @richardliaw, @dsctt, @yupbank, @Jeffwan, @gjoliver, @jovany-wang, @clay4444, @shrekris-anyscale, @jwyyy, @kyle-chen-uber, @simon-mo, @ericl, @amogkam, @jianoaix, @rkooo567, @maxpumperla, @architkulkarni, @chenk008, @xwjiang2010, @robertnishihara, @qicosmos, @sriram-anyscale, @SongGuyang, @jon-chuang, @wuisawesome, @valiantljk, @simonsays1980, @ijrsvt
Source code(tar.gz)
Source code(zip)
ray-1.11.0(Mar 9, 2022)
Highlights

🎉 Ray no longer starts Redis by default. Cluster metadata previously stored in Redis is stored in the GCS now.

Ray Autoscaler

🎉 New Features

AWS Cloudwatch dashboard support #20266

💫 Enhancements

Kuberay autoscaler prototype #21086

🔨 Fixes

Ray.autoscaler.sdk import issue #21795

Ray Core

🎉 New Features

Set actor died error message in ActorDiedError #20903

Event stats is enabled by default #21515

🔨 Fixes

Better support for nested tasks

Fixed 16GB mac perf issue by limit the plasma store size to 2GB #21224

Fix SchedulingClassInfo.running_tasks memory leak #21535

Round robin during spread scheduling #19968

🏗 Architecture refactoring

Refactor scheduler resource reporting public APIs #21732

Refactor ObjectManager wait logic to WaitManager #21369

Ray Data Processing

🎉 New Features

More powerful to_torch() API, providing more control over the GPU batch format. (#21117)

🔨 Fixes

Fix simple Dataset sort generating only 1 non-empty block. (#21588)

Improve error handling across sorting, groupbys, and aggregations. (#21610, #21627)

Fix boolean tensor column representation and slicing. (#22358)

RLlib

🎉 New Features

Better utils for flattening complex inputs and enable prev-actions for LSTM/attention for complex action spaces. (#21330)

MultiAgentEnv pre-checker (#21476)

Base env pre-checker. (#21569)

🔨 Fixes

Better defaults for QMix (#21332)

Fix contrib/MADDPG + pettingzoo coop-pong-v4. (#21452)

Fix action unsquashing causes inf/NaN actions for unbounded action spaces. (#21110)

Ignore PPO KL-loss term completely if kl-coeff == 0.0 to avoid NaN values (#21456)

unsquash_action and clip_action (when None) cause wrong actions computed by Trainer.compute_single_action. (#21553)

Conv2d default filter tests and add default setting for 96x96 image obs space. (#21560)

Bing back and fix offline RL(BC & MARWIL) learning tests. (#21574, #21643)

SimpleQ should not use a prio. replay buffer. (#21665)

Fix video recorder env wrapper. Added test case. (#21670)

🏗 Architecture refactoring

Decentralized multi-agent learning (#21421)

Preparatory PR for multi-agent multi-GPU learner (alpha-star style) (#21652)

Ray Workflow

🔨 Fixes

Fixed workflow recovery issue due to a bug of dynamic output #21571

Tune

🎉 New Features

It is now possible to load all evaluated points from an experiment into a Searcher (#21506)

Add CometLoggerCallback (#20766)

💫 Enhancements

Only sync the checkpoint folder instead of the entire trial folder for cloud checkpoint. (#21658)

Add test for heterogeneous resource request deadlocks (#21397)

Remove unused return_or_clean_cached_pg (#21403)

Remove TrialExecutor.resume_trial (#21225)

Leave only one canonical way of stopping a trial (#21021)

🔨 Fixes

Replace deprecated running_sanity_check with sanity_checking in PTL integration (#21831)

Fix loading an ExperimentAnalysis object without a registered Trainable (#21475)

Fix stale node detection bug (#21516)

Fixes to allow tune/tests/test_commands.py to run on Windows (#21342)

Deflake PBT tests (#21366)

Fix dtype coercion in tune.choice (#21270)

📖 Documentation

Fix typo in schedulers.rst (#21777)

Train

🎉 New Features

Add PrintCallback (#21261)

Add MLflowLoggerCallback(#20802)

💫 Enhancements

Refactor Callback implementation (#21468, #21357, #21262)

🔨 Fixes

Fix Dataloader (#21467)

📖 Documentation

Documentation and example fixes (#21761, #21689, #21464)

Serve

🎉 New Features

Checkout our revampt end-to-end tutorial that walks through the deployment journey! (#20765)

🔨 Fixes

Warn when serve.start() with different options (#21562)

Detect http.disconnect and cancel requests properly (#21438)

Thanks Many thanks to all those who contributed to this release! @isaac-vidas, @wuisawesome, @stephanie-wang, @jon-chuang, @xwjiang2010, @jjyao, @MissiontoMars, @qbphilip, @yaoyuan97, @gjoliver, @Yard1, @rkooo567, @talesa, @czgdp1807, @DN6, @sven1977, @kfstorm, @krfricke, @simon-mo, @hauntsaninja, @pcmoritz, @JamieSlome, @chaokunyang, @jovany-wang, @sidward14, @DmitriGekhtman, @ericl, @mwtian, @jwyyy, @clarkzinzow, @hckuo, @vakker, @HuangLED, @iycheng, @edoakes, @shrekris-anyscale, @robertnishihara, @avnishn, @mickelliu, @ndrwnaguib, @ijrsvt, @Zyiqin-Miranda, @bveeramani, @SongGuyang, @n30111, @WangTaoTheTonic, @suquark, @richardliaw, @qicosmos, @scv119, @architkulkarni, @lixin-wei, @Catch-Bull, @acxz, @benblack769, @clay4444, @amogkam, @marin-ma, @maxpumperla, @jiaodong, @mattip, @isra17, @raulchen, @wilsonwang371, @carlogrisetti, @ashione, @matthewdeng
Source code(tar.gz)
Source code(zip)
ray-1.10.0(Feb 4, 2022)
Highlights

🎉 Ray Windows support is now in beta – a significant fraction of the Ray test suite is now passing on Windows. We are eager to learn about your experience with Ray 1.10 on Windows, please file issues you encounter at https://github.com/ray-project/ray/issues. In the upcoming releases we will spend more time on making Ray Serve and Runtime Environment tests pass on Windows and on polishing things.

Ray Autoscaler

💫Enhancements:

Add autoscaler update time to prometheus metrics (#20831)

Fewer non terminated nodes calls in autoscaler update (#20359, #20623)

🔨 Fixes:

GCP TPU autoscaling fix (#20311)

Scale-down stability fix (#21204)

Report node launch failure in driver logs (#20814)

Ray Client

💫Enhancements

Client task options are encoded with pickle instead of json (#20930)

Ray Core

🎉 New Features:

runtime_env’s pip field now installs pip packages in your existing environment instead of installing them in a new isolated environment. (#20341)

🔨 Fixes:

Fix bug where specifying runtime_env conda/pip per-job using local requirements file using Ray Client on a remote cluster didn’t work (#20855)

Security fixes for log4j2 – the log4j2 version has been bumped to 2.17.1 (#21373)

💫Enhancements:

Allow runtime_env working_dir and py_modules to be pathlib.Path type (#20853, #20810)

Add environment variable to skip local runtime_env garbage collection (#21163)

Change runtime_env error log to debug log (#20875)

Improved reference counting for runtime_env resources (#20789)

🏗 Architecture refactoring:

Refactor runtime_env to use protobuf for multi-language support (#19511)

📖Documentation:

Add more comprehensive runtime_env documentation (#20222, #21131, #20352)

Ray Data Processing

🎉 New Features:

Added stats framework for debugging Datasets performance (#20867, #21070)

[Dask-on-Ray] New config helper for enabling the Dask-on-Ray scheduler (#21114)

💫Enhancements:

Reduce memory usage during when converting to a Pandas DataFrame (#20921)

🔨 Fixes:

Fix slow block evaluation when splitting (#20693)

Fix boundary sampling concatenation on non-uniform blocks (#20784)

Fix boolean tensor column slicing (#20905)

🏗 Architecture refactoring:

Refactor table block structure to support more tabular block formats (#20721)

RLlib

🎉 New Features:

Support for RE3 exploration algorithm (for tf only). (#19551)

Environment pre-checks, better failure behavior and enhanced environment API. (#20481, #20832, #20868, #20785, #21027, #20811)

🏗 Architecture refactoring:

Evaluation: Support evaluation setting that makes sure train doesn't ever have to wait for eval to finish (b/c of long episodes). (#20757); Always attach latest eval metrics. (#21011)

Soft-deprecate build_trainer() utility function in favor of sub-classing Trainer directly (and overriding some of its methods). (#20635, #20636, #20633, #20424, #20570, #20571, #20639, #20725)

Experimental no-flatten option for actions/prev-actions. (#20918)

Use SampleBatch instead of an input dict whenever possible. (#20746)

Switch off Preprocessors by default for PGTrainer (experimental). (#21008)

Toward a Replay Buffer API (cleanups; docstrings; renames; move into rllib/execution/buffers dir) (#20552)

📖Documentation:

Overhaul of auto-API reference pages. (#19786, #20537, #20538, #20486, #20250)

README and RLlib landing page overhaul (#20249).

Added example containing code to compute an adapted (time-dependent) GAE used by the PPO algorithm (#20850).

🔨 Fixes:

Smaller fixes and enhancements: #20704, #20541, #20793, #20743.

Tune

🎉 New Features:

Introduce TrialCheckpoint class, making checkpoint down/upload easie (#20585)

Add random state to BasicVariantGenerator (#20926)

Multi-objective support for Optuna (#20489)

💫Enhancements:

Add set_max_concurrency to Searcher API (#20576)

Allow for tuples in _split_resolved_unresolved_values. (#20794)

Show the name of training func, instead of just ImplicitFunction. (#21029)

Enforce one future at a time for any given trial at any given time. (#20783) move on_no_available_trials to a subclass under runner (#20809)

Clean up code (#20555, #20464, #20403, #20653, #20796, #20916, #21067)

Start restricting TrialRunner/Executor interface exposures. (#20656)

TrialExecutor should not take in Runner interface. (#20655)

🔨Fixes:

Deflake test_tune_restore.py (#20776)

Fix best_trial_str for nested custom parameter columns (#21078)

Fix checkpointing error message on K8s (#20559)

Fix testResourceScheduler and testMultiStepRun. (#20872)

Fix tune cloud tests for function and rllib trainables (#20536)

Move _head_bundle_is_empty after conversion (#21039)

Elongate test_trial_scheduler_pbt timeout. (#21120)

Train

🔨Fixes:

Ray Train environment variables are automatically propagated and do not need to be manually set on every node (#20523)

Various minor fixes and improvements (#20952, #20893, #20603, #20487) 📖Documentation:

Update saving/loading checkpoint docs (#20973). Thanks @jwyyy!

Various minor doc updates (#20877, #20683)

Serve

💫Enhancements:

Add validation to Serve AutoscalingConfig class (#20779)

Add Serve metric for HTTP error codes (#21009)

🔨Fixes:

No longer create placement group for deployment with no resources (#20471)

Log errors in deployment initialization/configuration user code (#20620)

Jobs

🎉 New Features:

Logs can be streamed from job submission server with ray job logs command (#20976)

Add documentation for ray job submission (#20530)

Propagate custom headers field to JobSubmissionClient and apply to all requests (#20663)

🔨Fixes:

Fix job serve accidentally creates local ray processes instead of connecting (#20705)

💫Enhancements:

[Jobs] Update CLI examples to use the same setup (#20844)

Thanks

Many thanks to all those who contributed to this release!

@dmatrix, @suquark, @tekumara, @jiaodong, @jovany-wang, @avnishn, @simon-mo, @iycheng, @SongGuyang, @ArturNiederfahrenhorst, @wuisawesome, @kfstorm, @matthewdeng, @jjyao, @chenk008, @Sertingolix, @larrylian, @czgdp1807, @scv119, @duburcqa, @runedog48, @Yard1, @robertnishihara, @geraint0923, @amogkam, @DmitriGekhtman, @ijrsvt, @kk-55, @lixin-wei, @mvindiola1, @hauntsaninja, @sven1977, @Hankpipi, @qbphilip, @hckuo, @newmanwang, @clay4444, @edoakes, @liuyang-my, @iasoon, @WangTaoTheTonic, @fgogolli, @dproctor, @gramhagen, @krfricke, @richardliaw, @bveeramani, @pcmoritz, @ericl, @simonsays1980, @carlogrisetti, @stephanie-wang, @AmeerHajAli, @mwtian, @xwjiang2010, @shrekris-anyscale, @n30111, @lchu-ibm, @Scalsol, @seonggwonyoon, @gjoliver, @qicosmos, @xychu, @iamhatesz, @architkulkarni, @jwyyy, @rkooo567, @mattip, @ckw017, @MissiontoMars, @clarkzinzow
Source code(tar.gz)
Source code(zip)
ray-1.9.2(Jan 11, 2022)

Patch release to bump the log4j version from 2.16.0 to 2.17.0. This resolves the security issue CVE-2021-45105.
Source code(tar.gz)
Source code(zip)
ray-1.9.1(Dec 22, 2021)

Patch release to bump the log4j2 version from 2.14 to 2.16. This resolves the security vulnerabilities https://nvd.nist.gov/vuln/detail/CVE-2021-44228 and https://nvd.nist.gov/vuln/detail/CVE-2021-45046.

No library or core changes included.

Thanks @seonggwonyoon and @ijrsvt for contributing the fixes!
Source code(tar.gz)
Source code(zip)
ray-1.9.0(Dec 3, 2021)
Highlights

Ray Train is now in beta! If you are using Ray Train, we’d love to hear your feedback here!

Ray Docker images for multiple CUDA versions are now provided (#19505)! You can specify a -cuXXX suffix to pick a specific version.

ray-ml:cpu images are now deprecated. The ray-ml images are only built for GPU.

Ray Datasets now supports groupby and aggregations! See the groupby API and GroupedDataset docs for usage.

We are making continuing progress in improving Ray stability and usability on Windows. We encourage you to try it out and report feedback or issues at https://github.com/ray-project/ray/issues.

We are launching a Ray Job Submission server + CLI & SDK clients to make it easier to submit and monitor Ray applications when you don’t want an active connection using Ray Client. This is currently in alpha, so the APIs are subject to change, but please test it out and file issues / leave feedback on GitHub & discuss.ray.io!

Ray Autoscaler

💫Enhancements:

Graceful termination of Ray nodes prior to autoscaler scale down (#20013)

Ray Clusters on AWS are colocated in one Availability Zone to reduce costs & latency (#19051)

Ray Client

🔨 Fixes:

ray.put on a list of of objects now returns a single object ref (#19737)

Ray Core

🎉 New Features:

Support remote file storage for runtime_env (#20280, #19315)

Added ray job submission client, cli and rest api (#19567, #19657, #19765, #19845, #19851, #19843, #19860, #19995, #20094, #20164, #20170, #20192, #20204)

💫Enhancements:

Garbage collection for runtime_env (#20009, #20072)

Improved logging and error messages for runtime_env (#19897, #19888, #18893)

🔨 Fixes:

Fix runtime_env hanging issues (#19823)

Fix specifying runtime env in @ray.remote decorator with Ray Client (#19626)

Threaded actor / core worker / named actor race condition fixes (#19751, #19598, #20178, #20126)

📖Documentation:

New page “Handling Dependencies”

New page “Ray Job Submission: Going from your laptop to production”

Ray Java

API Changes:

Fully supported namespace APIs. (Check out the namespace for more information.) #19468 #19986 #20057

Removed global named actor APIs and global placement group APIs. #20219 #20135

Added timeout parameter for Ray.Get() API. #20282

Note:

Use Ray.getActor(name, namespace) API to get a named actor between jobs instead of Ray.getGlobalActor(name).

Use PlacementGroup.getPlacementGroup(name, namespace) API to get a placement group between jobs instead of PlacementGroup.getGlobalPlacementGroup(name).

Ray Datasets

🎉 New Features:

Added groupby and aggregations (#19435, #19673, #20010, #20035, #20044, #20074)

Support custom write paths (#19347)

🔨 Fixes:

Support custom CSV write options (#19378)

🏗 Architecture refactoring:

Optimized block compaction (#19681)

Ray Workflow

🎉 New Features:

Workflow right now support events (#19239)

Allow user to specify metadata for workflow and steps (#19372)

Allow in-place run a step if the resources match (#19928)

🔨 Fixes:

Fix the s3 path issue (#20115)

RLlib

🏗 Architecture refactoring:

“framework=tf2” + “eager_tracing=True” is now (almost) as fast as “framework=tf”. A check for tf2.x eager re-traces has been added making sure re-tracing does not happen outside the initial function calls. All CI learning tests (CartPole, Pendulum, FrozenLake) are now also run as framework=tf2. (#19273, #19981, #20109)

Prepare deprecation of build_trainer/build_(tf_)?policy utility functions. Instead, use sub-classing of Trainer or Torch|TFPolicy. POCs done for PGTrainer, PPO[TF|Torch]Policy. (#20055, #20061)

V-trace (APPO & IMPALA): Don’t drop last ts can be optionally switch on. The default is still to drop it, but this may be changed in a future release. (#19601)

Upgrade to gym 0.21. (#19535)

🔨 Fixes:

Minor bugs/issues fixes and enhancements: #19069, #19276, #19306, #19408, #19544, #19623, #19627, #19652, #19693, #19805, #19807, #19809, #19881, #19934, #19945, #20095, #20128, #20134, #20144, #20217, #20283, #20366, #20387

📖Documentation:

RLlib main page (“RLlib in 60sec”) overhaul. (#20215, #20248, #20225, #19932, #19982)

Major docstring cleanups in preparation for complete overhaul of API reference pages. (#19784, #19783, #19808, #19759, #19829, #19758, #19830)

Other documentation enhancements. (#19908, #19672, #20390)

Tune

💫Enhancements:

Refactored and improved experiment analysis (#20197, #20181)

Refactored cloud checkpointing API/SyncConfig (#20155, #20418, #19632, #19641, #19638, #19880, #19589, #19553, #20045, #20283)

Remove magic results (e.g. config) before calculating trial result metrics (#19583)

Removal of tech debt (#19773, #19960, #19472, #17654)

Improve testing (#20016, #20031, #20263, #20210, #19730

Various enhancements (#19496, #20211)

🔨Fixes:

Documentation fixes (#20130, #19791)

Tutorial fixes (#20065, #19999)

Drop 0 value keys from PGF (#20279)

Fix shim error message for scheduler (#19642)

Avoid looping through _live_trials twice in _get_next_trial. (#19596)

clean up legacy branch in update_avail_resources. (#20071)

fix Train/Tune integration on Client (#20351)

Train

Ray Train is now in Beta! The beta version includes various usability improvements for distributed PyTorch training and checkpoint management, support for Ray Client, and an integration with Ray Datasets for distributed data ingest.

Check out the docs here, and the migration guide from Ray SGD to Ray Train here. If you are using Ray Train, we’d love to hear your feedback here!

🎉 New Features:

New train.torch.prepare_model(...) and train.torch.prepare_data_loader(...) API to automatically handle preparing your PyTorch model and DataLoader for distributed training (#20254).

Checkpoint management and support for custom checkpoint strategies (#19111).

Easily configure what and how many checkpoints to save to disk.

Support for Ray Client (#20123, #20351).

💫Enhancements:

Simplify workflow for training with a single worker (#19814).

Ray Placement Groups are used for scheduling the training workers (#20091).

PACK strategy is used by default but can be changed by setting the TRAIN_ENABLE_WORKER_SPREAD environment variable.

Automatically unwrap Torch DDP model and convert to CPU when saving a model as checkpoint (#20333).

🔨Fixes:

Fix HorovodBackend to automatically detect NICs- thanks @tgaddair! (#19533).

📖Documentation:

Denote public facing APIs with beta stability (#20378)

Doc updates (#20271)

Serve

We would love to hear from you! Fill out the Ray Serve survey here.

🎉 New Features:

New checkpoint_path configuration allows Serve to save its internal state to external storage (disk, S3, and GCS) and recover upon failure. (#19166, #19998, #20104)

Replica autoscaling is ready for testing out! (#19559, #19520)

Native Pipeline API for model composition is ready for testing as well!

🔨Fixes:

Serve deployment functions or classes can take no parameters (#19708)

Replica slow start message is improved. You can now see whether it is slow to allocate resources or slow to run constructor. (#19431)

pip install ray[serve] will now install ray[default] as well. (#19570)

🏗 Architecture refactoring:

The terminology of “backend” and “endpoint” are officially deprecated in favor of “deployment”. (#20229, #20085, #20040, #20020, #19997, #19947, #19923, #19798).

Progress towards Java API compatibility (#19463).

Dashboard

Ray Dashboard is now enabled on Windows! (#19575)

Thanks

Many thanks to all those who contributed to this release! @krfricke, @stefanbschneider, @ericl, @nikitavemuri, @qicosmos, @worldveil, @triciasfu, @AmeerHajAli, @javi-redondo, @architkulkarni, @pdames, @clay4444, @mGalarnyk, @liuyang-my, @matthewdeng, @suquark, @rkooo567, @mwtian, @chenk008, @dependabot[bot], @iycheng, @jiaodong, @scv119, @oscarknagg, @Rohan138, @stephanie-wang, @Zyiqin-Miranda, @ijrsvt, @roireshef, @tkaymak, @simon-mo, @ashione, @jovany-wang, @zenoengine, @tgaddair, @11rohans, @amogkam, @zhisbug, @lchu-ibm, @shrekris-anyscale, @pcmoritz, @yiranwang52, @mattip, @sven1977, @Yard1, @DmitriGekhtman, @ckw017, @WangTaoTheTonic, @wuisawesome, @kcpevey, @kfstorm, @rhamnett, @renos, @TeoZosa, @SongGuyang, @clarkzinzow, @avnishn, @iasoon, @gjoliver, @jjyao, @xwjiang2010, @dmatrix, @edoakes, @czgdp1807, @heng2j, @sungho-joo, @lixin-wei
Source code(tar.gz)
Source code(zip)
ray-1.8.0(Nov 2, 2021)
Highlights

Ray SGD has been rebranded to Ray Train! The new documentation landing page can be found here.

Ray Datasets is now in beta! The beta release includes a new integration with Ray Train yielding scalable ML ingest for distributed training. Check out the docs here, try it out for your ML ingest and batch inference workloads, and let us know how it goes!

This Ray release supports Apple Silicon (M1 Macs). Check out the installation instructions for more information!

Ray Autoscaler

🎉 New Features:

Fake multi-node mode for autoscaler testing (#18987)

💫Enhancements:

Improve unschedulable task warning messages by integrating with the autoscaler (#18724)

Ray Client

💫Enhancements

Use async rpc for remote call and actor creation (#18298)

Ray Core

💫Enhancements

Eagerly install job-level runtime_env (#19449, #17949)

🔨 Fixes:

Fixed resource demand reporting for infeasible 1-CPU tasks (#19000)

Fixed printing Python stack trace in Python worker (#19423)

Fixed macOS security popups (#18904)

Fixed thread safety issues for coreworker (#18902, #18910, #18913 #19343)

Fixed placement group performance and resource leaking issues (#19277, #19141, #19138, #19129, #18842, #18652)

Improve unschedulable task warning messages by integrating with the autoscaler (#18724)

Improved Windows support (#19014, #19062, #19171, #19362)

Fix runtime_env issues (#19491, #19377, #18988)

Ray Data

Ray Datasets is now in beta! The beta release includes a new integration with Ray Train yielding scalable ML ingest for distributed training. It supports repeating and rewindowing pipelines, zipping two pipelines together, better cancellation of Datasets workloads, and many performance improvements. Check out the docs here, try it out for your ML ingest and batch inference workloads, and let us know how it goes!

🎉 New Features:

Ray Train integration (#17626)

Add support for repeating and rewindowing a DatasetPipeline (#19091)

.iter_epochs() API for iterating over epochs in a DatasetPipeline (#19217)

Add support for zipping two datasets together (#18833)

Transformation operations are now cancelled when one fails or the entire workload is killed (#18991)

Expose from_pandas()/to_pandas() APIs that accept/return plain Pandas DataFrames (#18992)

Customize compression, read/write buffer size, metadata, etc. in the IO layer (#19197)

Add spread resource prefix for manual round-robin resource-based task load balancing

💫Enhancements:

Minimal rows are now dropped when doing an equalized split (#18953)

Parallelized metadata fetches when reading Parquet datasets (#19211)

🔨 Fixes:

Tensor columns now properly support table slicing (#19534)

Prevent Datasets tasks from being captured by Ray Tune placement groups (#19208)

Empty datasets are properly handled in most transformations (#18983)

🏗 Architecture refactoring:

Tensor dataset representation changed to a table with a single tensor column (#18867)

RLlib

🎉 New Features:

Allow n-step > 1 and prioritized replay for R2D2 and RNNSAC agents. (18939)

🔨 Fixes:

Fix memory leaks in TF2 eager mode. (#19198)

Faster worker spaces inference if specified through configuration. (#18805)

Fix bug for complex obs spaces containing Box([2D shape]) and discrete components. (#18917)

Torch multi-GPU stats not protected against race conditions. (#18937)

Fix SAC agent with dict space. (#19101)

Fix A3C/IMPALA in multi-agent setting. (#19100)

🏗 Architecture refactoring:

Unify results dictionary returned from Trainer.train() across agents regardless of (tf or pytorch, multi-agent, multi-gpu, or algos that use >1 SGD iterations, e.g. ppo) (#18879)

Ray Workflow

🎉 New Features:

Introduce workflow.delete (#19178)

🔨Fixes:

Fix the bug which allow workflow step to be executed multiple times (#19090)

🏗 Architecture refactoring:

Object reference serialization is decoupled from workflow storage (#18328)

Tune

🎉 New Features:

PBT: Add burn-in period (#19321)

💫Enhancements:

Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once (#19144)

Use queue to display JupyterNotebookReporter updates in Ray client (#19137)

Add resume="AUTO" and enhance resume error messages (#19181)

Provide information about resource deadlocks, early stopping in Tune docs (#18947)

Fix HEBOSearch installation docs (#18861)

OptunaSearch: check compatibility of search space with evaluated_rewards (#18625)

Add save and restore methods for searchers that were missing it & test (#18760)

Add documentation for reproducible runs (setting seeds) (#18849)

Depreciate max_concurrent in TuneBOHB (#18770)

Add on_trial_result to ConcurrencyLimiter (#18766)

Ensure arguments passed to tune remote_run match (#18733)

Only disable ipython in remote actors (#18789)

🔨Fixes:

Only try to sync driver if sync_to_driver is actually enabled (#19589)

sync_client: Fix delete template formatting (#19553)

Force no result buffering for hyperband schedulers (#19140)

Exclude trial checkpoints in experiment sync (#19185)

Fix how durable trainable is retained in global registry (#19223, #19184)

Ensure loc column in progress reporter is filled (#19182)

Deflake PBT Async test (#19135)

Fix Analysis.dataframe() documentation and enable passing of mode=None (#18850)

Ray Train (SGD)

Ray SGD has been rebranded to Ray Train! The new documentation landing page can be found here. Ray Train is integrated with Ray Datasets for distributed data loading while training, documentation available here.

🎉 New Features:

Ray Datasets Integration (#17626)

🔨Fixes:

Improved support for multi-GPU training (#18824, #18958)

Make actor creation async (#19325)

📖Documentation:

Rename Ray SGD v2 to Ray Train (#19436)

Added migration guide from Ray SGD v1 (#18887)

Serve

🎉 New Features:

Add ability to recover from a checkpoint on cluster failure (#19125)

Support kwargs to deployment constructors (#19023)

🔨Fixes:

Fix asyncio compatibility issue (#19298)

Catch spurious ConnectionErrors during shutdown (#19224)

Fix error with uris=None in runtime_env (#18874)

Fix shutdown logic with exit_forever (#18820)

🏗 Architecture refactoring:

Progress towards Serve autoscaling (#18793, #19038, #19145)

Progress towards Java support (#18630)

Simplifications for long polling (#19154, #19205)

Dashboard

🎉 New Features:

Basic support for the dashboard on Windows (#19319)

🔨Fixes:

Fix healthcheck issue causing the dashboard to crash under load (#19360)

Work around aiohttp 4.0.0+ issues (#19120)

🏗 Architecture refactoring:

Improve dashboard agent retry logic (#18973)

Thanks

Many thanks to all those who contributed to this release! @rkooo567, @lchu-ibm, @scv119, @pdames, @suquark, @antoine-galataud, @sven1977, @mvindiola1, @krfricke, @ijrsvt, @sighingnow, @marload, @jmakov, @clay4444, @mwtian, @pcmoritz, @iycheng, @ckw017, @chenk008, @jovany-wang, @jjyao, @hauntsaninja, @franklsf95, @jiaodong, @wuisawesome, @odp, @matthewdeng, @duarteocarmo, @czgdp1807, @gjoliver, @mattip, @richardliaw, @max0x7ba, @Jasha10, @acxz, @xwjiang2010, @SongGuyang, @simon-mo, @zhisbug, @ccssmnn, @Yard1, @hazeone, @o0olele, @froody, @robertnishihara, @amogkam, @sasha-s, @xychu, @lixin-wei, @architkulkarni, @edoakes, @clarkzinzow, @DmitriGekhtman, @avnishn, @liuyang-my, @stephanie-wang, @Chong-Li, @ericl, @juliusfrost, @carlogrisetti
Source code(tar.gz)
Source code(zip)
ray-1.7.0(Oct 7, 2021)
Highlights

Ray SGD v2 is now in alpha! The v2 version introduces APIs that focus on ease of use and composability. Check out the docs here, and the migration guide from v1 to v2 here.

If you are using Ray SGD v2, we’d love to hear your feedback here!

Ray Workflows is now in alpha! Check out the docs here and try it out for your large-scale data science, ML, and long-running business workflows. Thanks to our early adopters for the feedback so far and the ongoing contributions from IBM Research.

We have made major enhancements to C++ API! While we are still busy hardening the feature for production usage, please check out the docs here, try it out, and help provide feedback!

Ray Autoscaler

💫Enhancements:

Improvement to logging and code structure #18180

Default head node type to 0 max_workers #17757

Modifications to accommodate custom node providers #17312

🔨 Fixes:

Helm chart configuration fixes #17678 #18123

GCP autoscaler config fix #18653

Allow attaching to uninitialized head node for debugging #17688

Syncing files with Docker head node fixed #16515

Ray Client

🎉 New Features:

ray.init() args can be forwarded to remote server (#17776)

Allow multiple client connections from one driver (#17942)

gRPC channel credentials can now be configured from ray.init (#18425, #18365)

Ray Client will attempt to recover connections on certain gRPC failures (#18329)

💫Enhancements

Less confusing client RPC errors (#18278)

Use a single RPC to fetch ClientObjectRefs passed in a list (#16944)

Increase timeout for ProxyManager.get_channel (#18350)

🔨 Fixes:

Fix mismatched debug log ID formats (#17597)

Fix confusing error messages when client scripts exit (#17969)

Ray Core

🎉 New Features:

Major enhancements in the C++ API!

This API library enables you to build a C++ distributed system easily, just like the Python API and the Java API.

Run pip install -U ray[cpp] to install Ray with C++ API support.

Run ray cpp --help to learn how to use it.

For more details, check out the docs here and see the tab “C++”.

🔨 Fixes:

Bug fixes for thread-safety / reference count issues / placement group (#18401, #18746, #18312, #17802, #18526, #17863, #18419, #18463, #18193, #17774, #17772, #17670, #17620, #18584, #18646, #17634, #17732)

Better format for object loss errors / task & actor logs (#18742, #18577, #18105, #18292, #17971, #18166)

Improved the ray status output for placement groups (#18289, #17892)

Improved the function export performance (#18284)

Support more Ray core metrics such as RPC call latencies (#17578)

Improved error messages and logging for runtime environments (#18451, #18092, #18088, #18084, #18496, #18083)

Ray Data Processing

🎉 New Features:

Add support for reading partitioned Parquet datasets (#17716)

Add dataset unioning (#17793)

Add support for splitting a dataset at row indices (#17990)

Add from_numpy() and to_numpy() APIs (#18146)

Add support for splitting a dataset pipeline at row indices (#18243)

Add Modin integration (from_modin() and to_modin()) (#18122)

Add support for datasets with tensor columns (#18301)

Add RayDP (Spark-on-Ray) integration (from_spark() and to_spark()) (#17340)

💫Enhancements

Drop empty tables when read Parquet fragments in order to properly support filter expressions when reading partitioned Parquet datasets (#18098)

Retry application-level errors in Datasets (#18296)

Create a directory on write if it doesn’t exist (#18435)

URL encode paths if they are URLs (#18440)

Guard against a dataset pipeline being read multiple times on accident (#18682)

Reduce working set size during random shuffles by eagerly destroying intermediate datasets (#18678)

Add manual round-robin resource-based load balancing option to read and shuffle stages (#18678)

🔨 Fixes:

Fix JSON writing so IO roundtrip works (#17691)

Fix schema subsetting on column selection during Parquet reads (#18361)

Fix Dataset.iter_batches() dropping batches when prefetching (#18441)

Fix filesystem inference on path containing space (#18644)

🏗 Architecture refactoring:

Port write side of IO layer to use file-based datasources (#18135)

RLlib

🎉 New Features:

Replay buffers: Add config option to store contents in checkpoints (store_buffer_in_checkpoints=True). (#17999)

Add support for multi-GPU to DDPG. (#17789)

💫Enhancements:

Support for running evaluation and training in parallel, thereby only evaluating as many episodes as the training loop takes (evaluation_num_episodes=”auto”). (#18380)

Enhanced stability: Started nightly multi-GPU (2) learning tests for most algos (tf + torch), including LSTM and attention net setups.

🏗 Architecture refactoring:

Make MultiAgentEnv inherit gym.Env to avoid direct class type manipulation (#18156)

SampleBatch: Add support for nested data (+Docstring- and API cleanups). (#17485)

Add policies arg to callback: on_episode_step (already exists in all other episode-related callbacks) (#18119)

Add worker arg (optional) to policy_mapping_fn. (#18184)

🔨 Fixes:

Fix Atari learning test regressions (2 bugs) and 1 minor attention net bug. (#18306)

Fix n-step > 1 postprocessing bug (issues 17844, 18034). (#18358)

Fix crash when using StochasticSampling exploration (most PG-style algos) w/ tf and numpy version > 1.19.5 (#18366)

Strictly run evaluation_num_episodes episodes each evaluation run (no matter the other eval config settings). (#18335)

Issue 17706: AttributeError: 'numpy.ndarray' object has no attribute 'items'" on certain turn-based MultiAgentEnvs with Dict obs space. (#17735)

Issue 17900: Set seed in single vectorized sub-envs properly, if num_envs_per_worker > 1 (#18110)

Fix R2D2 (torch) multi-GPU issue. (#18550)

Fix final_scale's default value to 0.02 (see OrnsteinUhlenbeck exploration). (#18070)

Ape-X doesn't take the value of prioritized_replay into account (#17541)

Issue 17653: Torch multi-GPU (>1) broken for LSTMs. (#17657)

Issue 17667: CQL-torch + GPU not working (due to simple_optimizer=False; must use simple optimizer!). (#17742)

Add locking to PolicyMap in case it is accessed by a RolloutWorker and the same worker's AsyncSampler or the main LearnerThread. (#18444)

Other fixes and enhancements: #18591, #18381, #18670, #18705, #18274, #18073, #18017, #18389, #17896, #17410, #17891, #18368, #17778, #18494, #18466, #17705, #17690, #18254, #17701, #18544, #17889, #18390, #18428, #17821, #17955, #17666, #18423, #18040, #17867, #17583, #17822, #18249, #18155, #18065, #18540, #18367, #17960, #17895, #18467, #17928, #17485, #18307, #18043, #17640, #17702, #15849, #18340

Tune

💫Enhancements:

Usability improvements when trials appear to be stuck in PENDING state forever when the cluster has insufficient resources. (#18611, #17957, #17533)

Searchers and Tune Callbacks now have access to some experiment settings information. (#17724, #17794)

Improve HyperOpt KeyError message when metric was not found. (#18549)

Allow users to configure bootstrap for docker syncer. (#17786)

Allow users to update trial resources on resume. (#17975)

Add max_concurrent_trials argument to tune.run. (#17905)

Type hint TrialExecutor. Use Abstract Base Class. (#17584)

Add developer/stability annotations. (#17442)

🔨Fixes:

Placement group stability issues. (#18706, #18391, #18338)

Fix a DurableTrainable checkpointing bug. (#18318)

Fix a trial reset bug if a RLlib algorithm with default resources is used. (#18209)

Fix hyperopt points to evaluate for nested lists. (#18113)

Correctly validate initial points for random search. (#17282)

Fix local mode. Add explicit concurrency limiter for local mode. (#18023)

Sanitize trial checkpoint filename. (#17985)

Explicitly instantiate skopt categorical spaces. (#18005)

SGD (v2)

Ray SGD v2 is now in Alpha! The v2 version introduces APIs that focus on ease of use and composability. Check out the docs here, and the migration guide from v1 to v2 here. If you are using Ray SGD v2, we’d love to hear your feedback here!

🎉 New Features:

Ray SGD v2

Horovod Backend (#18047)

JSON Callback (#17619) and Tensorboard Callback (#17824)

Checkpointing Support (#17632, #17807)

Fault Tolerance (#18090)

Integration with Ray Tune (#17839, #18179)

Custom resources per worker (#18327)

Low-level Stateful Class API (#18728)

📖 Documentation:

Ray SGD: Deep Learning on Ray (#18270)

Serve

↗️Deprecation and API changes:

serve.start(http_host=..., http_port=..., http_middlewares=...) has been deprecated since Ray 1.2.0. They are now removed in favor of serve.start(http_options={“host”: …, “port”: …, “middlewares”: …). (#17762)

Remove deprecated ServeRequest API (#18120)

Remove deprecated endpoints API (#17989)

🎉 New Features:

Serve checkpoint with cluster failure recovery from disk and S3 (#17622, #18293, #18657)

🔨Fixes:

Better serve constructor failure handling (#16922, #18402)

Fix get_handle execution from threads (#18198)

Remove requirement to specify namespace for serve.start(detached=True) (#17470)

🏗 Architecture refactoring:

Progress towards replica autoscaling (#18658)

Dashboard

🎉 New Features:

Ray system events are now published in experimental dashboard (#18330, pop #18698)

Actor page will now show actors with PENDING_CREATION status (#18666)

Thanks

Many thanks to all those who contributed to this release! @scottsun94, @hngenc, @iycheng, @asm582, @jkterry1, @ericl, @thomasdesr, @ryanlmelvin, @ellimac54, @Bam4d, @gjoliver, @juliusfrost, @simon-mo, @ashione, @RaphaelCS, @simonsays1980, @suquark, @jjyao, @lixin-wei, @77loopin, @Ivorforce, @DmitriGekhtman, @dependabot[bot], @souravraha, @robertnishihara, @richardliaw, @SongGuyang, @rkooo567, @edoakes, @jsuarez5341, @zhisbug, @clarkzinzow, @triciasfu, @architkulkarni, @akern40, @liuyang-my, @krfricke, @amogkam, @Jingyu-Peng, @xwjiang2010, @nikitavemuri, @hauntsaninja, @fyrestone, @navneet066, @ijrsvt, @mwtian, @sasha-s, @raulchen, @holdenk, @qicosmos, @Yard1, @yuduber, @mguarin0, @MissiontoMars, @stephanie-wang, @stefanbschneider, @sven1977, @AmeerHajAli, @matthewdeng, @chenk008, @jiaodong, @clay4444, @ckw017, @tchordia, @ThomasLecat, @Chong-Li, @jmakov, @jovany-wang, @tdhopper, @kfstorm, @wgifford, @mxz96102, @WangTaoTheTonic, @lada-kunc, @scv119, @kira-lin, @wuisawesome
Source code(tar.gz)
Source code(zip)
ray-1.6.0(Aug 23, 2021)
Highlights

Runtime Environments are ready for general use! This feature enables you to dynamically specify per-task, per-actor and per-job dependencies, including a working directory, environment variables, pip packages and conda environments. Install it with pip install -U 'ray[default]'.

Ray Dataset is now in alpha! Dataset is an interchange format for distributed datasets, powered by Arrow. You can also use it for a basic Ray native data processing experience. Check it out here.

Ray Lightning v0.1 has been released! You can install it via pip install ray-lightning. Ray Lightning is a library of PyTorch Lightning plugins for distributed training using Ray. Features:

Enables quick and easy parallel training

Supports PyTorch DDP, Horovod, and Sharded DDP with Fairscale

Integrates with Ray Tune for hyperparameter optimization and is compatible with Ray Client

pip install ray now has a significantly reduced set of dependencies. Features such as the dashboard, the cluster launcher, runtime environments, and observability metrics may require pip install -U 'ray[default]' to be enabled. Please report any issues on Github if this is an issue!

Ray Autoscaler

🎉 New Features:

The Ray autoscaler now supports TPUs on GCP. Please refer to this example for spinning up a simple TPU cluster. (#17278)

💫Enhancements:

Better AWS networking configurability (#17236 #17207 #14080)

Support for running autoscaler without NodeUpdaters (#17194, #17328)

🔨 Fixes:

Code clean up and corrections to downscaling policy (#17352)

Docker file sync fix (#17361)

Ray Client

💫Enhancements:

Updated docs for client server ports and ray.init(ray://) (#17003, #17333)

Better error handling for deserialization failures (#17035)

🔨 Fixes:

Fix for server proxy not working with non-default redis passwords (#16885)

Ray Core

🎉 New Features:

Runtime Environments are ready for general use!

Specify a working directory to upload your local files to all nodes in your cluster.

Specify different conda and pip dependencies for your tasks and actors and have them installed on the fly.

🔨 Fixes:

Fix plasma store bugs for better data processing stability (#16976, #17135, #17140, #17187, #17204, #17234, #17396, #17550)

Fix a placement group bug where CUDA_VISIBLE_DEVICES were not properly detected (#17318)

Improved Ray stacktrace messages. (#17389)

Improved GCS stability and scalability (#17456, #17373, #17334, #17238, #17072)

🏗 Architecture refactoring:

Plasma store refactor for better testability and extensibility. (#17332, #17313, #17307)

Ray Data Processing

Ray Dataset is now in alpha! Dataset is an interchange format for distributed datasets, powered by Arrow. You can also use it for a basic Ray native data processing experience. Check it out here.

RLLib

🎉 New Features:

Support for RNN/LSTM models with SAC (new agent: "RNNSAC"). Shoutout to ddworak94! (#16577)

Support for ONNX model export (tf and torch). (#16805)

Allow Policies to be added to/removed from a Trainer on-the-fly. (#17566)

🔨 Fixes:

Fix for view requirements captured during compute actions test pass. Shoutout to Chris Bamford (#15856)

Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use ray.get_gpu_ids() (b/c no GPUs assigned by ray). (#17444)

Other bug fixes: #15709, #15911, #16083, #16716, #16744, #16896, #16999, #17010, #17014, #17118, #17160, #17315, #17321, #17335, #17341, #17356, #17460, #17543, #17567, #17587

🏗 Architecture refactoring:

CV2 to Skimage dependency change (CV2 still supported). Shoutout to Vince Jankovics. (#16841)

Unify tf and torch policies wrt. multi-GPU handling: PPO-torch is now 33% faster on Atari and 1 GPU. (#17371)

Implement all policy maps inside RolloutWorkers to be LRU-caches so that a large number of policies can be added on-the-fly w/o running out of memory. (#17031)

Move all tf static-graph code into DynamicTFPolicy, such that policies can be deleted and their tf-graph is GC'd. (#17169)

Simplify multi-agent configs: In most cases, creating dummy envs (only to retrieve spaces) are no longer necessary. (#16565, #17046)

📖Documentation:

Examples scripts do-over (shoutout to Stefan Schneider for this initiative).

Example script: League-based self-play with "open spiel" env. (#17077)

Other doc improvements: #15664 (shoutout to kk-55), #17030, #17530

Tune

🎉 New Features:

Dynamic trial resource allocation with ResourceChangingScheduler (#16787)

It is now possible to use a define-by-run function to generate a search space with OptunaSearcher (#17464)

💫Enhancements:

String names of searchers/schedulers can now be used directly in tune.run (#17517)

Filter placement group resources if not in use (progress reporting) (#16996)

Add unit tests for flatten_dict (#17241)

🔨Fixes:

Fix HDFS sync down template (#17291)

Re-enable TensorboardX without Torch installed (#17403)

📖Documentation:

LightGBM integration (#17304)

Other documentation improvements: #17407 (shoutout to amavilla), #17441, #17539, #17503

SGD

🎉 New Features:

We have started initial development on a new RaySGD v2! We will be rolling it out in a future version of Ray. See the documentation here. (#17536, #17623, #17357, #17330, #17532, #17440, #17447, #17300, #17253)

💫Enhancements:

Placement Group support for TorchTrainer (#17037)

Serve

🎉 New Features:

Add Ray API stability annotations to Serve, marking many serve.\* APIs as Stable (#17295)

Support runtime_env's working_dir for Ray Serve (#16480)

🔨Fixes:

Fix FastAPI's response_model not added to class based view routes (#17376)

Replace backend with deployment in metrics & logging (#17434)

🏗Stability Enhancements:

Run Ray Serve with multi & single deployment large scale (1K+ cores) test running nightly (#17310, #17411, #17368, #17026, #17277)

Thanks

Many thanks to all who contributed to this release:

@suquark, @xwjiang2010, @clarkzinzow, @kk-55, @mGalarnyk, @pdames, @Souphis, @edoakes, @sasha-s, @iycheng, @stephanie-wang, @antoine-galataud, @scv119, @ericl, @amogkam, @ckw017, @wuisawesome, @krfricke, @vakker, @qingyun-wu, @Yard1, @juliusfrost, @DmitriGekhtman, @clay4444, @mwtian, @corentinmarek, @matthewdeng, @simon-mo, @pcmoritz, @qicosmos, @architkulkarni, @rkooo567, @navneet066, @dependabot[bot], @jovany-wang, @kombuchafox, @thomasjpfan, @kimikuri, @Ivorforce, @franklsf95, @MissiontoMars, @lantian-xu, @duburcqa, @ddworak94, @ijrsvt, @sven1977, @kira-lin, @SongGuyang, @kfstorm, @Rohan138, @jamesmishra, @amavilla, @fyrestone, @lixin-wei, @stefanbschneider, @jiaodong, @richardliaw, @WangTaoTheTonic, @chenk008, @Catch-Bull, @Bam4d
Source code(tar.gz)
Source code(zip)
ray-1.5.2(Aug 12, 2021)

Cherrypick release to address RLlib issue, no library or core changes included.
Source code(tar.gz)
Source code(zip)
ray-1.5.1(Jul 31, 2021)

Cherrypick release to address a few external integration and documentation issues, no library or core changes included.
Source code(tar.gz)
Source code(zip)
ray-1.5.0(Jul 26, 2021)
Ray 1.5.0 Release Note

Highlight

Ray Datasets is now in alpha (https://docs.ray.io/en/master/data/dataset.html)

LightGBM on Ray is now in beta (https://github.com/ray-project/lightgbm_ray).

enables multi-node and multi-GPU training

integrates seamlessly with distributed hyperparameter optimization library Ray Tune

comes with fault tolerance handling mechanisms, and

supports distributed dataframes and distributed data loading

Ray Autoscaler

🎉 New Features:

Aliyun support (#15712)

💫 Enhancements:

[Kubernetes] Operator refactored to use Kopf package (#15787)

Flag to control config bootstrap for rsync (#16667)

Prometheus metrics for Autoscaler (#16066, #16198)

Allows launching in subnets where public IP assignments off by default (#16816)

🔨 Fixes:

[Kubernetes] Fix GPU=0 resource handling (#16887)

[Kubernetes] Release docs updated with K8s test instructions (#16662)

[Kubernetes] Documentation update (#16570)

[Kubernetes] All official images set to rayproject/ray:latest (#15988 #16205)

[Local] Fix bootstrapping ray at a given static set of ips (#16202, #16281)

[Azure] Fix Azure Autoscaling Failures (#16640)

Handle node type key change / deletion (#16691)

[GCP] Retry GCP BrokenPipeError (#16952)

Ray Client

🎉 New Features:

Client integrations with major Ray Libraries (#15932, #15996, #16103, #16034, #16029, #16111, #16301)

Client Connect now returns a context that hasdisconnect and can be used as a context manager (#16021)

💫 Enhancements:

Better support for multi-threaded client-side applications (#16731, #16732)

Improved error messages and warnings when misusing Ray Client (#16454, #16508, #16588, #16163)

Made Client Object & Actor refs a subclass of their non-client counterparts (#16110)

🔨 Fixes:

dir() Works for client-side Actor Handles (#16157)

Avoid server-side time-outs (#16554)

Various fixes to the client-server proxy (#16040, #16038, #16057, #16180)

Ray Core

🎉 New Features:

Ray dataset alpha is available!

🔨 Fixes:

Fix various Ray IO layer issues that fixes hanging & high memory usage (#16408, #16422, #16620, #16824, #16791, #16487, #16407, #16334, #16167, #16153, #16314, #15955, #15775)

Namespace now properly isolates placement groups (#16000)

More efficient object transfer for spilled objects (#16364, #16352)

🏗 Architecture refactoring:

From Ray 1.5.0, liveness of Ray jobs are guaranteed as long as there’s enough disk space in machines with the “fallback allocator” mechanism which allocates plasma objects to the disk directly when objects cannot be created in memory or spilled to the disk.

RLlib

🎉 New Features:

Support for adding/deleting Policies to a Trainer on-the-fly (#16359, #16569, #16927).

Added new “input API” for customizing offline datasets (shoutout to Julius F.). (#16957)

Allow for external env PolicyServer to listen on n different ports (given n rollout workers); No longer require creating an env on the server side to get env’s spaces. (#16583).

🔨 Fixes:

CQL: Bug fixes and clean-ups (fixed iteration count). (#16531, #16332)

D4RL: #16721

ensure curiosity exploration actions are passed in as tf tensors (shoutout to Manny V.). (#15704)

Other bug fixes and cleanups: #16162 and #16309 (shoutout to Chris B.), #15634, #16133, #16860, #16813, #16428, #16867, #16354, #16218, #16118, #16429, #16427, #16774, #16734, #16019, #16171, #16830, #16722

📖 Documentation and testing:

#16311, #15908, #16271, #16080, #16740, #16843

🏗 Architecture refactoring:

All RLlib algos operating on Box action spaces now operate on normalized actions by default (ranging from -1.0 to 1.0). This enables PG-style algos to learn in skewed action spaces. (#16531)

Tune

🎉 New Features:

New integration with LightGBM via Tune callbacks (#16713).

New cost-efficient HPO searchers (BlendSearch and CFO) available from the FLAML library (https://github.com/microsoft/FLAML). (#16329)

💫 Enhancements:

Pass in configurations that have already been evaluated separately to Searchers. This is useful for warm-starting or for meta-searchers, for example (#16485)

Sort trials in reporter table by metric (#16576)

Add option to keep random values constant over grid search (#16501)

Read trial results from json file (#15915)

🔨 Fixes:

Fix infinite loop when using Searcher that limits concurrency internally in conjunction with a ConcurrencyLimiter (#16416)

Allow custom sync configuration with DurableTrainable (#16739)

Logger fixes. W&B: #16806, #16674, #16839. MLflow: #16840

Various bug fixes: #16844, #16017, #16575, #16675, #16504, #15811, #15899, #16128, #16396, #16695, #16611

📖 Documentation and testing:

Use BayesOpt for quick start example (#16997)

#16793, #16029, #15932, #16980, #16450, #16709, #15913, #16754, #16619

SGD

🎉 New Features:

Torch native mixed precision is now supported! (#16382)

🔨 Fixes:

Use target label count for training batch size (#16400)

📖 Documentation and testing:

#15999, #16111, #16301, #16046

Serve

💫 Enhancements: UX improvements (#16227, #15909), Improved logging (#16468) 🔨 Fixes: Fix shutdown logic (#16524), Assorted bug fixes (#16647, #16760, #16783) 📖 Documentation and testing: #16042, #16631, #16759, #16786

Thanks

Many thanks to all who contributed to this release:

@Tonyhao96, @simon-mo, @scv119, @Yard1, @llan-ml, @xcharleslin, @jovany-wang, @ijrsvt, @max0x7ba, @annaluo676, @rajagurunath, @zuston, @amogkam, @yorickvanzweeden, @mxz96102, @chenk008, @Bam4d, @mGalarnyk, @kfstorm, @crdnb, @suquark, @ericl, @marload, @jiaodong, @thexiang, @ellimac54, @qicosmos, @mwtian, @jkterry1, @sven1977, @howardlau1999, @mvindiola1, @stefanbschneider, @juliusfrost, @krfricke, @matthewdeng, @zhuangzhuang131419, @brandonJY, @Eleven1Liu, @nikitavemuri, @richardliaw, @iycheng, @stephanie-wang, @HuangLED, @clarkzinzow, @fyrestone, @asm582, @qingyun-wu, @ckw017, @yncxcw, @DmitriGekhtman, @benjamindkilleen, @Chong-Li, @kathryn-zhou, @pcmoritz, @rodrigodelazcano, @edoakes, @dependabot[bot], @pdames, @frenkowski, @loicsacre, @gabrieleoliaro, @achals, @thomasjpfan, @rkooo567, @dibgerge, @clay4444, @architkulkarni, @lixin-wei, @ConeyLiu, @WangTaoTheTonic, @AnnaKosiorek, @wuisawesome, @gramhagen, @zhisbug, @franklsf95, @vakker, @jenhaoyang, @liuyang-my, @chaokunyang, @SongGuyang, @tgaddair
Source code(tar.gz)
Source code(zip)
ray-1.4.1(Jun 30, 2021)
Release 1.4.1 Notes

Ray Python Wheels

Python 3.9 wheels (Linux / MacOS / Windows) are available (#16347 #16586)

Ray Autoscaler

🔨 Fixes: On-prem bug resolved (#16281)

Ray Client

💫Enhancements:

Add warnings when many tasks scheduled (#16454)

Better error messages (#16163)

🔨 Fixes:

Fix gRPC Timeout Options (#16554)

Disconnect on dataclient error (#16588)

Ray Core

🔨 Fixes:

Runtime Environments

Docs (#16290)

Bug fixes (#16475, #16535, #16378)

Logging improvement (#16516)

Fix race condition leading to failed imports #16278

Don't broadcast empty resources data (#16104)

Fix async actor lost object bug (#16414)

Always report job timestamps in milliseconds (#16455, #16545, #16548)

Multi-node placement group and job config bug fixes (#16345)

Fix bug in task dependency management for duplicate args (#16365)

Unify Python and core worker ids (#16712)

Dask

💫Enhancements: Dask 2021.06.1 support (#16547)

Tune

💫Enhancements: Support object refs in with_params (#16753)

Serve

🔨Fixes: Ray serve shutdown goes through Serve controller (#16524)

Java

🔨Fixes: Upgrade dependencies to fix CVEs (#16650, #16657)

Documentation

Runtime Environments (#16290)

Feature contribution [Tune] (#16477)

Ray design patterns and anti-patterns (#16478)

PyTorch Lightning (#16484)

Ray Client (#16497)

Ray Deployment (#16538)

Dask version compatibility (#16595)

CI

Move wheel and Docker image upload from Travis to Buildkite (#16138 #16241)

Thanks

Many thanks to all those who contributed to this release!

@rkooo567, @clarkzinzow, @WangTaoTheTonic, @ckw017, @stephanie-wang, @Yard1, @mwtian, @jovany-wang, @jiaodong, @wuisawesome, @krfricke, @architkulkarni, @ijrsvt, @simon-mo, @DmitriGekhtman, @amogkam, @richardliaw
Source code(tar.gz)
Source code(zip)
ray-1.4.0(Jun 7, 2021)
Release 1.4.0 Notes

Ray Autoscaler

🎉 New Features:

Support Helm Chart for deploying Ray on Kubernetes

Key Autoscaler metrics are now exported via Prometheus!

💫Enhancements

Better error messages when a node fails to come online

🔨 Fixes:

Stability and interface fixes for Kubernetes deployments.

Fixes to Azure NodeProvider

Ray Client

🎉 New Features:

Complete API parity with non-client mode

Experimental ClientBuilder API (docs here)

Full Asyncio support

💫Enhancements

Keep Alive for Messages for long lived connections

Improved pickling error messages

🔨 Fixes:

Client Disconnect can be called multiple times

Client Reference Equality Check

Many bug fixes and tests for the complete ray API!

Ray Core

🎉 New Features:

Namespaces (check out the docs)! Note: this may be a breaking change if you’re using detached actors (set ray.init(namespace=””) for backwards compatible behavior).

🔨 Fixes:

Support increment by arbitrary number with ray.util.metrics.Counter

Various bug fixes for the placement group APIs including the GPU assignment bug (#15049).

🏗 Architecture refactoring:

Increase the efficiency and robustness of resource reporting

Ray Data Processing

🔨 Fixes:

Various bug fixes for better stability (#16063, #14821, #15669, #15757, #15431, #15426, #15034, #15071, #15070, #15008, #15955)

Fixed a critical bug where the driver uses excessive memory usage when there are many objects in the cluster (#14322).

Dask on Ray and Modin can now be run with Ray client

🏗 Architecture refactoring:

Ray 100TB shuffle results: https://github.com/ray-project/ray/issues/15770

More robust memory management subsystem is in progress (#15157, #15027)

RLlib

🎉 New Features:

PyTorch multi-GPU support (#14709, #15492, #15421).

CQL TensorFlow support (#15841).

Task-settable Env/Curriculum Learning API (#15740).

Support for native tf.keras Models (no ModelV2 required) (#14684, #15273).

Trainer.train() and Trainer.evaluate() can run in parallel (optional) (#15040, #15345).

💫Enhancements and documentation:

CQL: Bug fixes and confirmed MuJoCo benchmarks (#15814, #15603, #15761).

Example for differentiable neural computer (DNC) network (#14844, 15939).

Added support for int-Box action spaces. (#15012)

DDPG/TD3/A[23]C/MARWIL/BC: Code cleanup and type annotations. (#14707).

Example script for restoring 1 agent out of n

Examples for fractional GPU usage. (15334)

Enhanced documentation page describing example scripts and blog posts (15763).

Various enhancements/test coverage improvements: 15499, 15454, 15335, 14865, 15525, 15290, 15611, 14801, 14903, 15735, 15631,

🔨 Fixes:

Memory Leak in multi-agent environment (#15815). Shoutout to Bam4d!

DDPG PyTorch GPU bug. (#16133)

Simple optimizer should not be used by default for tf+MA (#15365)

Various bug fixes: #15762, 14843, 15042, 15427, 15871, 15132, 14840, 14386, 15014, 14737, 15015, 15733, 15737, 15736, 15898, 16118, 15020, 15218, 15451, 15538, 15610, 15326, 15295, 15762, 15436, 15558, 15937

🏗 Architecture refactoring:

Remove atari dependency (#15292).

Trainer._evaluate() renamed to Trainer.evaluate() (backward compatible); Trainer.evaluate() can be called even w/o evaluation worker set, if create_env_on_driver=True (#15591).

Tune

🎉 New Features:

ASHA scheduler now supports save/restore. (#15438)

Add HEBO to search algorithm shim function (#15468)

Add SkoptSearcher/Bayesopt Searcher restore functionality (#15075)

💫Enhancements:

We now document scalability best practices (k8s, scalability thresholds). You can find this here (#14566)

You can now set the result buffer_length via tune.run - this helps with trials that report too frequently. (#15810)

Support numpy types in TBXlogger (#15760)

Add max_concurrent option to BasicVariantGenerator (#15680)

Add seed parameter to OptunaSearch (#15248)

Improve BOHB/ConfigSpace dependency check (#15064)

🔨Fixes:

Reduce default number of maximum pending trials to max(16, cluster_cpus) (#15628)

Return normalized checkpoint path (#15296)

Escape paths before globbing in TrainableUtil.get_checkpoints_paths (#15368)

Optuna Searcher: Set correct Optuna TrialState on trial complete (#15283)

Fix type annotation in tune.choice (#15038)

Avoid system exit error by using del when cleaning up actors (#15687)

Serve

🎉 New Features:

As of Ray 1.4, Serve has a new API centered around the concept of “Deployments.” Deployments offer a more streamlined API and can be declaratively updated, which should improve both development and production workflows. The existing APIs have not changed from Ray 1.4 and will continue to work until Ray 1.5, at which point they will be removed (see the package reference if you’re not sure about a specific API). Please see the migration guide for details on how to update your existing Serve application to use this new API.

New serve.deployment API: @serve.deployment, serve.get_deployments, serve.list_deployments (#14935, #15172, #15124, #15121, #14953, #15152, #15821)

New serve.ingress(fastapi_app) API (#15445, 15441, 14858)

New @serve.batch decorator in favor of legacy max_batch_size in backend config (#15065)

serve.start() is now idempotent (#15148)

Added support for handle.method_name.remote() (#14831)

🔨Fixes:

Rolling updates for redeployments (#14803)

Latency improvement by using pickle (#15945)

Controller and HTTP proxy uses num_cpus=0 by default (#15000)

Health checking in the controller instead of using max_restarts (#15047)

Use longest prefix matching for path routing (#15041)

Dashboard

🎉New Features:

Experimental OpenTelemetry support. (#16028,#14872,#15742).

🔨Fixes:

Add object store memory column (#15697)

Add object store stats to dashboard API. (#15677)

Remove disk data from the dashboard when running on K8s. (#14676)

Fix reported dashboard ip when using 0.0.0.0 (#15506)

Thanks

Many thanks to all those who contributed to this release!

@clay4444, @Fabien-Couthouis, @mGalarnyk, @smorad, @ckw017, @ericl, @antoine-galataud, @pleiadesian, @DmitriGekhtman, @robertnishihara, @Bam4d, @fyrestone, @stephanie-wang, @kfstorm, @wuisawesome, @rkooo567, @franklsf95, @micahtyong, @WangTaoTheTonic, @krfricke, @hegdeashwin, @devin-petersohn, @qicosmos, @edoakes, @llan-ml, @ijrsvt, @richardliaw, @Sertingolix, @ffbin, @simjay, @AmeerHajAli, @simon-mo, @tom-doerr, @sven1977, @clarkzinzow, @mxz96102, @SebastianBo1995, @amogkam, @iycheng, @sumanthratna, @Catch-Bull, @pcmoritz, @architkulkarni, @stefanbschneider, @tgaddair, @xcharleslin, @cthoyt, @fcardoso75, @Jeffwan, @mvindiola1, @michaelzhiluo, @rlan, @mwtian, @SongGuyang, @YeahNew, @kathryn-zhou, @rfali, @jennakwon06, @Yeachan-Heo
Source code(tar.gz)
Source code(zip)
ray-1.3.0(Apr 22, 2021)
Release v1.3.0 Notes

Highlights

We are now testing and publishing Ray's scalability limits with each release, see: https://github.com/ray-project/ray/tree/releases/1.3.0/benchmarks

Ray Client is now usable by default with any Ray cluster started by the Ray Cluster Launcher.

Ray Cluster Launcher

💫Enhancements:

Observability improvements (#14816, #14608)

Worker nodes no longer killed on autoscaler failure (#14424)

Better validation for min_workers and max_workers (#13779)

Auto detect memory resource for AWS and K8s (#14567)

On autoscaler failure, propagate error message to drivers (#14219)

Avoid launching GPU nodes when the workload only has CPU tasks (#13776)

Autoscaler/GCS compatibility (#13970, #14046, #14050)

Testing (#14488, #14713)

Migration of configs to multi-node-type format (#13814, #14239)

Better config validation (#14244, #13779)

Node-type max workers defaults infinity (#14201)

🔨 Fixes:

AWS configuration (#14868, #13558, #14083, #13808)

GCP configuration (#14364, #14417)

Azure configuration (#14787, #14750, #14721)

Kubernetes (#14712, #13920, #13720, #14773, #13756, #14567, #13705, #14024, #14499, #14593, #14655)

Other (#14112, #14579, #14002, #13836, #14261, #14286, #14424, #13727, #13966, #14293, #14293, #14718, #14380, #14234, #14484)

Ray Client

💫Enhancements:

Version checks for Python and client protocol (#13722, #13846, #13886, #13926, #14295)

Validate server port number (#14815)

Enable Ray client server by default (#13350, #13429, #13442)

Disconnect ray upon client deactivation (#13919)

Convert Ray objects to Ray client objects (#13639)

Testing (#14617, #14813, #13016, #13961, #14163, #14248, #14630, #14756, #14786)

Documentation (#14422, #14265)

🔨 Fixes:

Hook runtime context (#13750)

Fix mutual recursion (#14122)

Set gRPC max message size (#14063)

Monitor stream errors (#13386)

Fix dependencies (#14654)

Fix ray.get ctrl-c (#14425)

Report error deserialization errors (#13749)

Named actor refcounting fix (#14753)

RayTaskError serialization (#14698)

Multithreading fixes (#14701)

Ray Core

🎉 New Features:

We are now testing and publishing Ray's scalability limits with each release. Check out https://github.com/ray-project/ray/tree/releases/1.3.0/benchmarks.

[alpha] Ray-native Python-based collective communication primitives for Ray clusters with distributed CPUs or GPUs.

🔨 Fixes:

Ray is now using c++14.

Fixed high CPU breaking raylets with heartbeat missing errors (#13963, #14301)

Fixed high CPU issues from raylet during object transfer (#13724)

Improvement in placement group APIs including better Java support (#13821, #13858, #13582, #15049, #13821)

Ray Data Processing

🎉 New Features:

Object spilling is turned on by default. Check out the documentation.

Dask-on-Ray and Spark-on-Ray are fully ready to use. Please try them out and give us feedback!

Dask-on-Ray is now compatible with Dask 2021.4.0.

Dask-on-Ray now works natively with dask.persist().

🔨 Fixes:

Various improvements in object spilling and memory management layer to support large scale data processing (#13649, #14149, #13853, #13729, #14222, #13781, #13737, #14288, #14578, #15027)

lru_evict flag is now deprecated. Recommended solution now is to use object spilling.

🏗 Architecture refactoring:

Various architectural improvements in object spilling and memory management. For more details, check out the whitepaper.

Locality-aware scheduling is turned on by default.

Moved from centralized GCS-based object directory protocol to decentralized owner-to-owner protocol, yielding better cluster scalability.

RLlib

🎉 New Features:

R2D2 implementation for torch and tf. (#13933)

PlacementGroup support (all RLlib algos now return PlacementGroupFactory from Trainer.default_resource_request). (#14289)

Multi-GPU support for tf-DQN/PG/A2C. (#13393)

💫Enhancements:

Documentation: Update documentation for Curiosity's support of continuous actions (#13784); CQL documentation (#14531)

Attention-wrapper works with images and supports prev-n-actions/rewards options. (#14569)

rllib rollout runs in parallel by default via Trainer’s evaluation worker set. (#14208)

Add env rendering (customizable) and video recording options (for non-local mode; >0 workers; +evaluation-workers) and episode media logging. (#14767, #14796)

Allow SAC to use custom models as Q- or policy nets and deprecate "state-preprocessor" for image spaces. (#13522)

Example Scripts: Add coin game env + matrix social dilemma env + tests and examples (shoutout to Maxime Riché!). (#14208); Attention net (#14864); Serve + RLlib. (#14416); Env seed (#14471); Trajectory view API (enhancements and tf2 support). (#13786); Tune trial + checkpoint selection. (#14209)

DDPG: Add support for simplex action space. (#14011)

Others: on_learn_on_batch callback allows custom metrics. (#13584); Add TorchPolicy.export_model(). (#13989)

🔨 Fixes:

Trajectory View API bugs (#13646, #14765, #14037, #14036, #14031, #13555)

Test cases (#14620, #14450, #14384, #13835, #14357, #14243)

Others (#13013, #14569, #13733, #13556, #13988, #14737, #14838, #15272, #13681, #13764, #13519, #14038, #14033, #14034, #14308, #14243)

🏗 Architecture refactoring:

Remove all non-trajectory view API code. (#14860)

Obsolete UsageTrackingDict in favor of SampleBatch. (#13065)

Tune

🎉 New Features:

We added a new searcher HEBOSearcher (#14504, #14246, #13863, #14427)

Tune is now natively compatible with the Ray Client (#13778, #14115, #14280)

Tune now uses Ray’s Placement Groups underneath the hood. This will enable much faster autoscaling and training (for distributed trials) (#13906, #15011, #14313)

💫Enhancements:

Checkpointing improvements (#13376, #13767)

Optuna Search Algorithm improvements (#14731, #14387)

tune.with_parameters now works with Class API (#14532)

🔨Fixes:

BOHB & Hyperband fixes (#14487, #14171)

Nested metrics improvements (#14189, #14375, #14379)

Fix non-deterministic category sampling (#13710)

Type hints (#13684)

Documentation (#14468, #13880, #13740)

Various issues and bug fixes (#14176, #13939, #14392, #13812, #14781, #14150, #14850, #14118, #14388, #14152, #13825, #13936)

SGD

Add fault tolerance during worker startup (#14724)

Serve

🎉 New Features:

Added metadata to default logger in backend replicas (#14251)

Added more metrics for ServeHandle stats (#13640)

Deprecated system-level batching in favor of @serve.batch (#14610, #14648)

Beta support for Serve with Ray client (#14163)

Use placement groups to bypass autoscaler throttling (#13844)

Deprecate client-based API in favor of process-wide singleton (#14696)

Add initial support for FastAPI ingress (#14754)

🔨 Fixes:

Fix ServeHandle serialization (#13695)

🏗 Architecture refactoring:

Refactor BackendState to support backend versioning and add more unit testing (#13870, #14658, #14740, #14748)

Optimize long polling to be per-key (#14335)

Dashboard

🎉 New Features:

Dashboard now supports being served behind a reverse proxy. (#14012)

Disk and network metrics are added to prometheus. (#14144)

💫Enhancements:

Better CPU & memory information on K8s. (#14593, #14499)

Progress towards a new scalable dashboard. (#13790, #11667, #13763,#14333)

Thanks

Many thanks to all those who contributed to this release: @geraint0923, @iycheng, @yurirocha15, @brian-yu, @harryge00, @ijrsvt, @wumuzi520, @suquark, @simon-mo, @clarkzinzow, @RaphaelCS, @FarzanT, @ob, @ashione, @ffbin, @robertnishihara, @SongGuyang, @zhe-thoughts, @rkooo567, @Ezra-H, @acxz, @clay4444, @QuantumMecha, @jirkafajfr, @wuisawesome, @Qstar, @guykhazma, @devin-petersohn, @jeroenboeye, @ConeyLiu, @dependabot[bot], @fyrestone, @micahtyong, @javi-redondo, @Manuscrit, @mxz96102, @EscapeReality846089495, @WangTaoTheTonic, @stanislav-chekmenev, @architkulkarni, @Yard1, @tchordia, @zhisbug, @Bam4d, @niole, @yiranwang52, @thomasjpfan, @DmitriGekhtman, @gabrieleoliaro, @jparkerholder, @kfstorm, @andrew-rosenfeld-ts, @erikerlandson, @Crissman, @raulchen, @sumanthratna, @Catch-Bull, @chaokunyang, @krfricke, @raoul-khour-ts, @sven1977, @kathryn-zhou, @AmeerHajAli, @jovany-wang, @amogkam, @antoine-galataud, @tgaddair, @randxie, @ChaceAshcraft, @ericl, @cassidylaidlaw, @TanjaBayer, @lixin-wei, @lena-kashtelyan, @cathrinS, @qicosmos, @richardliaw, @rmsander, @jCrompton, @mjschock, @pdames, @barakmich, @michaelzhiluo, @stephanie-wang, @edoakes
Source code(tar.gz)
Source code(zip)
ray-1.2.0(Feb 13, 2021)
Release v1.2.0 Notes

Highlights

Ray client is now in beta! Check out more details here: https://docs.ray.io/en/master/ray-client.html XGBoost-Ray is now in beta! Check out more details about this project at https://github.com/ray-project/xgboost_ray.

Check out the Serve migration guide: https://docs.google.com/document/d/1CG4y5WTTc4G_MRQGyjnb_eZ7GK3G9dUX6TNLKLnKRAc/edit

Ray’s C++ support is now in beta: https://docs.ray.io/en/master/#getting-started-with-ray

An alpha version of object spilling is now available: https://docs.ray.io/en/master/memory-management.html#object-spilling

Ray Autoscaler

🎉 New Features:

A new autoscaler output format in monitor.log (#12772, #13561)

Piping autoscaler events to driver logs (#13434)

💫Enhancements

Full support of ray.autoscaler.sdk.request_resources() API (https://docs.ray.io/en/master/cluster/autoscaling.html?highlight=request_resources#ray.autoscaler.sdk.request_resources) .

Make placement groups bypass max launch limit (#13089)

[K8s] Retry getting home directory in command runner. (#12925)

[docker] Pull if image is not present (#13136)

[Autoscaler] Ensure ubuntu is owner of docker host mount folder (#13579)

🔨 Fixes:

Many autoscaler bug fixes (#12952, #12689, #13058, #13671, #13637, #13588, #13505, #13154, #13151, #13138, #13008, #12980, #12918, #12829, #12714, #12661, #13567, #13663, #13623, #13437, #13498, #13472, #13392, #12514, #13325, #13161, #13129, #12987, #13410, #12942, #12868, #12866, #12865, #12098, #12609)

RLLib

🎉 New Features:

Fast Attention Nets (using the trajectory view API) (#12753).

Attention Nets: Full PyTorch support (#12029).

Attention Nets: Support auto-wrapping around default- or custom models by specifying “use_attention=True” in the model’s config. * * * This works completely analogously now to “use_lstm=True”. (#11698)

New Offline RL Algorithm: CQL (based on SAC) (#13118).

MAML: Discrete actions support (added CartPole mass test case).

Support Atari framestacking via the trajectory view API (#13315).

Support for D4RL environments/benchmarks (#13550).

Preliminary work on JAX support (#13077, #13091).

💫 Enhancements:

Rollout lengths: Allow unit to be configured as “agent_steps” in multi-agent settings (default: “env_steps”) (#12420).

TFModelV2: Soft-deprecate register_variables and unify var names wrt TorchModelV2 (#13339, #13363).

📖 Documentation:

Added documentation on Model building API (#13260, #13261).

Added documentation for the trajectory view API. (#12718)

Added documentation for SlateQ (#13266).

Readme.md documentation for almost all algorithms in rllib/agents (#12943, #13035).

Type annotations for the “rllib/execution” folder (#12760, #13036).

🔨 Fixes:

MARWIL and BC: Add grad-clipping config option to stabilize learning (#13455).

A3C: Solve PyTorch- and TF-eager async race condition between calling model and its value function (#13467).

Various issues- and bug fixes (#12619, #12682, #12704, #12706, #12708, #12765, #12786, #12787, #12793, #12832, #12844, #12846, #12915, #12941, #13039, #13040, #13064, #13083, #13121, #13126, #13237, #13238, #13308, #13332, #13397, #13459, #13553). ###🏗 Architecture refactoring:

Env directory has been cleaned up and is now divided in: Core part (rllib/env) with all basic env classes, and rllib/env/wrappers containing third-party wrapper classes (Atari, Unity3D, etc..) (#13082).

Tune

🎉 New Features:

Ray Tune has updated and improved its integration with MLflow. See this blog post for details (#12840, #13301, #13533)

💫 Enhancements

Ray Tune now uses ray.cloudpickle underneath the hood, allowing you to checkpoint large models (>4GB) (#12958).

Using the 'reuse_actors' flag can now speed up training for general Trainable API usage. (#13549)

Ray Tune will now automatically buffer results from trainables, allowing you to use an arbitrary reporting frequency on your training functions. (#13236)

Ray Tune now has a variety of experiment stoppers (#12750)

Ray Tune now supports an integer loguniform search space distribution (#12994)

Ray Tune now has an initial support for the Ray placement group API. (#13370)

The Weights and Bias integration (WandbLogger) now also accepts wandb.data_types.Video (#13169)

The Hyperopt integration (HyperoptSearch) can now directly accept category variables instead of indices (#12715)

Ray Tune now supports experiment checkpointing when using grid search (#13357)

🔨Fixes and Updates

The Optuna integration was updated to support the 2.4.0 API while maintaining backwards compatibility (#13631)

All search algorithms now support points_to_evaluate (#12790, #12916)

PBT Transformers example was updated and improved (#13174, #13131)

The scikit-optimize integration was improved (#12970)

Various bug fixes (#13423, #12785, #13171, #12877, #13255, #13355)

SGD

🔨Fixes and Updates

Fix Docstring for as_trainable (#13173)

Fix process group timeout units (#12477)

Disable Elastic Training by default when using with Tune (#12927)

Serve

🎉 New Features:

Ray Serve backends now accept a Starlette request object instead of a Flask request object (#12852). This is a breaking change, so please read the migration guide.

Ray Serve backends now have the option of returning a Starlette Response object (#12811, #13328). This allows for more customizable responses, including responses with custom status codes.

[Experimental] The new Ray Serve MLflow plugin makes it easy to deploy your MLflow models on Ray Serve. It comes with a Python API and a command-line interface.

Using “ImportedBackend” you can now specify a backend based on a class that is installed in the Python environment that the workers will run in, even if the Python environment of the driver script (the one making the Serve API calls) doesn’t have it installed (#12923).

💫 Enhancements:

Dependency management using conda no longer requires the driver script to be running in an activated conda environment (#13269).

Ray ObjectRef can now be used as argument to serve_handle.remote(...). (#12592)

Backends are now shut down gracefully. You can set the graceful timeout in BackendConfig. (#13028)

📖 Documentation:

A tutorial page has been added for integrating Ray Serve with your existing FastAPI web server or with your existing AIOHTTP web server (#13127).

Documentation has been added for Ray Serve metrics (#13096).

Source code(tar.gz)
Source code(zip)
ray-1.1.0(Dec 24, 2020)
Ray 1.1.0

Ray Core

🎉 New Features:

Progress towards supporting a Ray client

Descendent tasks are cancelled when the calling task is cancelled

🔨 Fixes:

Improved object broadcast robustness

Improved placement group support

🏗 Architecture refactoring:

Progress towards the new scheduler backend

RLlib

🎉 New Features:

SUMO simulator integration (rllib/examples/simulators/sumo/). Huge thanks to Lara Codeca! (#11710)

SlateQ Algorithm added for PyTorch. Huge thanks to Henry Chen! (#11450)

MAML extension for all Models, except recurrent ones. (#11337)

Curiosity Exploration Module for tf1.x/2.x/eager. (#11945)

Minimal JAXModelV2 example. (#12502)

🔨 Fixes:

Fix RNN learning for tf2.x/eager. (#11720)

LSTM prev-action/prev-reward settable separately and prev-actions are now one-hot’d. (#12397)

PyTorch LR schedule not working. (#12396)

Various PyTorch GPU bug fixes. (#11609)

SAC loss not using prio. replay weights in critic’s loss term. (#12394)

Fix epsilon-greedy Exploration for nested action spaces. (#11453)

🏗 Architecture refactoring:

Trajectory View API on by default (faster PG-type algos by ~20% (e.g. PPO on Atari)). (#11717, #11826, #11747, and #11827)

Tune

🎉 New Features:

Loggers can now be passed as objects to tune.run. The new ExperimentLogger abstraction was introduced for all loggers, making it much easier to configure logging behavior. (#11984, #11746, #11748, #11749)

The tune verbosity was refactored into four levels: 0: Silent, 1: Only experiment-level logs, 2: General trial-level logs, 3: Detailed trial-level logs (default) (#11767, #12132, #12571)

Docker and Kubernetes autoscaling environments are detected automatically, automatically utilizing the correct checkpoint/log syncing tools (#12108)

Trainables can now easily leverage Tensorflow DistributedStrategy! (#11876)

💫 Enhancements

Introduced a new serialization debugging utility (#12142)

Added a new lightweight Pytorch-lightning example (#11497, #11585)

The BOHB search algorithm can be seeded with a random state (#12160)

The default anonymous metrics can be used automatically if a mode is set in tune.run (#12159).

Added HDFS as Cloud Sync Client (#11524)

Added xgboost_ray integration (#12572)

Tune search spaces can now be passed to search algorithms on initialization, not only via tune.run (#11503)

Refactored and added examples (#11931)

Callable accepted for register_env (#12618)

Tune search algorithms can handle/ignore infinite and NaN numbers (#11835)

Improved scalability for experiment checkpointing (#12064)

Nevergrad now supports points_to_evaluate (#12207)

Placement group support for distributed training (#11934)

🔨 Fixes:

Fixed with_parameters behavior to avoid serializing large data in scope (#12522)

TBX logger supports None (#12262)

Better error when metric or mode unset in search algorithms (#11646)

Better warnings/exceptions for fail_fast='raise' (#11842)

Removed some bottlenecks in trialrunner (#12476)

Fix file descriptor leak by syncer and Tensorboard (#12590, #12425)

Fixed validation for search metrics (#11583)

Fixed hyperopt randint limits (#11946)

Serve

🎉 New Features:

You can start backends in different conda environments! See more in the dependency management doc. (#11743)

You can add a optional reconfigure method to your Servable to allow reconfiguring backend replicas at runtime. (#11709)

🔨Fixes:

Set serve.start(http_host=None) to disable HTTP servers. If you are only using ServeHandle, this option lowers resource usage. (#11627)

Flask requests will no longer create reference cycles. This means peak memory usage should be lower for high traffic scenarios. (#12560)

🏗 Architecture refactoring:

Progress towards a goal state driven Serve controller. (#12369,#11792,#12211,#12275,#11533,#11822,#11579,#12281)

Progress towards faster and more efficient ServeHandles. (#11905, #12019, #12093)

Ray Cluster Launcher (Autoscaler)

🎉 New Features:

A new Kubernetes operator: https://docs.ray.io/en/master/cluster/k8s-operator.html

💫 Enhancements

Containers do not run with root user as the default (#11407)

SHM-Size is auto-populated when using the containers (#11953)

🔨 Fixes:

Many autoscaler bug fixes (#11677, #12222, #11458, #11896, #12123, #11820, #12513, #11714, #12512, #11758, #11615, #12106, #11961, #11674, #12028, #12020, #12316, #11802, #12131, #11543, #11517, #11777, #11810, #11751, #12465, #11422)

SGD

🎉 New Features:

Easily customize your torch.DistributedDataParallel configurations by passing in a ddp_args field into TrainingOperator.register (#11771).

🔨 Fixes:

TorchTrainer now properly scales up to more workers if more resources become available (#12562)

📖 Documentation:

The new callback API for using Ray SGD with Tune is now documented (#11479)

Pytorch Lightning + Ray SGD integration is now documented (#12440)

Dashboard

🔨 Fixes:

Fixed bug that prevented viewing the logs for cluster workers

Fixed bug that caused "Logical View" page to crash when opening a list of actors for a given class.

🏗 Architecture refactoring:

Dashboard runs on a new backend architecture that is more scalable and well-tested. The dashboard should work on ~100 node clusters now, and we're working on lifting scalability to constraints to support even larger clusters.

Thanks

Many thanks to all those who contributed to this release: @bartbroere, @SongGuyang, @gramhagen, @richardliaw, @ConeyLiu, @weepingwillowben, @zhongchun, @ericl, @dHannasch, @timurlenk07, @kaushikb11, @krfricke, @desktable, @bcahlit, @rkooo567, @amogkam, @micahtyong, @edoakes, @stephanie-wang, @clay4444, @ffbin, @mfitton, @barakmich, @pcmoritz, @AmeerHajAli, @DmitriGekhtman, @iamhatesz, @raulchen, @ingambe, @allenyin55, @sven1977, @huyz-git, @yutaizhou, @suquark, @ashione, @simon-mo, @raoul-khour-ts, @Leemoonsoo, @maximsmol, @alanwguo, @kishansagathiya, @wuisawesome, @acxz, @gabrieleoliaro, @clarkzinzow, @jparkerholder, @kingsleykuan, @InnovativeInventor, @ijrsvt, @lasagnaphil, @lcodeca, @jiajiexiao, @heng2j, @wumuzi520, @mvindiola1, @aaronhmiller, @robertnishihara, @WangTaoTheTonic, @chaokunyang, @nikitavemuri, @kfstorm, @roireshef, @fyrestone, @viotemp1, @yncxcw, @karstenddwx, @hartikainen, @sumanthratna, @architkulkarni, @michaelzhiluo, @UWFrankGu, @oliverhu, @danuo, @lixin-wei
Source code(tar.gz)
Source code(zip)
ray-1.0.1.post1(Nov 19, 2020)
Patch release containing the following changes:

https://github.com/ray-project/ray/commit/bcc92f59fdcd837ccc5a560fe37bdf0619075505 Fix dashboard crashing on multi-node clusters.

https://github.com/ray-project/ray/pull/11600 Add the cluster_name to docker file mounts directory prefix.

Source code(tar.gz)
Source code(zip)
ray-1.0.1(Nov 10, 2020)
Ray 1.0.1

Ray 1.0.1 is now officially released!

Highlights

If you're migrating from Ray < 1.0.0, be sure to check out the 1.0 Migration Guide.

Autoscaler is now docker by default.

RLLib features multiple new environments.

Tune supports population based bandits, checkpointing in Docker, and multiple usability improvements.

SGD supports PyTorch Lightning

All of Ray's components and libraries have improved performance, scalability, and stability.

Core

1.0 Migration Guide.

Many bug fixes and optimizations in GCS.

Polishing of the Placement Group API.

Improved Java language support

RLlib

Added documentation for Curiosity exploration module (#11066).

Added RecSym environment wrapper (#11205).

Added Kaggle’s football environment (multi-agent) wrapper (#11249).

Multiple bug fixes: GPU related fixes for SAC (#11298), MARWIL, all example scripts run on GPU (#11105), lifted limitation on 2^31 timesteps (#11301), fixed eval workers for ES and ARS (#11308), fixed broken no-eager-no-workers mode (#10745).

Support custom MultiAction distributions (#11311).

No environment is created on driver (local worker) if not necessary (#11307).

Added simple SampleCollector class for Trajectory View API (#11056).

Code cleanup: Docstrings and type annotations for Exploration classes (#11251), DQN (#10710), MB-MPO algorithm, SAC algorithm (#10825).

Serve

API: Serve will error when serve_client is serialized. (#11181)

Performance: serve_client.get_handle("endpoint") will now get a handle to nearest node, increasing scalability in distributed mode. (#11477)

Doc: Added FAQ page and updated architecture page (#10754, #11258)

Testing: New distributed tests and benchmarks are added (#11386)

Testing: Serve now run on Windows (#10682)

SGD

Pytorch Lightning integration is now supported (#11042)

Support num_steps continue training (#11142)

Callback API for SGD+Tune (#11316)

Tune

New Algorithm: Population-based Bandits (#11466)

tune.with_parameters(), a wrapper function to pass arbitrary objects through the object store to trainables (#11504)

Strict metric checking - by default, Tune will now error if a result dict does not include the optimization metric as a key. You can disable this with TUNE_DISABLE_STRICT_METRIC_CHECKING (#10972)

Syncing checkpoints between multiple Docker containers on a cluster is now supported with the DockerSyncer (#11035)

Added type hints (#10806)

Trials are now dynamically created (instead of created up front) (#10802)

Use tune.is_session_enabled() in the Function API to toggle between Tune and non-tune code (#10840)

Support hierarchical search spaces for hyperopt (#11431)

Tune function API now also supports yield and return statements (#10857)

Tune now supports callbacks with tune.run(callbacks=... (#11001)

By default, the experiment directory will be dated (#11104)

Tune now supports reuse_actors for function API, which can largely accelerate tuning jobs.

Thanks

We thank all the contributors for their contribution to this release!

@acxz, @Gekho457, @allenyin55, @AnesBenmerzoug, @michaelzhiluo, @SongGuyang, @maximsmol, @WangTaoTheTonic, @Basasuya, @sumanthratna, @juliusfrost, @maxco2, @Xuxue1, @jparkerholder, @AmeerHajAli, @raulchen, @justinkterry, @herve-alanaai, @richardliaw, @raoul-khour-ts, @C-K-Loan, @mattearllongshot, @robertnishihara, @internetcoffeephone, @Servon-Lee, @clay4444, @fangyeqing, @krfricke, @ffbin, @akotlar, @rkooo567, @chaokunyang, @PidgeyBE, @kfstorm, @barakmich, @amogkam, @edoakes, @ashione, @jseppanen, @ttumiel, @desktable, @pcmoritz, @ingambe, @ConeyLiu, @wuisawesome, @fyrestone, @oliverhu, @ericl, @weepingwillowben, @rkube, @alanwguo, @architkulkarni, @lasagnaphil, @rohitrawat, @ThomasLecat, @stephanie-wang, @suquark, @ijrsvt, @VishDev12, @Leemoonsoo, @scottwedge, @sven1977, @yiranwang52, @carlos-aguayo, @mvindiola1, @zhongchun, @mfitton, @simon-mo
Source code(tar.gz)
Source code(zip)
ray-1.0.0(Sep 30, 2020)
Ray 1.0

We're happy to announce the release of Ray 1.0, an important step towards the goal of providing a universal API for distributed computing.

To learn more about Ray 1.0, check out our blog post and whitepaper.

Ray Core

The ray.init() and ray start commands have been cleaned up to remove deprecated arguments

The Ray Java API is now stable

Improved detection of Docker CPU limits

Add support and documentation for Dask-on-Ray and MARS-on-Ray: https://docs.ray.io/en/master/ray-libraries.html

Placement groups for fine-grained control over scheduling decisions: https://docs.ray.io/en/latest/placement-group.html.

New architecture whitepaper: https://docs.ray.io/en/master/whitepaper.html

Autoscaler

Support for multiple instance types in the same cluster: https://docs.ray.io/en/master/cluster/autoscaling.html

Support for specifying GPU/accelerator type in @ray.remote

Dashboard & Metrics

Improvements to the memory usage tab and machine view

The dashboard now supports visualization of actor states

Support for Prometheus metrics reporting: https://docs.ray.io/en/latest/ray-metrics.html

RLlib

Two Model-based RL algorithms were added: MB-MPO (“Model-based meta-policy optimization”) and “Dreamer”. Both algos were benchmarked and are performing comparably to the respective papers’ reported results.

A “Curiosity” (intrinsic motivation) module was added via RLlib’s Exploration API and benchmarked on a sparse-reward Unity3D environment (Pyramids).

Added documentation for the Distributed Execution API.

Removed (already soft-deprecated) APIs: Model(V1) class, Trainer config keys, some methods/functions. Where you would see a warning previously when using these, there will be an error thrown now.

Added DeepMind Control Suite examples.

Tune

Breaking changes:

Multiple tune.run parameters have been deprecated: ray_auto_init, run_errored_only, global_checkpoint_period, with_server (#10518)

tune.run(upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint have been moved to tune.SyncConfig [docs] (#10518)

New APIs:

mode, metric, time_budget parameters for tune.run (#10627, #10642)

Search Algorithms now share a uniform API: (#10621, #10444). You can also use the new create_scheduler/create_searcher shim layer to create search algorithms/schedulers via string, reducing boilerplate code (#10456).

Native callbacks for: MXNet, Horovod, Keras, XGBoost, PytorchLightning (#10533, #10304, #10509, #10502, #10220)

PBT runs can be replayed with PopulationBasedTrainingReplay scheduler (#9953)

Search Algorithms are saved/resumed automatically (#9972)

New Optuna Search Algorithm docs (#10044)

Tune now can sync checkpoints across Kubernetes pods (#10097)

Failed trials can be rerun with tune.run(resume="run_errored_only") (#10060)

Other Changes:

Trial outputs can be saved to file via tune.run(log_to_file=...) (#9817)

Trial directories can be customized, and default trial directory now includes trial name (#10608, #10214)

Improved Experiment Analysis API (#10645)

Support for Multi-objective search via SigOpt Wrapper (#10457, #10446)

BOHB Fixes (#10531, #10320)

Wandb improvements + RLlib compatibility (#10950, #10799, #10680, #10654, #10614, #10441, #10252, #8521)

Updated documentation for FAQ, Tune+serve, search space API, lifecycle (#10813, #10925, #10662, #10576, #9713, #10222, #10126, #9908)

RaySGD:

Creator functions are subsumed by the TrainingOperator API (#10321)

Training happens on actors by default (#10539)

Serve

serve.client API makes it easy to appropriately manage lifetime for multiple Serve clusters. (#10460)

Serve APIs are fully typed. (#10205, #10288)

Backend configs are now typed and validated via Pydantic. (#10559, #10389)

Progress towards application level backend autoscaler. (#9955, #9845, #9828)

New architecture page in documentation. (#10204)

Thanks

We thank all the contributors for their contribution to this release!

@MissiontoMars, @ijrsvt, @desktable, @kfstorm, @lixin-wei, @Yard1, @chaokunyang, @justinkterry, @pxc, @ericl, @WangTaoTheTonic, @carlos-aguayo, @sven1977, @gabrieleoliaro, @alanwguo, @aryairani, @kishansagathiya, @barakmich, @rkube, @SongGuyang, @qicosmos, @ffbin, @PidgeyBE, @sumanthratna, @yushan111, @juliusfrost, @edoakes, @mehrdadn, @Basasuya, @icaropires, @michaelzhiluo, @fyrestone, @robertnishihara, @yncxcw, @oliverhu, @yiranwang52, @ChuaCheowHuan, @raphaelavalos, @suquark, @krfricke, @pcmoritz, @stephanie-wang, @hekaisheng, @zhijunfu, @Vysybyl, @wuisawesome, @sanderland, @richardliaw, @simon-mo, @janblumenkamp, @zhuohan123, @AmeerHajAli, @iamhatesz, @mfitton, @noahshpak, @maximsmol, @weepingwillowben, @raulchen, @09wakharet, @ashione, @henktillman, @architkulkarni, @rkooo567, @zhe-thoughts, @amogkam, @kisuke95, @clarkzinzow, @holli, @raoul-khour-ts
Source code(tar.gz)
Source code(zip)
ray-0.8.7(Aug 13, 2020)
Highlight

Ray is moving towards 1.0! It has had several important naming changes.

ObjectIDs are now called ObjectRefs because they are not just IDs.

The Ray Autoscaler is now called the Ray Cluster Launcher. The autoscaler will be a module of the Ray Cluster Launcher.

The Ray Cluster Launcher now has a much cleaner and concise output style. Try it out with ray up --log-new-style. The new output style will be enabled by default (with opt-out) in a later release.

Windows is now officially supported by RLlib. Multi node support for Windows is still in progress.

Cluster Launcher/CLI (formerly autoscaler)

Highlight: This release contains a new colorful, concise output style for ray up and ray down, available with the --log-new-style flag. It will be enabled by default (with opt-out) in a later release. Full output style coverage for Cluster Launcher commands will also be available in a later release. (#9322, #9943, #9960, #9690)

Documentation improvements (with guides and new sections) (#9687

Improved Cluster launcher docker support (#9001, #9105, #8840)

Ray now has Docker images available on Docker hub. Please check out the ray image (#9732, #9556, #9458, #9281)

Azure improvements (#8938)

Improved on-prem cluster autoscaler (#9663)

Add option for continuous sync of file mounts (#9544)

Add ray status debug tool and ray --version (#9091, #8886).

ray memory now also supports redis_password (#9492)

Bug fixes for the Kubernetes cluster launcher mode (#9968)

Various improvements: disabling the cluster config cache (#8117), Python API requires keyword arguments (#9256), removed fingerprint checking for SSH (#9133), Initial support for multiple worker types (#9096), various changes to the internal node provider interface (#9340, #9443)

Core

Support Python type checking for Ray tasks (#9574)

Rename ObjectID => ObjectRef (#9353)

New GCS Actor manager on by default (#8845, #9883, #9715, #9473, #9275)

Worker towards placement groups (#9039)

Plasma store process is merged with raylet (#8939, #8897)

Option to automatically reconstruct objects stored in plasma after a failure. See the documentation for more information. (#9394, #9557, #9488)

Many bug fixes.

RLlib

New algorithm: “Model-Agnostic Meta-Learning” (MAML). An algo that learns and generalizes well across a distribution of environments.

New algorithm: “Model-Based Meta-Policy-Optimization” (MB-MPO). Our first model-based RL algo.

Windows is now officially supported by RLlib.

Native TensorFlow 2.x support. Use framework=”tf2” in your config to tap into TF2’s full potential. Also: SAC, DDPG, DQN Rainbow, ES, and ARS now run in TF1.x Eager mode.

DQN PyTorch support for full Rainbow setup (including distributional DQN).

Python type hints for Policy, Model, Offline, Evaluation, and Env classes.

Deprecated “Policy Optimizer” package (in favor of new distributed execution API).

Enhanced test coverage and stability.

Flexible multi-agent replay modes and replay_sequence_length. We now allow a) storing sequences (over time) in replay buffers and retrieving “lock-stepped” multi-agent samples.

Environments: Unity3D soccer game (tuned example/benchmark) and DM Control Suite wrapper and examples.

Various Bug fixes: QMIX not learning, DDPG torch bugs, IMPALA learning rate updates, PyTorch custom loss, PPO not learning MuJoCo due to action clipping bug, DQN w/o dueling layer error.

Tune

API Changes:

The Tune Function API now supports checkpointing and is now usable with all search and scheduling algorithms! (#8471, #9853, #9517)

The Trainable class API has renamed many of its methods to be public (#9184)

You can now stop experiments upon convergence with Bayesian Optimization (#8808)

DistributedTrainableCreator, a simple wrapper for distributed parameter tuning with multi-node DistributedDataParallel models (#9550, #9739)

New integration and tutorial for using Ray Tune with Weights and Biases (Logger and native API) (#9725)

Tune now provides a Scikit-learn compatible wrapper for hyperparameter tuning (#9129)

New tutorials for integrations like XGBoost (#9060), multi GPU PyTorch (#9338), PyTorch Lightning (#9151, #9451), and Huggingface-Transformers (#9789)

CLI Progress reporting improvements (#8802, #9537, #9525)

Various bug fixes: handling of NaN values (#9381), Tensorboard logging improvements (#9297, #9691, #8918), enhanced cross-platform compatibility (#9141), re-structured testing (#9609), documentation reorganization and versioning (#9600, #9427, #9448)

RaySGD

Variable worker CPU requirements (#8963)

Simplified cuda visible device setting (#8775)

Serve

Horizontal scalability: Serve will now start one HTTP server per Ray node. (#9523)

Various performance improvement matching Serve to FastAPI (#9490,#8709, #9531, #9479 ,#9225, #9216, #9485)

API changes

serve.shadow_traffic(endpoint, backend, fraction) duplicates and sends a fraction of the incoming traffic to a specific backend. (#9106)

serve.shutdown() cleanup the current Serve instance in Ray cluster. (#8766)

Exception will be raised if num_replicas exceeds the maximum resource in the cluster (#9005)

Added doc examples for how to perform metric monitoring and model composition.

Dashboard

Configurable Dashboard Port: The port on which the dashboard will run is now configurable using the argument --dashboard-port and the argument dashboard_port to ray.init

GPU monitoring improvements

For machines with more than one GPU, the GPU and GRAM utilization is now broken out on a per-GPU basis.

Assignments to physical GPUs are now shown at the worker level.

Sortable Machine View: It is now possible to sort the machine view by almost any of its columns by clicking next to the title. In addition, whereas the workers are normally grouped by node, you can now ungroup them if you only want to see details about workers.

Actor Search Bar: It is possible to search for actors by their title now (this is the class name of the actor in python in addition to the arguments it received.)

Logical View UI Updates: This includes things like color-coded names for each of the actor states, a more grid-like layout, and tooltips for the various data.

Sortable Memory View: Like the machine view, the memory view now has sortable columns and can be grouped / ungrouped by node.

Windows Support

Improve GPU detection (#9300)

Work around msgpack issue on PowerPC64LE (#9140)

Others

Ray Streaming Library Improvements (#9240, #8910, #8780)

Java Support Improvements (#9371, #9033, #9037, #9032, #8858, #9777, #9836, #9377)

Parallel Iterator Improvements (#8964, #8978)

Thanks

We thank the following contributors for their work on this release: @jsuarez5341, @amitsadaphule, @krfricke, @williamFalcon, @richardliaw, @heyitsmui, @mehrdadn, @robertnishihara, @gabrieleoliaro, @amogkam, @fyrestone, @mimoralea, @edoakes, @andrijazz, @ElektroChan89, @kisuke95, @justinkterry, @SongGuyang, @barakmich, @bloodymeli, @simon-mo, @TomVeniat, @lixin-wei, @alanwguo, @zhuohan123, @michaelzhiluo, @ijrsvt, @pcmoritz, @LecJackS, @sven1977, @ashione, @JerryLeeCS, @raphaelavalos, @stephanie-wang, @ruifangChen, @vnlitvinov, @yncxcw, @weepingwillowben, @goulou, @acmore, @wuisawesome, @gramhagen, @anabranch, @internetcoffeephone, @Alisahhh, @henktillman, @deanwampler, @p-christ, @Nicolaus93, @WangTaoTheTonic, @allenyin55, @kfstorm, @rkooo567, @ConeyLiu, @09wakharet, @piojanu, @mfitton, @KristianHolsheimer, @AmeerHajAli, @pdames, @ericl, @VishDev12, @suquark, @stefanbschneider, @raulchen, @dcfidalgo, @chappers, @aaarne, @chaokunyang, @sumanthratna, @clarkzinzow, @BalaBalaYi, @maximsmol, @zhongchun, @wumuzi520, @ffbin
Source code(tar.gz)
Source code(zip)
ray-0.8.6(Jun 24, 2020)
Highlight

Experimental support for Windows is now available for single node Ray usage. Check out the Windows section below for known issues and other details.

Have you had troubles monitoring GPU or memory usage while you used Ray? The Ray dashboard now supports the GPU monitoring and a memory view.

Want to use RLlib with Unity? RLlib officially supports the Unity3D adapter! Please check out the documentation.

Ray Serve is ready for feedback! We've gotten feedback from many users, and Ray Serve is already being used in production. Please reach out to us with your use cases, ideas, documentation improvements, and feedback. We'd love to hear from you. Please do so on the Ray Slack and join #serve! Please see the Serve section below for more details.

Core

We’ve introduced a new feature to automatically retry failed actor tasks after an actor has been restarted by Ray (by specifying max_restarts in @ray.remote). Try it out with max_task_retries=-1 where -1 indicates that the system can retry the task until it succeeds.

API Change

To enable automatic restarts of a failed actor, you must now use max_restarts in the @ray.remote decorator instead of max_reconstructions. You can use -1 to indicate infinity, i.e., the system should always restart the actor if it fails unexpectedly.

We’ve merged the named and detached actor APIs. To create an actor that will survive past the duration of its job (a “detached” actor), specify name=<str> in its remote constructor (Actor.options(name='<str>').remote()). To delete the actor, you can use ray.kill.

RLlib

PyTorch: IMPALA PyTorch version and all rllib/examples scripts now work for either TensorFlow or PyTorch (--torch command line option).

Switched to using distributed execution API by default (replaces Policy Optimizers) for all algorithms.

Unity3D adapter (supports all Env types: multi-agent, external env, vectorized) with example scripts for running locally or in the cloud.

Added support for variable length observation Spaces ("Repeated").

Added support for arbitrarily nested action spaces.

Added experimental GTrXL (Transformer/Attention net) support to RLlib + learning tests for PPO and IMPALA.

QMIX now supports complex observation spaces.

API Change

Retire use_pytorch and eager flags in configs and replace these with framework=[tf|tfe|torch].

Deprecate PolicyOptimizers in favor of the new distributed execution API.

Retired support for Model(V1) class. Custom Models should now only use the ModelV2 API. There is still a warning when using ModelV1, which will be changed into an error message in the next release.

Retired TupleActions (in favor of arbitrarily nested action Spaces).

Ray Tune / RaySGD

There is now a Dataset API for handling large datasets with RaySGD. (#7839)

You can now filter by an average of the last results using the ExperimentAnalysis tool (#8445).

BayesOptSearch received numerous contributions, enabling preliminary random search and warm starting. (#8541, #8486, #8488)

API Changes

tune.report is now the right way to use the Tune function API. tune.track is deprecated (#8388)

Serve

New APIs to inspect and manage Serve objects:

serve.list_backends and serve.list_endpoints (#8737)

serve.delete_backend and serve.delete_endpoint (#8252, #8256)

serve.create_endpoint now requires specifying the backend directly. You can remove serve.set_traffic if there's only one backend per endpoint. (#8764)

serve.init API cleanup, the following options were removed:

blocking, ray_init_kwargs, start_server (#8747, #8447, #8620)

serve.init now supports namespacing with name. You can run multiple serve clusters with different names on the same ray cluster. (#8449)

You can specify session affinity when splitting traffic with backends using X-SERVE-SHARD-KEY HTTP header. (#8449)

Various documentation improvements. Highlights:

A new section on how to perform A/B testing and incremental rollout (#8741)

Tutorial for batch inference (#8490)

Instructions for specifying GPUs and resources (#8495)

Dashboard / Metrics

The Machine View of the dashboard now shows information about GPU utilization such as:

Average GPU/GRAM utilization at a node and cluster level

Worker-level information about how many GPUs each worker is assigned as well as its GRAM use.

The dashboard has a new Memory View tab that should be very useful for debugging memory issues. It has:

Information about objects in the Ray object store, including size and call-site

Information about reference counts and what is keeping an object pinned in the Ray object store.

Small changes

IDLE workers get automatically sorted to the end of the worker list in the Machine View

Autoscaler

Improved logging output. Errors are more clearly propagated and excess output has been reduced. (#7198, #8751, #8753)

Added support for k8s services.

API Changes

ray up accepts remote URLs that point to the desired cluster YAML. (#8279)

Windows support

Windows wheels are now available for basic experimental usage (via ray.init()).

Windows support is currently unstable. Unusual, unattended, or production usage is not recommended.

Various functionality may still lack support, including Ray Serve, Ray SGD, the autoscaler, the dashboard, non-ASCII file paths, etc.

Please check the latest nightly wheels & known issues (#9114), and let us know if any issue you encounter has not yet been addressed.

Wheels are available for Python 3.6, 3.7, and 3.8. (#8369)

redis-py has been patched for Windows sockets. (#8386)

Others

Moving towards highly available Ray (#8650, #8639, #8606, #8601, #8591, #8442)

Java Support (#8730, #8640, #8637)

Ray streaming improvements (#8612, #8594, #7464)

Parallel iterator improvements (#8140, #7931, #8712)

Thanks

We thank the following contributors for their work on this release: @pcmoritz, @akharitonov, @devanderhoff, @ffbin, @anabranch, @jasonjmcghee, @kfstorm, @mfitton, @alecbrick, @simon-mo, @konichuvak, @aniryou, @wuisawesome, @robertnishihara, @ramanNarasimhan77, @09wakharet, @richardliaw, @istoica, @ThomasLecat, @sven1977, @ceteri, @acxz, @iamhatesz, @JarnoRFB, @rkooo567, @mehrdadn, @thomasdesr, @janblumenkamp, @ujvl, @edoakes, @maximsmol, @krfricke, @amogkam, @gehring, @ijrsvt, @internetcoffeephone, @LucaCappelletti94, @chaokunyang, @WangTaoTheTonic, @fyrestone, @raulchen, @ConeyLiu, @stephanie-wang, @suquark, @ashione, @Coac, @JosephTLucas, @ericl, @AmeerHajAli, @pdames
Source code(tar.gz)
Source code(zip)
ray-0.8.5(May 7, 2020)
Highlight

You can now cancel remote tasks using the ray.cancel API.

PyTorch is now a first-class citizen in RLlib! We've achieved parity between TensorFlow and PyTorch.

Did you struggle to find good example code for Ray ML libraries? We wrote more examples for Ray SGD and Ray Serve.

Ray serve: Keras/Tensorflow, PyTorch, Scikit-Learn.

Ray SGD: New Semantic Segmentation and HuggingFace GLUE Fine-tuning Examples.

Core

Task cancellation is now available for locally submitted tasks. (#7699)

Experimental support for recovering objects that were lost from the Ray distributed memory store. You can try this out by setting lineage_pinning_enabled: 1 in the internal config. (#7733)

RLlib

PyTorch support has now reached parity with TensorFlow. (#7926, #8188, #8120, #8101, #8106, #8104, #8082, #7953, #7984, #7836, #7597, #7797)

Improved callbacks API. (#6972)

Enable Ray distributed reference counting. (#8037)

Work towards customizable distributed training workflows. (#7958, #8077)

Tune

Documentation has improved with a new format. (#8083, #8201, #7716)

Search algorithms are refactored to make them easier to extend, deprecating max_concurrent argument. (#7037, #8258, #8285)

TensorboardX errors are now handled safely. (#8174)

Bug fix in PBT checkpointing. (#7794)

New ZOOpt search algorithm added. (#7960)

Serve

Improved APIs.

Add delete_endpoint and delete_backend. (#8252, #8256)

Use dictionary to update backend config. (#8202)

Added overview section to the documentation.

Added tutorials for serving models in Tensorflow/Keras, PyTorch, and Scikit-Learn.

Made serve clusters tolerant to process failures. (#8116, #8008,#7970,#7936)

SGD

New Semantic Segmentation and HuggingFace GLUE Fine-tuning Examples. (#7792, #7825)

Fix GPU Reservations in SLURM usage. (#8157)

Update learning rate scheduler stepping parameter. (#8107)

Make serialization of data creation optional. (#8027)

Automatic DDP wrapping is now optional. (#7875)

Others Projects

Progress towards the highly available and fault tolerant control plane. (#8144, #8119, #8145, #7909, #7949, #7771, #7557, #7675)

Progress towards the Ray streaming library. (#8044, #7827, #7955, #7961, #7348)

Autoscaler improvement. (#8178, #8168, #7986, #7844, #7717)

Progress towards Java support. (#8014)

Progress towards the Window compatibility. (#8237, #8186)

Progress towards cross language support. (#7711)

Thanks

We thank the following contributors for their work on this release:

@simon-mo, @robertnishihara, @BalaBalaYi, @ericl, @kfstorm, @tirkarthi, @nflu, @ffbin, @chaokunyang, @ijrsvt, @pcmoritz, @mehrdadn, @sven1977, @iamhatesz, @nmatthews-asapp, @mitchellstern, @edoakes, @anabranch, @billowkiller, @eisber, @ujvl, @allenyin55, @yncxcw, @deanwampler, @DavidMChan, @ConeyLiu, @micafan, @rkooo567, @datayjz, @wizardfishball, @sumanthratna, @ashione, @marload, @stephanie-wang, @richardliaw, @jovany-wang, @MissiontoMars, @aannadi, @fyrestone, @JarnoRFB, @wumuzi520, @roireshef, @acxz, @gramhagen, @Servon-Lee, @ClarkZinzow, @mfitton, @maximsmol, @janblumenkamp, @istoica
Source code(tar.gz)
Source code(zip)

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Related tags

Overview

Quick Start

Tune Quick Start

RLlib Quick Start

Ray Serve Quick Start

More Information

Getting Involved

Comments

What do these changes do?

What do these changes do?

Related issue number

What do these changes do?

Related issue number

Why are these changes needed?

Related issue number

Checks

What do these changes do?

Related issue number

Linter

What do these changes do?

Why are these changes needed?

Related issue number

Checks

What do these changes do?

Why are these changes needed?

Related issue number

Checks

What do these changes do?

How do these changes work?

Related issue number

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity

Why are these changes needed?

Related issue number

Checks

10.2.0 (2022-12-03)

10.1.0 (2022-09-21)

10.0.0 (2022-08-10)

9.6.0 (2022-05-06)

9.5.0 (2022-03-29)

9.4.0 (2022-03-13)

9.3.1 (2022-01-10)

9.3.0 (2021-12-13)

9.2.2 (2021-10-06)

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity

Why are these changes needed?

Related issue number

Checks

Releases(ray-2.2.0)

ray-2.2.0(Dec 13, 2022)

Release Highlights

Ray Libraries

Ray AIR

Ray Data Processing

Ray Train

Ray Tune

Ray Serve

RLlib

Ray Core and Ray Clusters

Ray Core

Ray Clusters

Dashboard

ray-2.1.0(Nov 8, 2022)

Release Highlights

Ray Libraries

Ray AIR

Ray Data Processing

Ray Train

Ray Tune