TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Overview

TF-Agents makes implementing, deploying, and testing new Bandits and RL algorithms easier. It provides well tested and modular components that can be modified and extended. It enables fast code iteration, with good test integration and benchmarking.

To get started, we recommend checking out one of our Colab tutorials. If you need an intro to RL (or a quick recap), start here. Otherwise, check out our DQN tutorial to get an agent up and running in the Cartpole environment. API documentation for the current stable release is on tensorflow.org.

TF-Agents is under active development and interfaces may change at any time. Feedback and comments are welcome.

Table of contents

Agents
Tutorials
Multi-Armed Bandits
Examples
Installation
Contributing
Releases
Principles
Citation
Disclaimer

Agents

In TF-Agents, the core elements of RL algorithms are implemented as Agents. An agent has two main responsibilities: defining a Policy for interacting with the Environment, and learning/training that Policy from collected experience.
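
A minimal sketch of what an Agent looks like in practice, following the DQN tutorial (the CartPole environment and the hyperparameters here are illustrative assumptions, not the only option):

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))
q_net = q_network.QNetwork(env.observation_spec(), env.action_spec())

agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # illustrative learning rate
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()

# agent.policy is used for evaluation/deployment, agent.collect_policy for
# gathering experience, and agent.train(experience) updates the Q-network.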

A range of Bandits and RL algorithms is available under TF-Agents; see the subdirectories of tf_agents/agents/ for the complete, up-to-date list.

Tutorials

See docs/tutorials/ for tutorials on the major components provided.

Multi-Armed Bandits

The TF-Agents library contains a comprehensive Multi-Armed Bandits suite, including Bandits environments and agents. RL agents can also be used on Bandit environments. There is a tutorial in bandits_tutorial.ipynb and ready-to-run examples in tf_agents/bandits/agents/examples/v2.

Examples

End-to-end examples of training agents can be found under each agent's directory, e.g. the train_eval.py scripts under tf_agents/agents/sac/examples/v2/.
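
For example, the SAC script can be launched as follows (the root_dir value is just an illustrative output path):

$ python tf_agents/agents/sac/examples/v2/train_eval.py \
    --root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
    --alsologtostderr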

Installation

TF-Agents publishes nightly and stable builds. For a list of releases read the Releases section. The commands below cover installing TF-Agents stable and nightly from pypi.org as well as from a GitHub clone.

Stable

Run the commands below to install the most recent stable release. API documentation for the release is on tensorflow.org.

$ pip install --user tf-agents[reverb]

# Use this tag to get the matching examples and colabs.
$ git clone https://github.com/tensorflow/agents.git
$ cd agents
$ git checkout v0.6.0
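
# Optional sanity check (a hedged suggestion): confirm the package imports and report its version.
$ python -c "import tf_agents"
$ pip show tf-agents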

If you want to install TF-Agents with versions of TensorFlow or Reverb that are flagged as incompatible by the pip dependency check, use the pattern below at your own risk.

$ pip install --user tensorflow
$ pip install --user dm-reverb
$ pip install --user tf-agents

If you want to use TF-Agents with TensorFlow 1.15 or 2.0, install version 0.3.0:

# Newer versions of tensorflow-probability require newer versions of TensorFlow.
$ pip install tensorflow-probability==0.8.0
$ pip install tf-agents==0.3.0

Nightly

Nightly builds include newer features, but may be less stable than the versioned releases. The nightly build is pushed as tf-agents-nightly. We suggest installing nightly versions of TensorFlow (tf-nightly) and TensorFlow Probability (tfp-nightly), as those are the versions TF-Agents nightly is tested against.

To install the nightly build version, run the following:

# `--force-reinstall` helps guarantee the right versions.
$ pip install --user --force-reinstall tf-nightly
$ pip install --user --force-reinstall tfp-nightly
$ pip install --user --force-reinstall dm-reverb-nightly

# Installing with the `--upgrade` flag ensures you'll get the latest version.
$ pip install --user --upgrade tf-agents-nightly

From GitHub

After cloning the repository, the dependencies can be installed by running pip install -e .[tests]. TensorFlow needs to be installed independently: pip install --user tf-nightly.

Contributing

We're eager to collaborate with you! See CONTRIBUTING.md for a guide on how to contribute. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

Releases

TF-Agents has stable and nightly releases. The nightly releases are often fine but can have issues due to upstream libraries being in flux. The table below lists the version(s) of TensorFlow tested with each TF-Agents release, to help users who may be locked into a specific version of TensorFlow. 0.3.0 was the last release compatible with Python 2.

Release   Branch / Tag   TensorFlow Version
Nightly   master         tf-nightly
0.7.1     v0.7.1         2.4.0
0.6.0     v0.6.0         2.3.0
0.5.0     v0.5.0         2.2.0
0.4.0     v0.4.0         2.1.0
0.3.0     v0.3.0         1.15.0 and 2.0.0

Principles

This project adheres to Google's AI principles. By participating in, using, or contributing to this project, you are expected to adhere to these principles.

Citation

If you use this code, please cite it as:

@misc{TFAgents,
  title = {{TF-Agents}: A library for Reinforcement Learning in TensorFlow},
  author = {Sergio Guadarrama and Anoop Korattikara and Oscar Ramirez and
     Pablo Castro and Ethan Holly and Sam Fishman and Ke Wang and
     Ekaterina Gonina and Neal Wu and Efi Kokiopoulou and Luciano Sbaiz and
     Jamie Smith and Gábor Bartók and Jesse Berent and Chris Harris and
     Vincent Vanhoucke and Eugene Brevdo},
  howpublished = {\url{https://github.com/tensorflow/agents}},
  url = "https://github.com/tensorflow/agents",
  year = 2018,
  note = "[Online; accessed 25-June-2019]"
}

Disclaimer

This is not an official Google product.

Comments
  • Error loading DqnAgent saved model.

    I am creating a tf-agent DqnAgent in the following code:

        tf_agent = dqn_agent.DqnAgent(
            train_env.time_step_spec(),
            train_env.action_spec(),
            q_network=q_net,
            optimizer=optimizer,
            td_errors_loss_fn=dqn_agent.element_wise_squared_loss,
            train_step_counter=train_step_counter
    )
    

    During the training loop I am saving this model with

        tf.saved_model.save(tf_agent, saved_models_path)
    

    Once trained, I want to load saved model with

        if tf.saved_model.contains_saved_model(saved_models_path):
            tf_agent = tf.saved_model.load(saved_models_path)
    

    This code will load the saved model only if the folder in saved_models_path contains one. The function contains_saved_model(saved_models_path) returns True, so the model is loaded, but there is an exception and the program crashes:

        Traceback (most recent call last):
            File "/home/claudino/Projetos/dino-tf-agents/dino_ia/model/agent.py", line 50, in <module>
                tf_agent = tf.saved_model.load(saved_models_path)
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 408, in load
                return load_internal(export_dir, tags)
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 432, in load_internal
                export_dir)
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 58, in __init__
                self._load_all()
            File "/home/claudino/Projetos/dino-tf-agents/venv/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 168, in _load_all
                slot_variable = optimizer_object.add_slot(
            AttributeError: '_UserObject' object has no attribute 'add_slot'
    
            Process finished with exit code 1
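
    Not part of the original report: a commonly suggested alternative is to export only the policy with PolicySaver and load it back as a SavedModel. A minimal sketch, assuming the same tf_agent and saved_models_path as above:

        from tf_agents.policies import policy_saver

        # Export just the trained policy rather than the whole agent object.
        saver = policy_saver.PolicySaver(tf_agent.policy)
        saver.save(saved_models_path)

        # The loaded object exposes action(); pass it a batched TimeStep, e.g.:
        loaded_policy = tf.saved_model.load(saved_models_path)
        # action_step = loaded_policy.action(time_step)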
    
    opened by andreclaudino 23
  • TRAIN TF-AGENTS WITH MULTIPLE GPUs

    Hi, I finally got my VM up and running using: 2x Tesla P100, NVIDIA driver 440.33.01, CUDA 10.2, tensorflow==2.1.0, tf_agents==0.3.0.

    I started training a custom model/env based on the SAC agent v2 train loop, but only one GPU is used. My question: at the moment, is tf-agents able to manage distributed training on multiple GPUs, or should I use only one?
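
    A sketch of one option, not a confirmed answer for the 0.3.0 release used here: newer TF-Agents versions ship tf_agents.train.utils.strategy_utils, which returns a MirroredStrategy spanning all visible GPUs; networks and the agent must be built inside the strategy scope so their variables are mirrored across devices.

        from tf_agents.train.utils import strategy_utils

        # MirroredStrategy over all visible GPUs.
        strategy = strategy_utils.get_strategy(tpu=False, use_gpu=True)
        with strategy.scope():
            ...  # build the actor/critic networks and the SAC agent here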

    type:support level:p1 
    opened by JCMiles 22
  • network.create_variables() clogs all GPU memory

    On calling network.create_variables() for my agent (using a DDPG agent), my GPU memory gets used 100% instantly and never clears up. I can control it by using a virtual memory cap, but I need memory for other computation downstream (CNN etc.) and the memory cap ensures there is no memory left for anything else.

    Why might this be happening and how do I get around this?
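
    A sketch of a common mitigation (an assumption, not a confirmed fix for this report): by default TensorFlow reserves nearly all GPU memory at start-up, and enabling memory growth makes it allocate only what it actually needs, leaving room for downstream computation.

        import tensorflow as tf

        # Must run before any GPU op is executed.
        for gpu in tf.config.list_physical_devices('GPU'):
            tf.config.experimental.set_memory_growth(gpu, True)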

    opened by PrieureDeSion 20
  • tf-agents SAC 10x slower than stable-baselines on same hardware

    I am running a simple test of SAC using the LunarLanderContinuous-v2 environment. Training is for 500,000 steps with a replay buffer of size 50,000 (see code below). tf-agents takes over 10 hours to complete training, whereas the stable-baselines implementation of SAC with the same hyperparameters only takes 39 minutes. I've checked and double-checked my versions of CUDA, tensorflow-gpu, tf-agents, etc. and cannot speed things up.

    Here are the details to reproduce:

    Ubuntu 16.04, tf-agents==0.3.0, tensorflow-gpu==1.15.0, gym==0.15.4, CUDA==10.0, cudnn==7.6.5, stable-baselines==2.9.0a0, GPU==Quadro M4000 8Gb, CPU==i7 64 Gb

    My tf-agents test script is simply the v2 train_eval.py script from the sac/examples after substituting the LunarLanderContinuous-v2 environment for Half Cheetah and changing the hyperparameters as you can see below:

    # coding=utf-8
    # Copyright 2018 The TF-Agents Authors.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    r"""Train and Eval SAC.
    
    To run:
    
    #bash
    #tensorboard --logdir $HOME/tmp/sac/gym/HalfCheetah-v2/ --port 2223 &
    #
    #python tf_agents/agents/sac/examples/v2/train_eval.py \
    #  --root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
    #  --alsologtostderr
    #```
    #"""
    
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import os
    import time
    
    from absl import app
    from absl import flags
    from absl import logging
    
    import gin
    import tensorflow as tf
    
    from tf_agents.agents.ddpg import critic_network
    from tf_agents.agents.sac import sac_agent
    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.environments import parallel_py_environment
    from tf_agents.environments import suite_mujoco
    from tf_agents.environments import tf_py_environment
    from tf_agents.eval import metric_utils
    from tf_agents.metrics import tf_metrics
    from tf_agents.networks import actor_distribution_network
    from tf_agents.networks import normal_projection_network
    from tf_agents.policies import greedy_policy
    from tf_agents.policies import random_tf_policy
    from tf_agents.replay_buffers import tf_uniform_replay_buffer
    from tf_agents.utils import common
    
    flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
                        'Root directory for writing logs/summaries/checkpoints.')
    flags.DEFINE_multi_string('gin_file', None, 'Path to the trainer config files.')
    flags.DEFINE_multi_string('gin_param', None, 'Gin binding to pass through.')
    
    FLAGS = flags.FLAGS
    
    
    @gin.configurable
    def normal_projection_net(action_spec,
                              init_action_stddev=0.35,
                              init_means_output_factor=0.1):
      del init_action_stddev
      return normal_projection_network.NormalProjectionNetwork(
          action_spec,
          mean_transform=None,
          state_dependent_std=True,
          init_means_output_factor=init_means_output_factor,
          std_transform=sac_agent.std_clip_transform,
          scale_distribution=True)
    
    
    _DEFAULT_REWARD_SCALE = 0
    
    
    @gin.configurable
    def train_eval(
        root_dir,
        env_name='LunarLanderContinuous-v2',
        eval_env_name=None,
        env_load_fn=suite_mujoco.load,
        num_iterations=500000,
        actor_fc_layers=(64, 64),
        critic_obs_fc_layers=None,
        critic_action_fc_layers=None,
        critic_joint_fc_layers=(64, 64),
        num_parallel_environments=1,
        # Params for collect
        initial_collect_steps=100,
        collect_steps_per_iteration=1,
        replay_buffer_capacity=50000,
        # Params for target update
        target_update_tau=0.005,
        target_update_period=1,
        # Params for train
        train_steps_per_iteration=1,
        batch_size=64,
        actor_learning_rate=3e-4,
        critic_learning_rate=3e-4,
        alpha_learning_rate=3e-4,
        td_errors_loss_fn=tf.compat.v1.losses.mean_squared_error,
        gamma=0.99,
        reward_scale_factor=_DEFAULT_REWARD_SCALE,
        gradient_clipping=None,
        use_tf_functions=True,
        # Params for eval
        num_eval_episodes=100,
        eval_interval=1000,
        # Params for summaries and logging
        train_checkpoint_interval=10000,
        policy_checkpoint_interval=5000,
        rb_checkpoint_interval=50000,
        log_interval=1000,
        summary_interval=1000,
        summaries_flush_secs=10,
        debug_summaries=False,
        summarize_grads_and_vars=False,
        eval_metrics_callback=None):
      """A simple train and eval for SAC on Mujoco.
    
      All hyperparameters come from the original SAC paper
      (https://arxiv.org/pdf/1801.01290.pdf).
      """
    
      if reward_scale_factor == _DEFAULT_REWARD_SCALE:
        # Use value recommended by https://arxiv.org/abs/1801.01290
        if env_name.startswith('Humanoid'):
          reward_scale_factor = 20.0
        else:
          reward_scale_factor = 5.0
    
      root_dir = os.path.expanduser(root_dir)
    
      summary_writer = tf.compat.v2.summary.create_file_writer(
          root_dir, flush_millis=summaries_flush_secs * 1000)
      summary_writer.set_as_default()
    
      eval_metrics = [
          tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes),
          tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)
      ]
    
      global_step = tf.compat.v1.train.get_or_create_global_step()
      with tf.compat.v2.summary.record_if(
          lambda: tf.math.equal(global_step % summary_interval, 0)):
        # create training environment
        if num_parallel_environments == 1:
          py_env = env_load_fn(env_name)
        else:
          py_env = parallel_py_environment.ParallelPyEnvironment(
              [lambda: env_load_fn(env_name)] * num_parallel_environments)
        tf_env = tf_py_environment.TFPyEnvironment(py_env)
        # create evaluation environment
        eval_env_name = eval_env_name or env_name
        eval_py_env = env_load_fn(eval_env_name)
        eval_tf_env = tf_py_environment.TFPyEnvironment(eval_py_env)
    
        time_step_spec = tf_env.time_step_spec()
        observation_spec = time_step_spec.observation
        action_spec = tf_env.action_spec()
    
        actor_net = actor_distribution_network.ActorDistributionNetwork(
            observation_spec,
            action_spec,
            fc_layer_params=actor_fc_layers,
            continuous_projection_net=normal_projection_net)
        critic_net = critic_network.CriticNetwork(
            (observation_spec, action_spec),
            observation_fc_layer_params=critic_obs_fc_layers,
            action_fc_layer_params=critic_action_fc_layers,
            joint_fc_layer_params=critic_joint_fc_layers)
    
        tf_agent = sac_agent.SacAgent(
            time_step_spec,
            action_spec,
            actor_network=actor_net,
            critic_network=critic_net,
            actor_optimizer=tf.compat.v1.train.AdamOptimizer(
                learning_rate=actor_learning_rate),
            critic_optimizer=tf.compat.v1.train.AdamOptimizer(
                learning_rate=critic_learning_rate),
            alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
                learning_rate=alpha_learning_rate),
            target_update_tau=target_update_tau,
            target_update_period=target_update_period,
            td_errors_loss_fn=td_errors_loss_fn,
            gamma=gamma,
            reward_scale_factor=reward_scale_factor,
            gradient_clipping=gradient_clipping,
            debug_summaries=debug_summaries,
            summarize_grads_and_vars=summarize_grads_and_vars,
            train_step_counter=global_step)
        tf_agent.initialize()
    
        # Make the replay buffer.
        replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
            data_spec=tf_agent.collect_data_spec,
            batch_size=num_parallel_environments,
            max_length=replay_buffer_capacity)
        replay_observer = [replay_buffer.add_batch]
    
        env_steps = tf_metrics.EnvironmentSteps(prefix='Train')
        average_return = tf_metrics.AverageReturnMetric(
            prefix='Train',
            buffer_size=num_eval_episodes,
            batch_size=tf_env.batch_size)
        train_metrics = [
            tf_metrics.NumberOfEpisodes(prefix='Train'),
            env_steps,
            average_return,
            tf_metrics.AverageEpisodeLengthMetric(
                prefix='Train',
                buffer_size=num_eval_episodes,
                batch_size=tf_env.batch_size),
        ]
    
        eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
        initial_collect_policy = random_tf_policy.RandomTFPolicy(
            tf_env.time_step_spec(), tf_env.action_spec())
        collect_policy = tf_agent.collect_policy
    
        train_checkpointer = common.Checkpointer(
            ckpt_dir=os.path.join(root_dir, 'train'),
            agent=tf_agent,
            global_step=global_step,
            metrics=metric_utils.MetricsGroup(train_metrics, 'train_metrics'))
        policy_checkpointer = common.Checkpointer(
            ckpt_dir=os.path.join(root_dir, 'policy'),
            policy=eval_policy,
            global_step=global_step)
        rb_checkpointer = common.Checkpointer(
            ckpt_dir=os.path.join(root_dir, 'replay_buffer'),
            max_to_keep=1,
            replay_buffer=replay_buffer)
    
        train_checkpointer.initialize_or_restore()
        rb_checkpointer.initialize_or_restore()
    
        initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
            tf_env,
            initial_collect_policy,
            observers=replay_observer + train_metrics,
            num_steps=initial_collect_steps)
    
        collect_driver = dynamic_step_driver.DynamicStepDriver(
            tf_env,
            collect_policy,
            observers=replay_observer + train_metrics,
            num_steps=collect_steps_per_iteration)
    
        if use_tf_functions:
          initial_collect_driver.run = common.function(initial_collect_driver.run)
          collect_driver.run = common.function(collect_driver.run)
          tf_agent.train = common.function(tf_agent.train)
    
        # Collect initial replay data.
        if env_steps.result() == 0 or replay_buffer.num_frames() == 0:
          logging.info(
              'Initializing replay buffer by collecting experience for %d steps'
              'with a random policy.', initial_collect_steps)
          initial_collect_driver.run()
    
        results = metric_utils.eager_compute(
            eval_metrics,
            eval_tf_env,
            eval_policy,
            num_episodes=num_eval_episodes,
            train_step=env_steps.result(),
            summary_writer=summary_writer,
            summary_prefix='Eval',
        )
        if eval_metrics_callback is not None:
          eval_metrics_callback(results, env_steps.result())
        metric_utils.log_metrics(eval_metrics)
    
        time_step = None
        policy_state = collect_policy.get_initial_state(tf_env.batch_size)
    
        time_acc = 0
        env_steps_before = env_steps.result().numpy()
    
        # Dataset generates trajectories with shape [Bx2x...]
        dataset = replay_buffer.as_dataset(
            num_parallel_calls=3, sample_batch_size=batch_size,
            num_steps=2).prefetch(3)
        iterator = iter(dataset)
    
        def train_step():
          experience, _ = next(iterator)
          return tf_agent.train(experience)
    
        if use_tf_functions:
          train_step = common.function(train_step)
    
        for _ in range(num_iterations):
          start_time = time.time()
          time_step, policy_state = collect_driver.run(
              time_step=time_step,
              policy_state=policy_state,
          )
          for _ in range(train_steps_per_iteration):
            train_step()
          time_acc += time.time() - start_time
    
          if global_step.numpy() % log_interval == 0:
            logging.info('env steps = %d, average return = %f', env_steps.result(),
                         average_return.result())
            env_steps_per_sec = (env_steps.result().numpy() -
                                 env_steps_before) / time_acc
            logging.info('%.3f env steps/sec', env_steps_per_sec)
            tf.compat.v2.summary.scalar(
                name='env_steps_per_sec',
                data=env_steps_per_sec,
                step=env_steps.result())
            time_acc = 0
            env_steps_before = env_steps.result().numpy()
    
          for train_metric in train_metrics:
            train_metric.tf_summaries(train_step=env_steps.result())
    
          if global_step.numpy() % eval_interval == 0:
            results = metric_utils.eager_compute(
                eval_metrics,
                eval_tf_env,
                eval_policy,
                num_episodes=num_eval_episodes,
                train_step=env_steps.result(),
                summary_writer=summary_writer,
                summary_prefix='Eval',
            )
            if eval_metrics_callback is not None:
              eval_metrics_callback(results, env_steps.result())
            metric_utils.log_metrics(eval_metrics)
    
          global_step_val = global_step.numpy()
          if global_step_val % train_checkpoint_interval == 0:
            train_checkpointer.save(global_step=global_step_val)
    
          if global_step_val % policy_checkpoint_interval == 0:
            policy_checkpointer.save(global_step=global_step_val)
    
          if global_step_val % rb_checkpoint_interval == 0:
            rb_checkpointer.save(global_step=global_step_val)
    
    
    def main(_):
      tf.compat.v1.enable_v2_behavior()
      logging.set_verbosity(logging.INFO)
      gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_param)
      train_eval(FLAGS.root_dir)
    
    
    if __name__ == '__main__':
      flags.mark_flag_as_required('root_dir')
      app.run(main)
    

    My stable-baselines script looks like this:

    import gym
    import numpy as np
    
    from stable_baselines.common.vec_env import DummyVecEnv
    from stable_baselines.common import make_vec_env
    from stable_baselines.sac.policies import MlpPolicy
    from stable_baselines import SAC
    
    env = make_vec_env('LunarLanderContinuous-v2', n_envs=1)
    
    model_name = "sac_lunar_lander"
    
    model = SAC(MlpPolicy, env, verbose=1, tensorboard_log="./tensorboard_logs/stable_baselines_test")
    
    model.learn(total_timesteps=500000, log_interval=10)
    model.save(model_name)
    
    

    Finally, here is the output when I run the tf-agents script to show that the GPU is being detected and used:

    2019-12-22 11:26:35.054589: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
    2019-12-22 11:26:35.068596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
    name: Quadro M4000 major: 5 minor: 2 memoryClockRate(GHz): 0.7725
    pciBusID: 0000:01:00.0
    2019-12-22 11:26:35.068767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
    2019-12-22 11:26:35.069770: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2019-12-22 11:26:35.070479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
    2019-12-22 11:26:35.070640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
    2019-12-22 11:26:35.071572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
    2019-12-22 11:26:35.072306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
    2019-12-22 11:26:35.074604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2019-12-22 11:26:35.075808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
    2019-12-22 11:26:35.076022: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2019-12-22 11:26:35.080915: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3407920000 Hz
    2019-12-22 11:26:35.081214: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555945a77880 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2019-12-22 11:26:35.081228: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    2019-12-22 11:26:35.144953: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555945a9b180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
    2019-12-22 11:26:35.144974: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro M4000, Compute Capability 5.2
    2019-12-22 11:26:35.145550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
    name: Quadro M4000 major: 5 minor: 2 memoryClockRate(GHz): 0.7725
    pciBusID: 0000:01:00.0
    2019-12-22 11:26:35.145578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
    2019-12-22 11:26:35.145588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2019-12-22 11:26:35.145597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
    2019-12-22 11:26:35.145605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
    2019-12-22 11:26:35.145629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
    2019-12-22 11:26:35.145650: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
    2019-12-22 11:26:35.145674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    2019-12-22 11:26:35.146551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
    2019-12-22 11:26:35.146575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
    2019-12-22 11:26:35.147375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-12-22 11:26:35.147384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
    2019-12-22 11:26:35.147388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
    2019-12-22 11:26:35.148348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6876 MB memory) -> physical GPU (device: 0, name: Quadro M4000, pci bus id: 0000:01:00.0, compute capability: 5.2)
    /home/patrick/src/gym/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
      warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
    WARNING:tensorflow:From /home/patrick/src/tf_agents/tf_agents/agents/ddpg/critic_network.py:141: The name tf.keras.initializers.RandomUniform is deprecated. Please use tf.compat.v1.keras.initializers.RandomUniform instead.
    
    W1222 11:26:35.589284 140187933329152 module_wrapper.py:139] From /home/patrick/src/tf_agents/tf_agents/agents/ddpg/critic_network.py:141: The name tf.keras.initializers.RandomUniform is deprecated. Please use tf.compat.v1.keras.initializers.RandomUniform instead.
    
    2019-12-22 11:26:35.600509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    WARNING:tensorflow:From /home/patrick/src/tf_agents/tf_agents/distributions/utils.py:92: AffineScalar.__init__ (from tensorflow_probability.python.bijectors.affine_scalar) is deprecated and will be removed after 2020-01-01.
    Instructions for updating:
    `AffineScalar` bijector is deprecated; please use `tfb.Shift(loc)(tfb.Scale(...))` instead.
    W1222 11:26:35.787435 140187933329152 deprecation.py:323] From /home/patrick/src/tf_agents/tf_agents/distributions/utils.py:92: AffineScalar.__init__ (from tensorflow_probability.python.bijectors.affine_scalar) is deprecated and will be removed after 2020-01-01.
    Instructions for updating:
    `AffineScalar` bijector is deprecated; please use `tfb.Shift(loc)(tfb.Scale(...))` instead.
    I1222 11:26:35.814536 140187933329152 common.py:920] Checkpoint available: tensorboard_logs/tf_agents_v2/train/ckpt-30000
    I1222 11:26:35.902629 140187933329152 common.py:920] Checkpoint available: tensorboard_logs/tf_agents_v2/policy/ckpt-35000
    I1222 11:26:35.908307 140187933329152 common.py:923] No checkpoint available at tensorboard_logs/tf_agents_v2/replay_buffer
    I1222 11:26:35.910735 140187933329152 tf_agents_v2_lunar_lander.py:267] Initializing replay buffer by collecting experience for 100 stepswith a random policy.
    WARNING:tensorflow:From /home/patrick/src/tf_agents/tf_agents/metrics/tf_metrics.py:161: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where
    W1222 11:26:36.424730 140187933329152 deprecation.py:323] From /home/patrick/src/tf_agents/tf_agents/metrics/tf_metrics.py:161: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where
    I1222 11:28:23.095548 140187933329152 metric_utils.py:47]  
    		 AverageReturn = 1.452040195465088
    		 AverageEpisodeLength = 501.0
    I1222 11:28:34.015443 140187933329152 tf_agents_v2_lunar_lander.py:314] env steps = 31200, average return = -80.228371
    I1222 11:28:34.015817 140187933329152 tf_agents_v2_lunar_lander.py:317] 131.060 env steps/sec
    etc.
    

    And the output from nvidia-smi while running the script:

    Sun Dec 22 11:29:16 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro M4000        Off  | 00000000:01:00.0  On |                  N/A |
    | 51%   56C    P0    43W / 120W |   7865MiB /  8104MiB |     10%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1370      G   /usr/lib/xorg/Xorg                           435MiB |
    |    0      2062      G   compiz                                       146MiB |
    |    0      3479      G   ...uest-channel-token=17571043003057555071   211MiB |
    |    0     17466      C   python                                      7057MiB |
    +-----------------------------------------------------------------------------+
    
    type:performance level:p1 
    opened by pirobot 18
  • tf-agents-nightly installed on colab seems very different from the master branch

    tf-agents-nightly installed on Colab seems very different from the master branch. The experimental examples folder is missing. Not 100% sure if this is a Colab issue or a tf-agents issue.

    opened by chokosabe 17
  • Problem with importing the "reverb" package with Tutorial: SAC minitaur with the Actor-Learner API

    Hi,

    I am getting an ImportError when trying to import the "reverb" package as done in the tutorial.

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-2-38745e83da94> in <module>
          4 import matplotlib.pyplot as plt
          5 import os
    ----> 6 import reverb
          7 import tempfile
          8 import PIL.Image
    
    ~/Desktop/AI/ai_venv/lib/python3.7/site-packages/reverb/__init__.py in <module>
         25 # pylint: enable=g-bad-import-order
         26 
    ---> 27 from reverb import item_selectors as selectors
         28 from reverb import rate_limiters
         29 
    
    ~/Desktop/AI/ai_venv/lib/python3.7/site-packages/reverb/item_selectors.py in <module>
         17 import functools
         18 
    ---> 19 from reverb import pybind
         20 
         21 Fifo = pybind.FifoSelector
    
    ~/Desktop/AI/ai_venv/lib/python3.7/site-packages/reverb/pybind.py in <module>
    ----> 1 import tensorflow as _tf; from .libpybind import *; del _tf
    
    ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
    

    I have tried to export this variable: export LD_LIBRARY_PATH=/home/orie/Desktop/AI/ai_venv/lib/

    I have also tried including this environment variable in my python notebook:

    import os
    os.environ['LD_LIBRARY_PATH'] = '/home/orie/Desktop/AI/ai_venv/lib/'
    

    I also tried: sudo ldconfig /home/orie/Desktop/AI/ai_venv/lib. I'm using Ubuntu and a virtual environment.

    Thanks to anyone who helps!

    opened by orshemtov 16
  • DQN Agent Issue With Custom Environment

    So I've been following the DQN agent example/tutorial and I set it up like in the example; the only difference is that I built my own custom Python environment, which I then wrapped in TensorFlow. However, no matter how I shape my observation and action specs, I can't seem to get it to work whenever I give it an observation and request an action. Here's the error that I get:

    tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0] is not a matrix. Instead it has shape [10] [Op:MatMul]

    Here's how I'm setting up my agent:

        layer_parameters = (10,) #10 layers deep, shape is unspecified
        
        #placeholders 
        learning_rate = 1e-3  # @param {type:"number"}
        train_step_counter = tf.Variable(0)
    
        #instantiate agent
    
        optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
        
        env = SumoEnvironment(self._num_actions,self._num_states)
        env2 = tf_py_environment.TFPyEnvironment(env)
        q_net= q_network.QNetwork(env2.observation_spec(),env2.action_spec(),fc_layer_params = layer_parameters)
        
        print("Time step spec")
        print(env2.time_step_spec())
    
        agent = dqn_agent.DqnAgent(env2.time_step_spec(),
        env2.action_spec(),
        q_network=q_net,
        optimizer = optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)
    

    And here's how I'm setting up my environment:

    class SumoEnvironment(py_environment.PyEnvironment):

    def __init__(self, no_of_Actions, no_of_Observations):
    
        #this means that the observation consists of a number of arrays equal to self._num_states, with datatype float32
        self._observation_spec = specs.TensorSpec(shape=(16,),dtype=np.float32,name='observation')
        #action spec, shape unknown, min is 0, max is the number of actions
        self._action_spec = specs.BoundedArraySpec(shape=(1,),dtype=np.int32,minimum=0,maximum=no_of_Actions-1,name='action')
        
       
        self._state = 0
        self._episode_ended = False
    

    And here is what my input / observations look like:

    tf.Tensor([ 0. 0. 0. 0. 0. 0. 0. 0. -1. -1. -1. -1. 0. 0. 0. -1.], shape=(16,), dtype=float32)

    I've tried experimenting with the shape and depth of my Q_Net and it seems to me that the [10] in the error is related to the shape of my q network. Setting its layer parameters to (4,) yields an error of:

    tensorflow.python.framework.errors_impl.InvalidArgumentError: In[0] is not a matrix. Instead it has shape [4] [Op:MatMul]
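
    A hedged sketch of spec definitions in the style of the custom-environment tutorial (not a diagnosis of this particular report): DqnAgent expects a scalar action spec, and py_environment subclasses typically use ArraySpecs rather than TensorSpecs.

        import numpy as np
        from tf_agents.specs import array_spec

        observation_spec = array_spec.ArraySpec(
            shape=(16,), dtype=np.float32, name='observation')
        action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0,
            maximum=9,  # hypothetical: 10 discrete actions
            name='action')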

    opened by IbraheemNofal 16
  • Feature request make it easier to supply custom model

    I tried assigning my own layers to the post_processing variable within my categorical QNetwork, but I get a message that weights are shared when I then try to create my categorical DQN agent. It would be nice if the main categorical Q-network constructor allowed a parameter for providing a set of Keras layers, where the q_layer is just appended to the end like it is in the encoding network scheme, and the weights would be copied for you.

    opened by ben-arnao 15
  • AttributeError: 'tuple' object has no attribute 'rank'

    Trying out the most basic example on

    • Windows 10
    • Python 3.7
    • tensorflow 2.1.0
    • tf-agents 0.4.0

    Error i get

    Traceback (most recent call last):
      File "src\agent.py", line 58, in <module>
        action, _states = agent.policy.action(obs)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\policies\tf_policy.py", line 279, in action
        step = action_fn(time_step=time_step, policy_state=policy_state, seed=seed)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\common.py", line 154, in with_check_resource_vars
        return fn(*fn_args, **fn_kwargs)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\policies\random_tf_policy.py", line 89, in _action
        outer_dims = nest_utils.get_outer_shape(time_step, self._time_step_spec)
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\nest_utils.py", line 394, in get_outer_shape
        nested_tensor, spec, num_outer_dims=num_outer_dims):
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\nest_utils.py", line 97, in is_batched_nested_tensors
        if any(spec_shape.rank is None for spec_shape in spec_shapes):
      File "C:\Users\andre\.virtualenvs\ZeusTrader\lib\site-packages\tf_agents\utils\nest_utils.py", line 97, in <genexpr>
        if any(spec_shape.rank is None for spec_shape in spec_shapes):
    AttributeError: 'tuple' object has no attribute 'rank'
    
    

    Code i run

    import tensorflow as tf
    from collections import Counter, defaultdict
    from tf_agents.networks import q_network
    from tf_agents.utils import common
    from tf_agents.agents.dqn import dqn_agent
    from tf_agents.agents.random.random_agent import RandomAgent
    from tf_agents.environments import suite_gym
    from environment import StockExchangeEnv01
    
    # tried with and without..error persists
    # tf.compat.v1.enable_v2_behavior()
    
    learning_rate = 0.0001
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
    
    # tried both my own Environment and the basic "cartpole-v0"
    train_env = StockExchangeEnv01()
    env_name = 'CartPole-v0'
    #train_env = suite_gym.load(env_name)
    
    train_env.reset()
    print(train_env.action_spec())
    """
    # Neural Net of the Agent. This NN will get x (env) and spit out y (action).
    q_net = q_network.QNetwork(
      train_env.observation_spec(),
      train_env.action_spec(),
      fc_layer_params=(100,))
    print(train_env.action_spec())
    
    #
    agent = dqn_agent.DqnAgent(
      train_env.time_step_spec(),
      train_env.action_spec(),
      q_network=q_net,
      optimizer=optimizer)
    """
    
    # tried both..dqn agent and random agent
    
    agent = RandomAgent(
        train_env.time_step_spec(),
        train_env.action_spec()
    )
    agent.initialize()
    
    obs = train_env.reset()
    actions = Counter()
    pnl = defaultdict(float)
    total_rewards = 0.0
    
    for i in range(300):
        #action, _states = model.predict(obs)
        action, _states = agent.policy.action(obs)
        obs, rewards, dones, info = train_env.step(action)
        actions[action[0].item()] += 1
        pnl[action[0].item()] += rewards
        total_rewards += rewards
        if dones:
            break
    
    print('actions : {}'.format(actions))
    print('rewards : {}'.format(total_rewards))
    
    

    The code in tf-agents gets the 'shape' from the action_spec, which is a tuple in my case. Then it tries to access the 'rank' attribute on that tuple.

    What am I missing?
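
    A sketch of the tutorial-style interaction loop, not a diagnosis of this specific report (CartPole and the agent built above are assumptions): the policy expects the batched TimeStep produced by a TFPyEnvironment, rather than the raw output of a py environment's reset()/step().

    from tf_agents.environments import suite_gym, tf_py_environment

    # Wrap the py environment so observations/actions are batched tensors.
    tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

    time_step = tf_env.reset()
    for _ in range(300):
        action_step = agent.policy.action(time_step)
        time_step = tf_env.step(action_step.action)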

    opened by AndreyBulezyuk 15
  • Memory leak with DqnAgent

    I have built a basic DQN agent to play in the CartPole environment by following the DQN tutorial: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial However, after a couple of hours of training I noticed that the process was increasing its memory consumption substantially. I was able to simplify the training script to narrow down the problem and figured out that memory leaks whenever the driver uses agent.policy or agent.collect_policy (replacing it with RandomTFPolicy eliminates the issue):

    import tensorflow as tf
    import gc
    
    from tf_agents.environments import suite_gym, tf_py_environment
    from tf_agents.networks import q_network
    from tf_agents.agents.dqn import dqn_agent
    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.utils import common
    
    tf.compat.v1.enable_v2_behavior()
    
    # Create CartPole as TFPyEnvironment
    env = suite_gym.load('CartPole-v0')
    tf_env = tf_py_environment.TFPyEnvironment(env)
    
    # Create DQN Agent
    q_net = q_network.QNetwork(
            tf_env.observation_spec(),
            tf_env.action_spec(),
            fc_layer_params=(100,))
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    train_step_counter = tf.Variable(0)
    
    agent = dqn_agent.DqnAgent(
        tf_env.time_step_spec(),
        tf_env.action_spec(),
        q_network=q_net,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)
    
    agent.initialize()
    
    # Replacing agent.collect_policy with tf_policy eliminates issue a of memory leak
    # tf_policy = random_tf_policy.RandomTFPolicy(action_spec=train_env.action_spec(),
    #                                            time_step_spec=train_env.time_step_spec())
    
    # Create dynamic step driver with no observers
    driver = dynamic_step_driver.DynamicStepDriver(
        env = tf_env,
        policy = agent.collect_policy,
        observers = [],
        num_steps = 1)
    
    # Calls to driver end up continuously increasing memory consumption 
    while True:
        driver.run()
        # One of the possible solutions is to call gc.collect() but it significantly slows down training
    

    The other hotfix, as mentioned in the code above, is to call gc.collect() after each driver.run(), but that has a huge impact on performance.

    This memory leak prevents a long-running training process, which might be a bit of a bummer for more complex environments based on DQN.

    Running setup:

    • Ubuntu 20.10 / 64-bit
    • Python 3.8.6 + tensorflow==2.4.1 + tf-agents==0.7.1
    • Running on the CPU: AMD Ryzen Threadripper 3960x
    • RAM: 128GB

    Same script has been also run within Docker container and confirmed memory leak.

    What could be the cause of this problem, and how can it be fixed properly?
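
    A sketch of a mitigation worth trying (an assumption, not a confirmed fix from this thread): wrapping driver.run in common.function traces the collect step once instead of rebuilding ops on every eager call, which is the pattern used in the tutorials.

    # Assumes the driver and the `common` import from the script above.
    driver.run = common.function(driver.run)

    while True:
        driver.run()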

    opened by romandunets 14
  • OOM after a couple of iterations

    I am running DQN on an Atari game (BeamRider-v0). I just take the input image, flatten it, and connect it to a fully connected layer with 32 neurons. It runs for 14000 iterations on a Tesla V100 GPU. After 14000 iterations, I get OOM. Is there a memory leak? I am using tf-nightly-gpu-2.0-preview. I have also tried tf-nightly-gpu and the same problem exists. My question is: why don't I get the error in the very first iterations? What causes memory usage to grow for 14000 iterations?

    opened by siavash-khodadadeh 14
  • AttributeError: module 'tree' has no attribute 'assert_same_structure'

    When I import tf_agents, there is no error. However, when I run "from tf_agents.agents.dqn import dqn_agent", it gives me: AttributeError: module 'tree' has no attribute 'assert_same_structure'.

    opened by abbiesgame 0
  • collect_step slow speed

    Hi, I'm referencing the official TensorFlow website example, which shows the collect_step function and its usage as follows.

    def collect_step(environment, policy):
      time_step = environment.current_time_step()
      action_step = policy.action(time_step)
      next_time_step = environment.step(action_step.action)
      traj = trajectory.from_transition(time_step, action_step, next_time_step)
    
      # Add trajectory to the replay buffer
      replay_buffer.add_batch(traj)
    
    for _ in range(initial_collect_steps):
      collect_step(train_env, random_policy)
    
    for _ in range(num_iterations):
    
      # Collect a few steps using collect_policy and save to the replay buffer.
      for _ in range(collect_steps_per_iteration):
        collect_step(train_env, agent.collect_policy)
    

    However, when collecting many steps, the above code is quite slow. To my understanding, the reason it is slow is the communication between GPU and CPU for each action. If I am wrong, please let me know.

    I wonder if there is any way to speed this up with the TensorFlow library functions, so that the collect_step iteration can run inside the GPU for faster training.
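
    A sketch of one way to do this, assuming the same train_env, agent, replay_buffer, collect_steps_per_iteration and num_iterations as in the tutorial snippet above: a DynamicStepDriver wrapped in common.function keeps the whole collect loop inside a single graph call instead of paying a Python/device round-trip per action.

    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.utils import common

    collect_driver = dynamic_step_driver.DynamicStepDriver(
        train_env,
        agent.collect_policy,
        observers=[replay_buffer.add_batch],
        num_steps=collect_steps_per_iteration)
    # Trace the collection graph once instead of rebuilding it every call.
    collect_driver.run = common.function(collect_driver.run)

    time_step = None
    policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)
    for _ in range(num_iterations):
        time_step, policy_state = collect_driver.run(
            time_step=time_step, policy_state=policy_state)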

    Thanks in advance.

    Best Regards, Jack Lu

    opened by jacklu333333 0
  • SAC minitaur with the Actor-Learner API demonstrator fails

    I ran the SAC minitaur with the Actor-Learner API code from the tutorial.

    At first I got an error saying I needed to upgrade TensorFlow to version 2.11.0, because of an incompatibility with tensorflow-probability:

    • tensorflow 2.11.0
    • tensorflow-estimator 2.11.0
    • tensorflow-intel 2.11.0
    • tensorflow-io-gcs-filesystem 0.27.0
    • tensorflow-probability 0.19.0
    • termcolor 2.0.1
    • terminado 0.17.0
    • tf-agents 0.15.0

    After the upgrade I get the following error when importing any tf_agents module:

        File "C:\tools\lib\site-packages\tf_agents\__init__.py", line 55, in _ensure_tf_install
          tf_version = tf.version.VERSION
        AttributeError: module 'tensorflow' has no attribute 'version'

    opened by ThorAvaTahr 0
  • Errors with numpy 1.24.0

    I tried to use the latest version of tf-agents. However, if I run a simple class which only extends PyEnvironment and nothing else, I receive an error with a message like

    module 'numpy' does not contain attribute named 'bool'. Did you mean 'bool_'

    There are several similar issues with numpy; sometimes re-installing numpy helps. In my case it didn't: I tried the common workflow of uninstalling setuptools and numpy.

    I'm using:

    • Python 3.10 (Python 3.11 doesn't work also...)
    • Numpy 1.24.0

    Is there anything I've missed?
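
    A common workaround (an assumption, not from this thread): np.bool was removed in NumPy 1.24, so pinning NumPy below 1.24 until the library drops those usages typically avoids this error.

    $ pip install "numpy<1.24"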

    opened by sebastianknopf 3
  • PPO with Mini-Batches Tutorial

    The documentation of PPO describes the training process of PPO as the following:

    # Build PPO agent
    ppo_agent = PPOClipAgent(num_epochs=40, ...)
    
    # Build Replay Buffer
    replay_buffer = TFUniformReplayBuffer(data_spec=ppo_agent.collect_data_spec,batch_size=env.batch_size, max_length=1000)
    
    # Train agent
    experiences, _ = replay_buffer.gather_all()
    loss = ppo_agent.train(experiences).loss
    replay_buffer.clear()
    

    However, that way ppo_agent is trained with one large batch of experiences for 40 epochs. If the number of experiences is high (e.g. 1024 experiences), you might want to train PPO on mini-batches (e.g. 4 mini-batches of 256 experiences, 40 epochs per mini-batch).

    The only way to do that is to build a dataset from replay_buffer and fetch experiences by iterating the dataset. However, this produces random batches, instead of equally selected mini-batches:

    # Use 1 epoch per batch
    ppo_agent = PPOClipAgent(num_epochs=1, ...)
    
    # Build dataset iter
    dataset = replay_buffer.as_dataset(sample_batch_size=200, num_steps=2, num_parallel_calls=2).prefetch(2)
    dataset_iter = iter(dataset)
    
    # Training part
    loss = 0
    for _ in range(40):
        for _ in range(4):
            mini_batch_experiences, _ = next(dataset_iter)
            loss += ppo_agent.train(mini_batch_experiences)
    replay_buffer.clear()
    loss /= (40*4)
    

    However, this approach has the following issue: it randomly selects 256 experiences from the memory in a uniform way, but that doesn't ensure that each experience will be selected equally often. Is there a better method to train PPO? Also, for some reason, this takes much more time to train than using a single batch as in the first approach, and gets worse training results, so am I missing something else here?

    opened by kochlisGit 0