TensorFlow Reinforcement Learning

DeepMind

Last update: Dec 29, 2022

Overview

TRFL

TRFL (pronounced "truffle") is a library built on top of TensorFlow that exposes several useful building blocks for implementing Reinforcement Learning agents.

Installation

TRFL can be installed from pip with the following command: pip install trfl

TRFL works with both the CPU and GPU versions of TensorFlow. To allow for that, it does not list TensorFlow as a requirement, so you need to install TensorFlow and TensorFlow Probability separately if you haven't already done so.
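Because TensorFlow is not declared as a dependency, it can help to confirm that all three packages are visible to the interpreter before running any TRFL code. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of TRFL.

```python
import importlib.util

# Import names of the packages TRFL relies on, per the installation note above.
REQUIRED = ("tensorflow", "tensorflow_probability", "trfl")


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("TensorFlow, TensorFlow Probability and TRFL are all installed.")
```

Note that this only checks that the packages can be located. It does not load TRFL's compiled op library, so it would not catch the import-time errors reported in the comments below.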

Usage Example

import tensorflow as tf
import trfl

# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
q_tm1 = tf.get_variable(
    "q_tm1", initializer=[[1., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
q_t = tf.get_variable(
    "q_t", initializer=[[0., 1., 0.], [1., 2., 0.]], dtype=tf.float32)

# Action indices, discounts and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
r_t = tf.constant([1, 1], dtype=tf.float32)
pcont_t = tf.constant([0, 1], dtype=tf.float32)  # the discount factor

# Q-learning loss, and auxiliary data.
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

loss is the tensor representing the loss. For Q-learning, it is half the squared difference between the predicted Q-values and the TD targets, shape [batch_size]. Extra information is in the q_learning namedtuple, including q_learning.td_error and q_learning.target.
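To make the numbers concrete, the same quantities can be recomputed by hand for the example inputs. This is a plain-Python sketch, assuming the standard Q-learning target r_t + pcont_t * max_a q_t and a TD error defined as target minus the value of the action taken; the function name qlearning_reference is illustrative and not part of TRFL.

```python
# Plain-Python re-computation of the Q-learning loss for the example inputs.
q_tm1 = [[1.0, 1.0, 0.0], [1.0, 2.0, 0.0]]
q_t = [[0.0, 1.0, 0.0], [1.0, 2.0, 0.0]]
a_tm1 = [0, 1]
r_t = [1.0, 1.0]
pcont_t = [0.0, 1.0]


def qlearning_reference(q_tm1, a_tm1, r_t, pcont_t, q_t):
    """Return per-example (loss, td_error, target) lists."""
    targets, td_errors, losses = [], [], []
    for q_prev, action, reward, pcont, q_next in zip(q_tm1, a_tm1, r_t, pcont_t, q_t):
        target = reward + pcont * max(q_next)   # bootstrapped TD target
        td_error = target - q_prev[action]      # error for the action taken
        targets.append(target)
        td_errors.append(td_error)
        losses.append(0.5 * td_error ** 2)      # half the squared TD error
    return losses, td_errors, targets


loss, td_error, target = qlearning_reference(q_tm1, a_tm1, r_t, pcont_t, q_t)
print(target)    # [1.0, 3.0]
print(td_error)  # [0.0, 1.0]
print(loss)      # [0.0, 0.5]
```

The first example has pcont_t = 0, so its target is just the reward; the second bootstraps from the best next-step value.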

The loss tensor can be differentiated to derive the corresponding RL update.

reduced_loss = tf.reduce_mean(loss)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(reduced_loss)

All loss functions in the package return both a loss tensor and a namedtuple with extra information, using the above convention, but different functions may have different extra fields. Check the documentation of each function below for more information.
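The return convention can be illustrated without TensorFlow. The sketch below uses made-up namedtuple names (LossOutput, TDExtra) and a toy scalar loss; TRFL's own types and fields may be named differently, so treat this only as a picture of the pattern.

```python
import collections

# Illustrative names only; TRFL's own namedtuple types may be named differently.
LossOutput = collections.namedtuple("LossOutput", ["loss", "extra"])
TDExtra = collections.namedtuple("TDExtra", ["target", "td_error"])


def toy_td_loss(prediction, target):
    """Return (loss, extra) for one scalar prediction, mimicking the convention."""
    td_error = target - prediction
    return LossOutput(loss=0.5 * td_error ** 2,
                      extra=TDExtra(target=target, td_error=td_error))


# Tuple unpacking mirrors `loss, q_learning = trfl.qlearning(...)` above.
loss, extra = toy_td_loss(prediction=2.0, target=3.0)
print(loss, extra.td_error, extra.target)  # 0.5 1.0 3.0
```

Returning a namedtuple keeps the two-value unpacking short while still letting callers reach the extra fields by name.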

Documentation

Check out the full documentation page here.

Comments

Raise "error: could not create 'build': File exists" while installing

When I first installed trfl, an error was raised near the end of the installation:

Failed building wheel for trfl
Running setup.py clean for trfl
Failed to build trfl
Installing collected packages: trfl
Running setup.py install for trfl ... error

The subsequent output is:

running install
running build
running build_py
creating build
error: could not create 'build': File exists

opened by ruifengma 21

ImportError: cannot import name gen_distribution_ops

When I try to import trfl, similarly to this public trfl colab notebook online, I get

(Note I tried this in both python 2 and 3 notebooks, met with the same results)

<ipython-input-3-dd69192d7d7c> in <module>()
----> 1 import trfl

/usr/local/lib/python2.7/dist-packages/trfl/__init__.py in <module>()
     29 from trfl.discrete_policy_gradient_ops import discrete_policy_gradient_loss
     30 from trfl.discrete_policy_gradient_ops import sequence_advantage_actor_critic_loss
---> 31 from trfl.dist_value_ops import categorical_dist_double_qlearning
     32 from trfl.dist_value_ops import categorical_dist_qlearning
     33 from trfl.dist_value_ops import categorical_dist_td_learning

/usr/local/lib/python2.7/dist-packages/trfl/dist_value_ops.py in <module>()
     31 import tensorflow as tf
     32 from trfl import base_ops
---> 33 from trfl import distribution_ops
     34 
     35 Extra = collections.namedtuple("dist_value_extra", ["target"])

/usr/local/lib/python2.7/dist-packages/trfl/distribution_ops.py in <module>()
     28 import tensorflow as tf
     29 import tensorflow_probability as tfp
---> 30 from trfl import gen_distribution_ops
     31 
     32 

ImportError: cannot import name gen_distribution_ops

(Also, if I install trfl via pip instead of cloning from git, error messages look similar with this added on the end)


/usr/local/lib/python2.7/dist-packages/trfl/gen_distribution_ops.py in <module>()
      1 import tensorflow as tf
----> 2 _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
      3 project_distribution = _op_lib.project_distribution
      4 del _op_lib, tf

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.pyc in load_op_library(library_filename)
     59     RuntimeError: when unable to load the library or get the python wrappers.
     60   """
---> 61   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     62 
     63   op_list_str = py_tf.TF_GetOpList(lib_handle)

opened by ryanprinster 19

import trfl not working

I am using Spyder (Python 3.6) on Ubuntu 18.04.

import tensorflow
import trfl

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Traceback (most recent call last):
  File "", line 1, in <module>
    import trfl
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/__init__.py", line 31, in <module>
    from trfl.dist_value_ops import categorical_dist_double_qlearning
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/dist_value_ops.py", line 33, in <module>
    from trfl import distribution_ops
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/distribution_ops.py", line 30, in <module>
    from trfl import gen_distribution_ops
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/gen_distribution_ops.py", line 2, in <module>
    _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
  File "/home/dd/.local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
NotFoundError: /home/dd/.local/lib/python3.6/site-packages/trfl/_gen_distribution_ops.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl11string_viewEPFPNS_8OpKernelEPNS_20OpKernelConstructionEE

opened by DeepakIITJ 9

Building on macOS

EDIT: Fixed the issue and amended the pull request with the modifications, see https://github.com/deepmind/trfl/pull/12#issuecomment-471858294

Background

Due to the recent changes in the TRFL installation procedure, I ran into some issues running TRFL on macOS which broke my local TF dev environment. There were no pre-built wheels for macOS, so I proceeded to attempt to build from source with some modifications to build on macOS.

Environment Details

macOS 10.14.3
Bazel 0.23.1
TensorFlow 1.12
TensorFlow Probability 0.5

Changes & Issues

The build_pip_pkg.sh script was updated to check for Darwin platforms, and instead use greadlink -f (from GNU Core Utils via Homebrew). No other changes were needed to proceed with the build.

The build seemingly went smoothly on macOS (and was reproduced on a Linux/Ubuntu machine); see the full console output below. However, when importing TRFL using import trfl, I get the error: AttributeError: module '29f0280e24eacea242fe31b5dab40eba' has no attribute 'L_LL_ProjectL_Distribution' (see trace below)

I am currently unable to pinpoint the source of the problem as I'm not sure if I am missing some underlying Linux-only assumption somewhere in the build process, so any help would be really appreciated!

Stack Trace

Traceback (most recent call last):
  File "experiment_agent.py", line 19, in <module>
    from agents.actor import Actor
  File "/Users/Abdel/Developer/code/agents/actor.py", line 9, in <module>
    import trfl
  File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/__init__.py", line 31, in <module>
    from trfl.dist_value_ops import categorical_dist_double_qlearning
  File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/dist_value_ops.py", line 33, in <module>
    from trfl import distribution_ops
  File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/distribution_ops.py", line 30, in <module>
    from trfl import gen_distribution_ops
  File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/gen_distribution_ops.py", line 3, in <module>
    L_LL_ProjectL_Distribution = _op_lib.L_LL_ProjectL_Distribution
AttributeError: module '29f0280e24eacea242fe31b5dab40eba' has no attribute 'L_LL_ProjectL_Distribution'

Output of pip show -f trfl

Name: trfl
Version: 1.0
Summary: trfl is a library of building blocks for reinforcement learning algorithms.
Home-page: http://www.github.com/deepmind/trfl/
Author: DeepMind
Author-email: [email protected]
License: Apache 2.0
Location: /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages
Requires: dm-sonnet, absl-py, six, wrapt, numpy
Required-by:
Files:
  trfl-1.0.dist-info/INSTALLER
  trfl-1.0.dist-info/METADATA
  trfl-1.0.dist-info/RECORD
  trfl-1.0.dist-info/WHEEL
  trfl-1.0.dist-info/top_level.txt
  trfl/__init__.py
  trfl/__pycache__/__init__.cpython-36.pyc
  trfl/__pycache__/action_value_ops.cpython-36.pyc
  trfl/__pycache__/base_ops.cpython-36.pyc
  trfl/__pycache__/clipping_ops.cpython-36.pyc
  trfl/__pycache__/discrete_policy_gradient_ops.cpython-36.pyc
  trfl/__pycache__/dist_value_ops.cpython-36.pyc
  trfl/__pycache__/distribution_ops.cpython-36.pyc
  trfl/__pycache__/dpg_ops.cpython-36.pyc
  trfl/__pycache__/gen_distribution_ops.cpython-36.pyc
  trfl/__pycache__/indexing_ops.cpython-36.pyc
  trfl/__pycache__/periodic_ops.cpython-36.pyc
  trfl/__pycache__/pixel_control_ops.cpython-36.pyc
  trfl/__pycache__/policy_gradient_ops.cpython-36.pyc
  trfl/__pycache__/retrace_ops.cpython-36.pyc
  trfl/__pycache__/sequence_ops.cpython-36.pyc
  trfl/__pycache__/target_update_ops.cpython-36.pyc
  trfl/__pycache__/value_ops.cpython-36.pyc
  trfl/__pycache__/vtrace_ops.cpython-36.pyc
  trfl/_gen_distribution_ops.so
  trfl/action_value_ops.py
  trfl/base_ops.py
  trfl/clipping_ops.py
  trfl/discrete_policy_gradient_ops.py
  trfl/dist_value_ops.py
  trfl/distribution_ops.py
  trfl/dpg_ops.py
  trfl/gen_distribution_ops.py
  trfl/indexing_ops.py
  trfl/periodic_ops.py
  trfl/pixel_control_ops.py
  trfl/policy_gradient_ops.py
  trfl/retrace_ops.py
  trfl/sequence_ops.py
  trfl/target_update_ops.py
  trfl/value_ops.py
  trfl/vtrace_ops.py

Console Output from Building TRFL

in trfl/ on master
› ./configure.sh
rm: .bazelrc: No such file or directory
using installed tensorflow

in trfl/ on master
› bazel build -c opt :build_pip_pkg
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
WARNING: /private/var/tmp/_bazel_Abdel/75cbfa485f5681f3bc2b2ed75d0c5ecd/external/local_config_tf/BUILD:3504:1: target 'libtensorflow_framework.so' is both a rule and a file; please choose another name for the rule
WARNING: /private/var/tmp/_bazel_Abdel/75cbfa485f5681f3bc2b2ed75d0c5ecd/external/local_config_tf/BUILD:5:12: in hdrs attribute of cc_library rule @local_config_tf//:tf_header_lib: file '_api_implementation.so' from target '@local_config_tf//:tf_header_include' is not allowed in hdrs
WARNING: /private/var/tmp/_bazel_Abdel/75cbfa485f5681f3bc2b2ed75d0c5ecd/external/local_config_tf/BUILD:5:12: in hdrs attribute of cc_library rule @local_config_tf//:tf_header_lib: file '_message.so' from target '@local_config_tf//:tf_header_include' is not allowed in hdrs
INFO: Analysed target //:build_pip_pkg (18 packages loaded, 254 targets configured).
INFO: Found 1 target...
Target //:build_pip_pkg up-to-date:
  bazel-bin/build_pip_pkg
INFO: Elapsed time: 31.389s, Critical Path: 18.91s
INFO: 5 processes: 5 darwin-sandbox.
INFO: Build completed successfully, 11 total actions

in trfl/ on master
› mkdir /tmp/trfl_wheels

in trfl/ on master
› ./bazel-bin/build_pip_pkg /tmp/trfl_wheels
++ uname -s
++ tr A-Z a-z
+ PLATFORM=darwin
+ PIP_FILE_PREFIX=bazel-bin/build_pip_pkg.runfiles/__main__/
+ main /tmp/trfl_wheels
+ [[ ! -z /tmp/trfl_wheels ]]
+ [[ /tmp/trfl_wheels == \m\a\k\e ]]
+ DEST=/tmp/trfl_wheels
+ shift
+ [[ ! -z '' ]]
+ [[ -z /tmp/trfl_wheels ]]
+ mkdir -p /tmp/trfl_wheels
+ [[ darwin == \d\a\r\w\i\n ]]
++ greadlink -f /tmp/trfl_wheels
+ DEST=/private/tmp/trfl_wheels
+ echo '=== destination directory: /private/tmp/trfl_wheels'
=== destination directory: /private/tmp/trfl_wheels
++ mktemp -d -t tmp.XXXXXXXXXX
+ TMPDIR=/var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
++ date
+ echo Wed 6 Mar 2019 13:18:36 AEDT : '=== Using tmpdir: /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2'
Wed 6 Mar 2019 13:18:36 AEDT : === Using tmpdir: /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
+ echo '=== Copy TRFL files'
=== Copy TRFL files
+ cp bazel-bin/build_pip_pkg.runfiles/__main__/LICENSE /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
+ cp bazel-bin/build_pip_pkg.runfiles/__main__/MANIFEST.in /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
+ cp bazel-bin/build_pip_pkg.runfiles/__main__/setup.py /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
+ rsync -avm -L '--exclude=*_test.py' bazel-bin/build_pip_pkg.runfiles/__main__/trfl /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
building file list ... done
trfl/
trfl/__init__.py
trfl/_gen_distribution_ops.so
trfl/action_value_ops.py
trfl/base_ops.py
trfl/clipping_ops.py
trfl/discrete_policy_gradient_ops.py
trfl/dist_value_ops.py
trfl/distribution_ops.py
trfl/dpg_ops.py
trfl/gen_distribution_ops.py
trfl/indexing_ops.py
trfl/periodic_ops.py
trfl/pixel_control_ops.py
trfl/policy_gradient_ops.py
trfl/retrace_ops.py
trfl/sequence_ops.py
trfl/target_update_ops.py
trfl/value_ops.py
trfl/vtrace_ops.py

sent 210894 bytes  received 444 bytes  140892.00 bytes/sec
total size is 209421  speedup is 0.99
+ pushd /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
/var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2 ~/trfl
++ date
+ echo Wed 6 Mar 2019 13:18:37 AEDT : '=== Building wheel'
Wed 6 Mar 2019 13:18:37 AEDT : === Building wheel
+ python setup.py bdist_wheel
warning: no files found matching '*.dll' under directory 'trfl/'
warning: no files found matching '*.lib' under directory 'trfl/'
warning: no files found matching '*.pyd' under directory 'trfl/'
+ cp dist/trfl-1.0-cp36-cp36m-macosx_10_7_x86_64.whl /private/tmp/trfl_wheels
+ popd
~/trfl
+ rm -rf /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
++ date
+ echo Wed 6 Mar 2019 13:18:38 AEDT : '=== Output wheel file is in: /private/tmp/trfl_wheels'
Wed 6 Mar 2019 13:18:38 AEDT : === Output wheel file is in: /private/tmp/trfl_wheels

in trfl/ on master
› pip install /tmp/trfl_wheels/*.whl
Processing /tmp/trfl_wheels/trfl-1.0-cp36-cp36m-macosx_10_7_x86_64.whl
Requirement already satisfied: six in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (1.11.0)
Requirement already satisfied: dm-sonnet in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (1.27)
Requirement already satisfied: absl-py in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (0.2.0)
Requirement already satisfied: wrapt in /Users/Abdel/.local/lib/python3.6/site-packages (from trfl==1.0) (1.10.11)
Requirement already satisfied: numpy in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (1.14.5)
Requirement already satisfied: semantic-version in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from dm-sonnet->trfl==1.0) (2.6.0)
Requirement already satisfied: contextlib2 in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from dm-sonnet->trfl==1.0) (0.5.5)
Installing collected packages: trfl
Successfully installed trfl-1.0

opened by abdel 6

Issue with pip install trfl on MacOs

Hello,

I get the following error when using pip install trfl

Could not find a version that satisfies the requirement trfl (from versions: )
No matching distribution found for trfl

I have tensorflow 1.13.1 & tensorflow-probability 0.60

Do you have an idea what the issue could be? Thanks in advance for your help

opened by HenrikMettler 3

Clarification of some abbreviations?

Dear Deepminder:

During a group meeting in which I introduced TRFL to my lab members, I was asked about the meaning of the abbreviations in the TRFL demo code, so I am asking here.

It reads:

q_tm1: the action value in the source state of a transition.
a_tm1: the action that was selected in the source state.

What does m1 mean here? I know "q" stands for action value and "t" stands for time step, but I could not work out what "m1" stands for; it is not so intuitive.

Could you please help me on that? Thanks a lot.
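For readers with the same question: the suffix tm1 reads as "t minus 1", so q_tm1 and a_tm1 belong to the source step of a transition, while q_t and r_t belong to the destination step. This matches the usage example's comment about "previous and next timesteps". A small sketch with made-up numbers shows how one trajectory splits into the two views:

```python
# A toy trajectory of per-step values, indexed by time step 0..3.
q_values = [0.5, 0.7, 0.2, 0.9]
rewards = [1.0, 0.0, 2.0]   # rewards[i] is received on the transition i -> i+1

# "tm1" reads as "t minus 1": the source side of each transition.
q_tm1 = q_values[:-1]       # values at steps 0, 1, 2
# "t" is the destination side of the same transition.
q_t = q_values[1:]          # values at steps 1, 2, 3
r_t = rewards               # reward observed on arriving at step t

for i, (prev, nxt, r) in enumerate(zip(q_tm1, q_t, r_t)):
    print(f"transition {i}->{i + 1}: q_tm1={prev}, r_t={r}, q_t={nxt}")
```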
opened by mingyr 3

policy_gradient_loss batch_shape requirements

Why does policy_gradient_ops.policy_gradient_loss require batch_shape to be a rank 2 tensor? This would limit the policy_gradient_loss operation to only single univariate distributions that implement log_prob?

For instance, consider the problem where the actions are multivariate and follow a normal distribution:

>>> import tensorflow as tf; tf.enable_eager_execution()
>>> import tensorflow.contrib.eager as tfe
>>> import tensorflow_probability as tfp
>>> import trfl
>>> loc = tfe.Variable(tf.zeros([5, 5, 2]))
>>> policy = tfp.distributions.Normal(loc=loc, scale=1.)
>>> policy
<tfp.distributions.Normal 'Normal/' batch_shape=(5, 5, 2) event_shape=() dtype=float32>
>>> trfl.policy_gradient_loss(policy, tf.zeros([5, 5, 2]), tf.ones([5, 5]), [loc])
Traceback (most recent call last):
  File "/trfl/policy_gradient_ops.py", line 119, in policy_gradient_loss
    policies_.batch_shape.assert_has_rank(2)
  File "/tensorflow/python/framework/tensor_shape.py", line 728, in assert_has_rank
    raise ValueError("Shape %s must have rank %d" % (self, rank))
ValueError: Shape (5, 5, 2) must have rank 2

I can understand why this is a requirement for a discrete distribution. But for the sake of supporting other distributions, it may be cleaner to require the log_prob to be rank 3 and then sum over the trailing (action) dimension:

>>> policy.log_prob(tf.zeros([5, 5, 2])).shape
TensorShape([Dimension(5), Dimension(5), Dimension(2)])
>>> tf.reduce_sum(policy.log_prob(tf.zeros([5, 5, 2])), axis=-1).shape
TensorShape([Dimension(5), Dimension(5)])

Thank you for your support and time.
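The reduction proposed above can be checked without TensorFlow. The sketch below assumes independent Normal action dimensions and uses nested lists in place of tensors; the helper names are illustrative only.

```python
import math


def normal_log_prob(x, loc, scale=1.0):
    """Log-density of a univariate Normal(loc, scale) evaluated at x."""
    z = (x - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)


def joint_log_prob(actions, locs, scale=1.0):
    """Sum per-dimension log-probs over the last (action) axis.

    `actions` and `locs` are nested lists of shape [T, B, A]; the result has
    shape [T, B], one joint log-probability per time step and batch entry.
    """
    return [
        [sum(normal_log_prob(a, m, scale) for a, m in zip(act, loc))
         for act, loc in zip(act_row, loc_row)]
        for act_row, loc_row in zip(actions, locs)
    ]


T, B, A = 5, 5, 2
zeros = [[[0.0] * A for _ in range(B)] for _ in range(T)]
log_probs = joint_log_prob(zeros, zeros)
print(len(log_probs), len(log_probs[0]))  # 5 5
```

In TensorFlow Probability, wrapping the Normal in tfp.distributions.Independent with reinterpreted_batch_ndims=1 should give a rank-2 batch_shape with the same summed log_prob, though that combination with policy_gradient_loss is untested here.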

opened by wenkesj 2

Add/alias dpg critic update

Hi, the DPG critic update (see Algorithm 1 of Lillicrap et al. 2016, https://arxiv.org/abs/1509.02971) is substantively the same as your td_learning function; however, this is currently obscured. I would suggest adding a dpg_qlearning function that aliases td_learning in dpg_ops.py:

from trfl.value_ops import td_learning
...
dpg_qlearning = td_learning

Alternatively, one could add a comment referencing the td_learning function in the dpg actor update function.

opened by spitis 2

Fixing documentation highlights

Fix unhighlighted function parameters (trfl/dpg_ops.py line 61), fix mis-highlighted parameters (trfl/dpg_ops.py line 54~58), fix the unhighlighted code snippets (trfl/sequence_ops.py), fix unhighlighted shapes (everything else).

opened by zuoanqh 2

Questions about retrace implementation

Hey,

I was looking at the retrace ops provided by trfl and there are a couple of implementation details that seem a bit confusing to me.

1. It seems like trfl retrace drops the discount terms from the 𝔼_π Q(x_t, .) term. This is in line with the retrace formulation in Equation 13 of the MPO paper [1], but is different from Equation 4 in the original retrace paper [2]. I have included a small test case below that shows this. Is this a bug or a conscious choice? Edit: actually, it seems like at least one of the terms is included in the continuation probs.

2. In the comments of retrace_ops._general_off_policy_corrected_multistep_target, it's mentioned that exp_q_t = 𝔼_π Q(x_{t+1},.) and qa_t = Q(x_t, a_t), indicating that exp_q_t should be one timestep ahead of qa_t: https://github.com/deepmind/trfl/blob/e633edbd9d326b8bebc7c7c7d53f37118b48a440/trfl/retrace_ops.py#L252-L253 However, if I understand this correctly, when those values are actually assigned, they come from the same time indices: https://github.com/deepmind/trfl/blob/e633edbd9d326b8bebc7c7c7d53f37118b48a440/trfl/retrace_ops.py#L263-L264 It's possible that the target_policy_t values that are used to index for exp_q_t somehow account for this, but I can't wrap my head around how they would do it. Am I misunderstanding something here or is it possible that these indices are actually off?

[1] Abdolmaleki, A., Springenberg, J.T., Tassa, Y., Munos, R., Heess, N. and Riedmiller, M., 2018. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
[2] Munos, R., Stepleton, T., Harutyunyan, A. and Bellemare, M., 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (pp. 1054-1062).
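For reference, the Retrace(λ) operator from [2] is usually written as below, with μ the behaviour policy, π the target policy and γ the discount. This is restated from memory of the paper rather than from the TRFL source, so the notation is an assumption and the paper's own equation numbering is not reproduced here.

```latex
\mathcal{R}Q(x, a) = Q(x, a)
  + \mathbb{E}_{\mu}\!\left[\sum_{t \ge 0} \gamma^{t}
    \Big(\prod_{s=1}^{t} c_s\Big)
    \big(r_t + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_t, a_t)\big)\right],
\qquad
c_s = \lambda \min\!\left(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\right)
```

Under this form, every appearance of γ is a per-step factor, which is consistent with the observation in the edit to question 1 that the discount may be carried by the continuation probabilities rather than by a separate argument.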

Code related to question 1:

The test case is simplified (e.g. just one action) and I have used a slightly modified version of trfl to make it compatible with tf2, but all the logic should be correct.

import numpy as np
import tensorflow as tf

from trfl import retrace_ops


lambda_ = 0.99
discount = 0.9
Q_values = np.array([
    [[2.2], [5.2]],
    [[7.2], [4.2]],
    [[3.2], [4.2]],
    [[2.2], [9.2]]], dtype=np.float32)
target_Q_values = np.array([
    [[2.], [5.]],
    [[7.], [4.]],
    [[3.], [4.]],
    [[2.], [9.]]], dtype=np.float32)
actions = np.array([
    [0, 0],
    [0, 0],
    [0, 0],
    [0, 0]])
rewards = np.array([
    [1.9, 2.9],
    [3.9, 4.9],
    [5.9, 6.9],
    [np.nan, np.nan],  # nan marks entries we should never use.
], dtype=np.float32)
pcontinues = np.array([
    [0.8, 0.9],
    [0.7, 0.8],
    [0.6, 0.5],
    [np.nan, np.nan]], dtype=np.float32)
target_policy_probs = np.array([
    [[np.nan] * 1, [np.nan] * 1],
    [[1.0], [1.0]],
    [[1.0], [1.0]],
    [[1.0], [1.0]]], dtype=np.float32)
behavior_policy_probs = np.array([
    [np.nan, np.nan],
    [1.0, 1.0],
    [1.0, 1.0],
    [1.0, 1.0]], dtype=np.float32)


def retrace_original_v1(
        lambda_,
        discount,
        target_Q_values,
        actions,
        rewards,
        target_policy_probs,
        behavior_policy_probs):
    actions = actions[1:, ...]
    rewards = rewards[:-1, ...]

    target_policy_probs = target_policy_probs[1:, ...]
    behavior_policy_probs = behavior_policy_probs[1:, ...]

    traces = lambda_ * np.minimum(
        1.0, target_policy_probs / behavior_policy_probs[..., None])

    deltas = (
        rewards[..., None]
        + discount * target_Q_values[1:]
        - target_Q_values[:-1])
    retraces = []
    for i in range(tf.shape(traces)[0]):
        sum_terms = []
        for t in range(i, tf.shape(traces)[0]):
            trace = tf.reduce_prod([
                traces[k]
                for k in range(i + 1, t + 1)
            ], axis=0)
            sum_term = discount ** (t - i) * trace * deltas[t]
            sum_terms.append(sum_term)

        result = tf.reduce_sum(sum_terms, axis=0)
        retraces.append(result)

    retraces = tf.stack(retraces) + target_Q_values[:-1]
    return retraces


output_original_v1 = retrace_original_v1(
    lambda_,
    1.0,
    target_Q_values,
    actions,
    rewards,
    target_policy_probs,
    behavior_policy_probs)
print(f"output_original_v1:\n{output_original_v1.numpy().round(3)}\n")

output_original_discounted_v1 = retrace_original_v1(
    lambda_,
    discount,
    target_Q_values,
    actions,
    rewards,
    target_policy_probs,
    behavior_policy_probs)
print(f"output_original_discounted_v1:\n{output_original_discounted_v1.numpy().round(3)}\n")


output_trfl_v1 = retrace_ops.retrace(
    lambda_,
    Q_values,
    target_Q_values,
    actions,
    rewards,
    tf.ones_like(rewards),
    target_policy_probs,
    behavior_policy_probs,
).extra.target[..., None]


tf.debugging.assert_near(output_original_v1, output_trfl_v1)  # succeeds
tf.debugging.assert_near(output_original_discounted_v1, output_trfl_v1)  # fails

opened by hartikainen 1

How is deterministic policy gradient being evaluated?

I cannot grasp the steps for lines 87 to 92 in trfl/blob/master/trfl/dpg_ops.py. Why is a target_a being created? The subsequent stop_gradient is understandable since we don't want to update the Q-network's trainable variables. But then, what does this loss represent in the next line? DPG to me is an application of the chain rule. How is the optimization of loss helping update the network?

I don't know if there is a better way to ask this question as I could not contact the authors of the dpg_ops.py (mainly Matteo Hessel and Miljan Martic) by any other means.
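One possible reading of the trick, assuming the code forms target_a = dqda + a, freezes it with stop_gradient, and then minimises half the squared distance between target_a and a: the gradient of that surrogate loss with respect to the action is exactly -dqda, so a gradient-descent step on the loss is a gradient-ascent step on Q, and backpropagating through the action into the policy network supplies the rest of the chain rule. The scalar sketch below uses a made-up critic and is not taken from dpg_ops.py.

```python
def q_value(a):
    """Toy critic: a concave function of a scalar action, peaked at a = 2."""
    return -(a - 2.0) ** 2


def dq_da(a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0)


def surrogate_loss(a, target_a):
    """0.5 * (target_a - a)^2, with target_a treated as a constant."""
    return 0.5 * (target_a - a) ** 2


a = 0.5
target_a = a + dq_da(a)  # frozen target, playing the role of stop_gradient

# Finite-difference gradient of the surrogate loss w.r.t. the action only.
eps = 1e-6
grad_loss = (surrogate_loss(a + eps, target_a)
             - surrogate_loss(a - eps, target_a)) / (2 * eps)

print(grad_loss)   # approximately -dq_da(a)
print(-dq_da(a))
```

Without the frozen target, the loss would depend on a through both terms and its gradient would no longer equal -dqda.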

opened by AvisekNaug 1

Fix legal_actions_mask bug in epsilon_greedy().

This PR addresses Issue https://github.com/deepmind/trfl/issues/27

Note that the bug must be addressed in two places. First, when selecting max_value - it must only be selected from legal actions. Second, when computing greedy_probs - there could be multiple action values achieving the max, but not all of them legal.

Also, I added legal_actions_mask to the list of values in the tf.name_scope context manager.

Aside from that, when there is no legal actions mask the epsilon_greedy function should execute exactly the same as before.

Happy to implement any small changes and if there’s a better fix altogether feel free to close this and implement it internally. Just thought I’d offer up a solution :)

opened by jhtschultz 0

Legal actions mask bug

Found a bug in epsilon_greedy() in policy_ops.py when applying legal_actions_mask. It fails when masking the action with the highest action value.

For example:

action_values = [2.0, 1.0, 1.0]
legal_actions_mask = [0., 1., 1.]
epsilon = 0.1
result = policy_ops.epsilon_greedy(action_values, epsilon, legal_actions_mask).probs

Outputs: [0.9 0.05 0.05]
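In the reported output the masked action 0 still receives probability 0.9. For comparison, a masked epsilon-greedy distribution can be sketched in plain Python as follows. This is an illustration of the intended behaviour (maximum taken over legal actions only, ties among legal maximisers split evenly), not the TRFL implementation, and the function name is hypothetical.

```python
def epsilon_greedy_probs(action_values, epsilon, legal_actions_mask=None):
    """Epsilon-greedy probabilities restricted to legal actions."""
    n = len(action_values)
    mask = [1.0] * n if legal_actions_mask is None else list(legal_actions_mask)
    legal = [i for i in range(n) if mask[i] > 0]
    if not legal:
        raise ValueError("At least one action must be legal.")

    # Take the max over legal actions only, then find the legal maximisers.
    max_value = max(action_values[i] for i in legal)
    greedy = [i for i in legal if action_values[i] == max_value]

    probs = [0.0] * n
    for i in legal:
        probs[i] = epsilon / len(legal)              # uniform exploration
    for i in greedy:
        probs[i] += (1.0 - epsilon) / len(greedy)    # greedy mass, ties split
    return probs


print(epsilon_greedy_probs([2.0, 1.0, 1.0], 0.1, [0.0, 1.0, 1.0]))
```

For the example inputs this assigns zero probability to the masked action and roughly 0.5 to each of the two tied legal actions.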
opened by jhtschultz 0

Retrace Ops: documented return shapes

Hi, it seems like the documented return shapes for the following functions might be off:

retrace_ops.retrace(...)
retrace_ops.retrace_core(...)
retrace_ops._general_off_policy_corrected_multistep_target(...)

The first two are documented to return shape [B] and the third shape [T, B, num_actions], while they all appear to return [T, B].

Some test code to check.

import numpy as np
import tensorflow as tf

from trfl import retrace_ops, indexing_ops


### Example input data: 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops_test.py#L41

lambda_ = 0.9
qs = [
    [[2.2, 3.2, 4.2],
     [5.2, 6.2, 7.2]],
    [[7.2, 6.2, 5.2],
     [4.2, 3.2, 2.2]],
    [[3.2, 5.2, 7.2],
     [4.2, 6.2, 9.2]],
    [[2.2, 8.2, 4.2],
     [9.2, 1.2, 8.2]]
     ]
targnet_qs = [
    [[2., 3., 4.],
     [5., 6., 7.]],
    [[7., 6., 5.],
     [4., 3., 2.]],
    [[3., 5., 7.],
     [4., 6., 9.]],
    [[2., 8., 4.],
     [9., 1., 8.]]
     ]
actions = [
    [2, 0], 
    [1, 2], 
    [0, 1], 
    [2, 0]
    ]
rewards = [
    [1.9, 2.9], 
    [3.9, 4.9], 
    [5.9, 6.9], 
    [np.nan, np.nan]  # nan marks entries we should never use.
    ]
pcontinues = [
    [0.8, 0.9], 
    [0.7, 0.8], 
    [0.6, 0.5], 
    [np.nan, np.nan]
    ]
target_policy_probs = [
    [[np.nan] * 3,
     [np.nan] * 3],
    [[0.41, 0.28, 0.31],
     [0.19, 0.77, 0.04]],
    [[0.22, 0.44, 0.34],
     [0.14, 0.25, 0.61]],
    [[0.16, 0.72, 0.12],
     [0.33, 0.30, 0.37]]
     ]
behaviour_policy_probs = [
    [np.nan, np.nan], 
    [0.85, 0.86], 
    [0.87, 0.88], 
    [0.89, 0.84]
    ]

### Retrace Test: ###
retrace = retrace_ops.retrace(
        lambda_, qs, targnet_qs, actions, rewards,
        pcontinues, target_policy_probs, behaviour_policy_probs)

# qs: shape [(T+1), B, num_actions] 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L85
T = len(qs) - 1  # sequence length
B = len(qs[0])  # batch dimension
N = len(qs[0][0])  # number of actions

# loss: documented shape [B] 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L121
tf.debugging.assert_equal(retrace.loss.shape, [T, B])  # succeeds

### Multi-step target Test: ###
timesteps = tf.shape(qs)[0] # Batch size is qs_shape[1].
timestep_indices_tm1 = tf.range(0, timesteps - 1)
timestep_indices_t = tf.range(1, timesteps)

target_policy_t = tf.gather(target_policy_probs, timestep_indices_t)
behaviour_policy_t = tf.gather(behaviour_policy_probs, timestep_indices_t)
a_t = tf.gather(actions, timestep_indices_t)
r_t = tf.gather(rewards, timestep_indices_tm1)
pcont_t = tf.gather(pcontinues, timestep_indices_tm1)
targnet_q_t = tf.gather(targnet_qs, timestep_indices_t)

c_t = retrace_ops._retrace_weights(
        indexing_ops.batched_index(target_policy_t, a_t),
        behaviour_policy_t) * lambda_

target = retrace_ops._general_off_policy_corrected_multistep_target(
  r_t, pcont_t, target_policy_t, c_t, targnet_q_t, a_t
)

# target: documented shape [T, B, N] 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L241
tf.debugging.assert_equal(target.shape, [T, B])  # succeeds

opened by tseyde 0

Pre-built python 3.7 packages

Pre-built wheel packages are not available for Python 3.7; see https://pypi.org/project/trfl/#files.

Since this library does not depend on old behaviors of Python 3 (if I am not wrong), it would be great to upload py37 packages to PyPI.
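The Python version a wheel targets is encoded in its filename, which is why an interpreter without a matching wheel falls back to a source build or finds nothing to install. The sketch below splits a wheel filename into its tags, assuming the standard name-version-python-abi-platform layout; the example filename is the one produced in the macOS build log earlier on this page, and the helper name is hypothetical.

```python
def wheel_tags(filename):
    """Split a wheel filename into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) not in (5, 6):  # an optional build tag adds a sixth field
        raise ValueError("unexpected wheel filename: %r" % filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


tags = wheel_tags("trfl-1.0-cp36-cp36m-macosx_10_7_x86_64.whl")
print(tags["python"], tags["platform"])  # cp36 macosx_10_7_x86_64
```

A cp36 tag marks the wheel as built for CPython 3.6, so a Python 3.7 interpreter would need a separate cp37 wheel.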

opened by wookayin 0
