TensorFlow Reinforcement Learning

Overview

TRFL

TRFL (pronounced "truffle") is a library built on top of TensorFlow that exposes several useful building blocks for implementing Reinforcement Learning agents.

Installation

TRFL can be installed from pip with the following command: pip install trfl

TRFL works with both the CPU and GPU versions of TensorFlow. To allow for this, it does not list TensorFlow as a requirement, so you need to install TensorFlow and TensorFlow Probability separately if you haven't already done so.
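
As a quick sanity check, all three packages should import cleanly once installed; for example, a minimal check (assuming a Python environment in which TensorFlow, TensorFlow Probability and TRFL have all been installed):

import tensorflow as tf
import tensorflow_probability as tfp
import trfl

# Print the installed TensorFlow and TensorFlow Probability versions.
print(tf.__version__, tfp.__version__)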

Usage Example

import tensorflow as tf
import trfl

# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
q_tm1 = tf.get_variable(
    "q_tm1", initializer=[[1., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
q_t = tf.get_variable(
    "q_t", initializer=[[0., 1., 0.], [1., 2., 0.]], dtype=tf.float32)

# Action indices, discounts and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
r_t = tf.constant([1, 1], dtype=tf.float32)
pcont_t = tf.constant([0, 1], dtype=tf.float32)  # the discount factor

# Q-learning loss, and auxiliary data.
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

loss is the tensor representing the loss. For Q-learning, it is half the squared difference between the predicted Q-values and the TD targets, shape [batch_size]. Extra information is in the q_learning namedtuple, including q_learning.td_error and q_learning.target.
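
As a by-hand check of that description (using the variable names from the snippet above and the standard Q-learning target r_t + pcont_t * max_a q_t, stated here as an assumption rather than quoted from the library), batch element 1 works out as follows:

import numpy as np

# Batch element 1 of the example: a_tm1 = 1, r_t = 1, pcont_t = 1.
q_tm1_1 = np.array([1., 2., 0.])
q_t_1 = np.array([1., 2., 0.])
target = 1. + 1. * q_t_1.max()    # TD target = 3.0, treated as a constant
td_error = target - q_tm1_1[1]    # = 1.0
loss_1 = 0.5 * td_error ** 2      # = 0.5, i.e. half the squared TD error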

The loss tensor can be differentiated to derive the corresponding RL update.

reduced_loss = tf.reduce_mean(loss)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(reduced_loss)

All loss functions in the package follow this convention, returning both a loss tensor and a namedtuple of extra information, though different functions expose different extra fields. Check each function's documentation for more information.
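
For completeness, a minimal sketch of evaluating the example above in a TensorFlow 1.x session and reading one of the extra fields (the print is purely illustrative):

# Run the training op and fetch the auxiliary td_error alongside the loss,
# e.g. for logging or for prioritised experience replay.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, loss_value, td_error_value = sess.run(
        [train_op, reduced_loss, q_learning.td_error])
    print(loss_value, td_error_value)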

Documentation

Check out the full documentation page here.

Comments
  • Raise "error: could not create 'build': File exists" while installing

    When I first installed trfl, it raised an error almost at the end of the installation:

    Failed building wheel for trfl
    Running setup.py clean for trfl
    Failed to build trfl
    Installing collected packages: trfl
    Running setup.py install for trfl ... error

    The further issue is:

    running install
    running build
    running build_py
    creating build
    error: could not create 'build': File exists

    opened by ruifengma 21
  • ImportError: cannot import name gen_distribution_ops

    When I try to import trfl, similarly to this public trfl colab notebook online, I get

    (Note I tried this in both python 2 and 3 notebooks, met with the same results)

    <ipython-input-3-dd69192d7d7c> in <module>()
    ----> 1 import trfl
    
    /usr/local/lib/python2.7/dist-packages/trfl/__init__.py in <module>()
         29 from trfl.discrete_policy_gradient_ops import discrete_policy_gradient_loss
         30 from trfl.discrete_policy_gradient_ops import sequence_advantage_actor_critic_loss
    ---> 31 from trfl.dist_value_ops import categorical_dist_double_qlearning
         32 from trfl.dist_value_ops import categorical_dist_qlearning
         33 from trfl.dist_value_ops import categorical_dist_td_learning
    
    /usr/local/lib/python2.7/dist-packages/trfl/dist_value_ops.py in <module>()
         31 import tensorflow as tf
         32 from trfl import base_ops
    ---> 33 from trfl import distribution_ops
         34 
         35 Extra = collections.namedtuple("dist_value_extra", ["target"])
    
    /usr/local/lib/python2.7/dist-packages/trfl/distribution_ops.py in <module>()
         28 import tensorflow as tf
         29 import tensorflow_probability as tfp
    ---> 30 from trfl import gen_distribution_ops
         31 
         32 
    
    ImportError: cannot import name gen_distribution_ops
    

    (Also, if I install trfl via pip instead of cloning from git, error messages look similar with this added on the end)

    
    /usr/local/lib/python2.7/dist-packages/trfl/gen_distribution_ops.py in <module>()
          1 import tensorflow as tf
    ----> 2 _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
          3 project_distribution = _op_lib.project_distribution
          4 del _op_lib, tf
    
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.pyc in load_op_library(library_filename)
         59     RuntimeError: when unable to load the library or get the python wrappers.
         60   """
    ---> 61   lib_handle = py_tf.TF_LoadLibrary(library_filename)
         62 
         63   op_list_str = py_tf.TF_GetOpList(lib_handle)
    
    opened by ryanprinster 19
  • import trfl not working

    I am using Spyder (Python 3.6) on Ubuntu 18.04.

    import tensorflow
    import trfl

    WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

    • https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
    • https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue.

    Traceback (most recent call last):
      File "", line 1, in <module>
        import trfl
      File "/home/dd/.local/lib/python3.6/site-packages/trfl/__init__.py", line 31, in <module>
        from trfl.dist_value_ops import categorical_dist_double_qlearning
      File "/home/dd/.local/lib/python3.6/site-packages/trfl/dist_value_ops.py", line 33, in <module>
        from trfl import distribution_ops
      File "/home/dd/.local/lib/python3.6/site-packages/trfl/distribution_ops.py", line 30, in <module>
        from trfl import gen_distribution_ops
      File "/home/dd/.local/lib/python3.6/site-packages/trfl/gen_distribution_ops.py", line 2, in <module>
        _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
      File "/home/dd/.local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
        lib_handle = py_tf.TF_LoadLibrary(library_filename)

    NotFoundError: /home/dd/.local/lib/python3.6/site-packages/trfl/_gen_distribution_ops.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl11string_viewEPFPNS_8OpKernelEPNS_20OpKernelConstructionEE

    opened by DeepakIITJ 9
  • Building on macOS

    EDIT: Fixed the issue and amended the pull request with the modifications, see https://github.com/deepmind/trfl/pull/12#issuecomment-471858294


    Background

    Due to the recent changes in the TRFL installation procedure, I ran into some issues running TRFL on macOS which broke my local TF dev environment. There were no pre-built wheels for macOS, so I proceeded to attempt to build from source with some modifications to build on macOS.

    Environment Details

    • macOS 10.14.3
    • Bazel 0.23.1
    • TensorFlow 1.12
    • TensorFlow Probability 0.5

    Changes & Issues

    The build_pip_pkg.sh script was updated to check for Darwin platforms, and instead use greadlink -f (from GNU Core Utils via Homebrew). No other changes were needed to proceed with the build.

    The build seemingly went smoothly on macOS (and reproduced on a Linux/Ubuntu machine), see the full console outputs below. However, when importing TRFL using import trfl, I get the error: AttributeError: module '29f0280e24eacea242fe31b5dab40eba' has no attribute 'L_LL_ProjectL_Distribution' (see trace below)

    I am currently unable to pinpoint the source of the problem as I'm not sure if I am missing some underlying Linux-only assumption somewhere in the build process, so any help would be really appreciated!

    Stack Trace

    Traceback (most recent call last):
      File "experiment_agent.py", line 19, in <module>
        from agents.actor import Actor
      File "/Users/Abdel/Developer/code/agents/actor.py", line 9, in <module>
        import trfl
      File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/__init__.py", line 31, in <module>
        from trfl.dist_value_ops import categorical_dist_double_qlearning
      File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/dist_value_ops.py", line 33, in <module>
        from trfl import distribution_ops
      File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/distribution_ops.py", line 30, in <module>
        from trfl import gen_distribution_ops
      File "/Users/Abdel/Developer/anaconda/lib/python3.6/site-packages/trfl/gen_distribution_ops.py", line 3, in <module>
        L_LL_ProjectL_Distribution = _op_lib.L_LL_ProjectL_Distribution
    AttributeError: module '29f0280e24eacea242fe31b5dab40eba' has no attribute 'L_LL_ProjectL_Distribution'
    

    Output of pip show -f trfl

    Name: trfl
    Version: 1.0
    Summary: trfl is a library of building blocks for reinforcement learning algorithms.
    Home-page: http://www.github.com/deepmind/trfl/
    Author: DeepMind
    Author-email: [email protected]
    License: Apache 2.0
    Location: /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages
    Requires: dm-sonnet, absl-py, six, wrapt, numpy
    Required-by:
    Files:
      trfl-1.0.dist-info/INSTALLER
      trfl-1.0.dist-info/METADATA
      trfl-1.0.dist-info/RECORD
      trfl-1.0.dist-info/WHEEL
      trfl-1.0.dist-info/top_level.txt
      trfl/__init__.py
      trfl/__pycache__/__init__.cpython-36.pyc
      trfl/__pycache__/action_value_ops.cpython-36.pyc
      trfl/__pycache__/base_ops.cpython-36.pyc
      trfl/__pycache__/clipping_ops.cpython-36.pyc
      trfl/__pycache__/discrete_policy_gradient_ops.cpython-36.pyc
      trfl/__pycache__/dist_value_ops.cpython-36.pyc
      trfl/__pycache__/distribution_ops.cpython-36.pyc
      trfl/__pycache__/dpg_ops.cpython-36.pyc
      trfl/__pycache__/gen_distribution_ops.cpython-36.pyc
      trfl/__pycache__/indexing_ops.cpython-36.pyc
      trfl/__pycache__/periodic_ops.cpython-36.pyc
      trfl/__pycache__/pixel_control_ops.cpython-36.pyc
      trfl/__pycache__/policy_gradient_ops.cpython-36.pyc
      trfl/__pycache__/retrace_ops.cpython-36.pyc
      trfl/__pycache__/sequence_ops.cpython-36.pyc
      trfl/__pycache__/target_update_ops.cpython-36.pyc
      trfl/__pycache__/value_ops.cpython-36.pyc
      trfl/__pycache__/vtrace_ops.cpython-36.pyc
      trfl/_gen_distribution_ops.so
      trfl/action_value_ops.py
      trfl/base_ops.py
      trfl/clipping_ops.py
      trfl/discrete_policy_gradient_ops.py
      trfl/dist_value_ops.py
      trfl/distribution_ops.py
      trfl/dpg_ops.py
      trfl/gen_distribution_ops.py
      trfl/indexing_ops.py
      trfl/periodic_ops.py
      trfl/pixel_control_ops.py
      trfl/policy_gradient_ops.py
      trfl/retrace_ops.py
      trfl/sequence_ops.py
      trfl/target_update_ops.py
      trfl/value_ops.py
      trfl/vtrace_ops.py
    

    Console Output from Building TRFL

    in trfl/ on master
    › ./configure.sh
    rm: .bazelrc: No such file or directory
    using installed tensorflow
    
    in trfl/ on master
    › bazel build -c opt :build_pip_pkg
    Extracting Bazel installation...
    Starting local Bazel server and connecting to it...
    WARNING: /private/var/tmp/_bazel_Abdel/75cbfa485f5681f3bc2b2ed75d0c5ecd/external/local_config_tf/BUILD:3504:1: target 'libtensorflow_framework.so' is both a rule and a file; please choose another name for the
     rule
    WARNING: /private/var/tmp/_bazel_Abdel/75cbfa485f5681f3bc2b2ed75d0c5ecd/external/local_config_tf/BUILD:5:12: in hdrs attribute of cc_library rule @local_config_tf//:tf_header_lib: file '_api_implementation.so
    ' from target '@local_config_tf//:tf_header_include' is not allowed in hdrs
    WARNING: /private/var/tmp/_bazel_Abdel/75cbfa485f5681f3bc2b2ed75d0c5ecd/external/local_config_tf/BUILD:5:12: in hdrs attribute of cc_library rule @local_config_tf//:tf_header_lib: file '_message.so' from targ
    et '@local_config_tf//:tf_header_include' is not allowed in hdrs
    INFO: Analysed target //:build_pip_pkg (18 packages loaded, 254 targets configured).
    INFO: Found 1 target...
    Target //:build_pip_pkg up-to-date:
      bazel-bin/build_pip_pkg
    INFO: Elapsed time: 31.389s, Critical Path: 18.91s
    INFO: 5 processes: 5 darwin-sandbox.
    INFO: Build completed successfully, 11 total actions
    
    in trfl/ on master
    › mkdir /tmp/trfl_wheels
    
    in trfl/ on master
    › ./bazel-bin/build_pip_pkg /tmp/trfl_wheels
    ++ uname -s
    ++ tr A-Z a-z
    + PLATFORM=darwin
    + PIP_FILE_PREFIX=bazel-bin/build_pip_pkg.runfiles/__main__/
    + main /tmp/trfl_wheels
    + [[ ! -z /tmp/trfl_wheels ]]
    + [[ /tmp/trfl_wheels == \m\a\k\e ]]
    + DEST=/tmp/trfl_wheels
    + shift
    + [[ ! -z '' ]]
    + [[ -z /tmp/trfl_wheels ]]
    + mkdir -p /tmp/trfl_wheels
    + [[ darwin == \d\a\r\w\i\n ]]
    ++ greadlink -f /tmp/trfl_wheels
    + DEST=/private/tmp/trfl_wheels
    + echo '=== destination directory: /private/tmp/trfl_wheels'
    === destination directory: /private/tmp/trfl_wheels
    ++ mktemp -d -t tmp.XXXXXXXXXX
    + TMPDIR=/var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    ++ date
    + echo Wed 6 Mar 2019 13:18:36 AEDT : '=== Using tmpdir: /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2'
    Wed 6 Mar 2019 13:18:36 AEDT : === Using tmpdir: /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    + echo '=== Copy TRFL files'
    === Copy TRFL files
    + cp bazel-bin/build_pip_pkg.runfiles/__main__/LICENSE /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    + cp bazel-bin/build_pip_pkg.runfiles/__main__/MANIFEST.in /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    + cp bazel-bin/build_pip_pkg.runfiles/__main__/setup.py /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    + rsync -avm -L '--exclude=*_test.py' bazel-bin/build_pip_pkg.runfiles/__main__/trfl /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    building file list ... done
    trfl/
    trfl/__init__.py
    trfl/_gen_distribution_ops.so
    trfl/action_value_ops.py
    trfl/base_ops.py
    trfl/clipping_ops.py
    trfl/discrete_policy_gradient_ops.py
    trfl/dist_value_ops.py
    trfl/distribution_ops.py
    trfl/dpg_ops.py
    trfl/gen_distribution_ops.py
    trfl/indexing_ops.py
    trfl/periodic_ops.py
    trfl/pixel_control_ops.py
    trfl/policy_gradient_ops.py
    trfl/retrace_ops.py
    trfl/sequence_ops.py
    trfl/target_update_ops.py
    trfl/value_ops.py
    trfl/vtrace_ops.py
    
    sent 210894 bytes  received 444 bytes  140892.00 bytes/sec
    total size is 209421  speedup is 0.99
    + pushd /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2 ~/trfl
    ++ date
    + echo Wed 6 Mar 2019 13:18:37 AEDT : '=== Building wheel'
    Wed 6 Mar 2019 13:18:37 AEDT : === Building wheel
    + python setup.py bdist_wheel
    warning: no files found matching '*.dll' under directory 'trfl/'
    warning: no files found matching '*.lib' under directory 'trfl/'
    warning: no files found matching '*.pyd' under directory 'trfl/'
    + cp dist/trfl-1.0-cp36-cp36m-macosx_10_7_x86_64.whl /private/tmp/trfl_wheels
    + popd
    ~/trfl
    + rm -rf /var/folders/17/pgf9tjwd5_j4kml9hns8qfg80000gn/T/tmp.XXXXXXXXXX.Psy2xyC2
    ++ date
    + echo Wed 6 Mar 2019 13:18:38 AEDT : '=== Output wheel file is in: /private/tmp/trfl_wheels'
    Wed 6 Mar 2019 13:18:38 AEDT : === Output wheel file is in: /private/tmp/trfl_wheels
    
    in trfl/ on master
    › pip install /tmp/trfl_wheels/*.whl
    Processing /tmp/trfl_wheels/trfl-1.0-cp36-cp36m-macosx_10_7_x86_64.whl
    Requirement already satisfied: six in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (1.11.0)
    Requirement already satisfied: dm-sonnet in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (1.27)
    Requirement already satisfied: absl-py in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (0.2.0)
    Requirement already satisfied: wrapt in /Users/Abdel/.local/lib/python3.6/site-packages (from trfl==1.0) (1.10.11)
    Requirement already satisfied: numpy in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from trfl==1.0) (1.14.5)
    Requirement already satisfied: semantic-version in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from dm-sonnet->trfl==1.0) (2.6.0)
    Requirement already satisfied: contextlib2 in /Users/Abdel/Developer/anaconda/lib/python3.6/site-packages (from dm-sonnet->trfl==1.0) (0.5.5)
    Installing collected packages: trfl
    Successfully installed trfl-1.0
    
    opened by abdel 6
  • Issue with pip install trfl on macOS

    Hello,

    I get the following error when using pip install trfl

    Could not find a version that satisfies the requirement trfl (from versions: )
    No matching distribution found for trfl

    I have tensorflow 1.13.1 & tensorflow-probability 0.60

    Do you have an idea what the issue could be? Thanks in advance for your help

    opened by HenrikMettler 3
  • Clarification of some abbreviations?

    Dear Deepminder:

    During a group meeting, I was asked a question about the meaning of some abbreviations in the TRFL demo code when I tried to introduce TRFL to my lab members, so I have to ask it here.

    It reads:

    q_tm1: the action value in the source state of a transition.
    a_tm1: the action that was selected in the source state.
    

    What does m1 mean here? I know "q" stands for action value and "t" stands for time step; I tried to figure out what "m1" stands for, but it is not so intuitive.

    Could you please help me on that? Thanks a lot.

    opened by mingyr 3
  • policy_gradient_loss batch_shape requirements

    Why does policy_gradient_ops.policy_gradient_loss require batch_shape to be a rank-2 tensor? Doesn't this limit the policy_gradient_loss operation to single univariate distributions that implement log_prob?

    For instance, consider the problem where the actions are multivariate and follow a normal distribution:

    >>> import tensorflow as tf; tf.enable_eager_execution()
    >>> import tensorflow.contrib.eager as tfe
    >>> import tensorflow_probability as tfp
    >>> loc = tfe.Variable(tf.zeros([5, 5, 2]))
    >>> policy = tfp.distributions.Normal(loc=loc, scale=1.)
    <tfp.distributions.Normal 'Normal/' batch_shape=(5, 5, 2) event_shape=() dtype=float32>
    >>> trfl.policy_gradient_loss(policy, tf.zeros([5, 5, 2]), tf.ones([5, 5]), [loc])
    Traceback (most recent call last):
      File "/trfl/policy_gradient_ops.py", line 119, in policy_gradient_loss
        policies_.batch_shape.assert_has_rank(2)
      File "/tensorflow/python/framework/tensor_shape.py", line 728, in assert_has_rank
        raise ValueError("Shape %s must have rank %d" % (self, rank))
    ValueError: Shape (5, 5, 2) must have rank 2
    

    I could understand how it is a requirement for a discrete distribution. But for the sake of supporting other distributions, it may be more structured to require the log_prob to be rank 3 and then perform a summation over the last dimension:

    >>> policy.log_prob(tf.zeros([5, 5, 2])).shape
    TensorShape([Dimension(5), Dimension(5), Dimension(2)])
    >>> tf.reduce_sum(policy.log_prob(tf.zeros([5, 5, 2])), axis=-1).shape
    TensorShape([Dimension(5), Dimension(5)])
    

    Thank you for your support and time.

    opened by wenkesj 2
  • Add/alias dpg critic update

    Hi, the DPG critic update (see Algorithm 1 of Lillicrap et al. 2016, https://arxiv.org/abs/1509.02971) is substantively the same as your td_learning function; however, this is currently obscured. I would suggest adding a dpg_qlearning function that aliases td_learning in dpg_ops.py:

    from trfl.value_ops import td_learning
    ...
    dpg_qlearning = td_learning
    

    Alternatively, one could add a comment referencing the td_learning fn in the dpg actor update fn.

    opened by spitis 2
  • Fixing documentation highlights

    Fix unhighlighted function parameters (trfl/dpg_ops.py line 61), fix mis-highlighted parameters (trfl/dpg_ops.py line 54~58), fix the unhighlighted code snippets (trfl/sequence_ops.py), fix unhighlighted shapes (everything else).

    opened by zuoanqh 2
  • Questions about retrace implementation

    Hey,

    I was looking at the retrace ops provided by trfl and there are a couple of implementation details that seem a bit confusing to me.

    1. It seems like trfl retrace drops the discount terms from the 𝔼_π Q(x_t, .) term. This is in line with the retrace formulation in Equation 13 of the MPO paper [1], but is different from Equation 4 in the original retrace paper [2]. I have included a small test case below that shows this. Is this a bug or a conscious choice? Edit: actually, it seems like at least one of the terms is included in the continuation probs.

    2. In retrace_ops._general_off_policy_corrected_multistep_target comments, it's mentioned that exp_q_t = 𝔼_π Q(x_{t+1},.) and qa_t = Q(x_t, a_t), indicating that exp_q_t should be one timestep ahead of qa_t: https://github.com/deepmind/trfl/blob/e633edbd9d326b8bebc7c7c7d53f37118b48a440/trfl/retrace_ops.py#L252-L253 However, if I understand this correctly, when those values are actually assigned, they come from the same time indices: https://github.com/deepmind/trfl/blob/e633edbd9d326b8bebc7c7c7d53f37118b48a440/trfl/retrace_ops.py#L263-L264 It's possible that the target_policy_t values that are used to index for exp_q_t somehow account for this, but I can't wrap my head around how that would do it. Am I misunderstanding something here or is it possible that these indices are actually off?

    [1] Abdolmaleki, A., Springenberg, J.T., Tassa, Y., Munos, R., Heess, N. and Riedmiller, M., 2018. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920. [2] Munos, R., Stepleton, T., Harutyunyan, A. and Bellemare, M., 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (pp. 1054-1062).

    Code related to question 1 (click to expand):

    The test case is simplified (e.g. just one action) and I have used a slightly modified version of trfl to make it compatible with tf2, but all the logic should be correct.

    import numpy as np
    import tensorflow as tf
    
    from trfl import retrace_ops
    
    
    lambda_ = 0.99
    discount = 0.9
    Q_values = np.array([
        [[2.2], [5.2]],
        [[7.2], [4.2]],
        [[3.2], [4.2]],
        [[2.2], [9.2]]], dtype=np.float32)
    target_Q_values = np.array([
        [[2.], [5.]],
        [[7.], [4.]],
        [[3.], [4.]],
        [[2.], [9.]]], dtype=np.float32)
    actions = np.array([
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0]])
    rewards = np.array([
        [1.9, 2.9],
        [3.9, 4.9],
        [5.9, 6.9],
        [np.nan, np.nan],  # nan marks entries we should never use.
    ], dtype=np.float32)
    pcontinues = np.array([
        [0.8, 0.9],
        [0.7, 0.8],
        [0.6, 0.5],
        [np.nan, np.nan]], dtype=np.float32)
    target_policy_probs = np.array([
        [[np.nan] * 1, [np.nan] * 1],
        [[1.0], [1.0]],
        [[1.0], [1.0]],
        [[1.0], [1.0]]], dtype=np.float32)
    behavior_policy_probs = np.array([
        [np.nan, np.nan],
        [1.0, 1.0],
        [1.0, 1.0],
        [1.0, 1.0]], dtype=np.float32)
    
    
    def retrace_original_v1(
            lambda_,
            discount,
            target_Q_values,
            actions,
            rewards,
            target_policy_probs,
            behavior_policy_probs):
        actions = actions[1:, ...]
        rewards = rewards[:-1, ...]
    
        target_policy_probs = target_policy_probs[1:, ...]
        behavior_policy_probs = behavior_policy_probs[1:, ...]
    
        traces = lambda_ * np.minimum(
            1.0, target_policy_probs / behavior_policy_probs[..., None])
    
        deltas = (
            rewards[..., None]
            + discount * target_Q_values[1:]
            - target_Q_values[:-1])
        retraces = []
        for i in range(tf.shape(traces)[0]):
            sum_terms = []
            for t in range(i, tf.shape(traces)[0]):
                trace = tf.reduce_prod([
                    traces[k]
                    for k in range(i + 1, t + 1)
                ], axis=0)
                sum_term = discount ** (t - i) * trace * deltas[t]
                sum_terms.append(sum_term)
    
            result = tf.reduce_sum(sum_terms, axis=0)
            retraces.append(result)
    
        retraces = tf.stack(retraces) + target_Q_values[:-1]
        return retraces
    
    
    output_original_v1 = retrace_original_v1(
        lambda_,
        1.0,
        target_Q_values,
        actions,
        rewards,
        target_policy_probs,
        behavior_policy_probs)
    print(f"output_original_v1:\n{output_original_v1.numpy().round(3)}\n")
    
    output_original_discounted_v1 = retrace_original_v1(
        lambda_,
        discount,
        target_Q_values,
        actions,
        rewards,
        target_policy_probs,
        behavior_policy_probs)
    print(f"output_original_discounted_v1:\n{output_original_discounted_v1.numpy().round(3)}\n")
    
    
    output_trfl_v1 = retrace_ops.retrace(
        lambda_,
        Q_values,
        target_Q_values,
        actions,
        rewards,
        tf.ones_like(rewards),
        target_policy_probs,
        behavior_policy_probs,
    ).extra.target[..., None]
    
    
    tf.debugging.assert_near(output_original_v1, output_trfl_v1)  # succeeds
    tf.debugging.assert_near(output_original_discounted_v1, output_trfl_v1)  # fails
    
    opened by hartikainen 1
  • How is deterministic policy gradient being evaluated?

    I cannot grasp the steps for lines 87 to 92 in trfl/blob/master/trfl/dpg_ops.py. Why is a target_a being created? The subsequent stop_gradient is understandable since we don't want to update the Q-network's trainable variables. But then, what does this loss represent in the next line? DPG to me is an application of the chain rule. How does optimizing this loss help update the network?

    I don't know if there is a better way to ask this question, as I could not contact the authors of dpg_ops.py (mainly Matteo Hessel and Miljan Martic) by any other means.
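
    A rough sketch of the surrogate-loss construction the question refers to (a paraphrase of the idea under the stated chain-rule interpretation, not necessarily the library's exact code): the squared distance to a stop-gradient target of a + dQ/da reproduces the chain-rule DPG update when differentiated with respect to the actor parameters.

    import tensorflow as tf

    def dpg_actor_loss_sketch(q_max, a_max):
        # dQ(s, a)/da, evaluated at the actor's action (q_max must be computed
        # from a_max in the same graph).
        dqda = tf.gradients([q_max], [a_max])[0]
        # Treat a + dQ/da as a constant target.
        target_a = tf.stop_gradient(a_max + dqda)
        # d/dtheta of 0.5 * (target_a - a)^2 is -dqda * da/dtheta, so minimising
        # this loss performs gradient ascent on Q, i.e. the DPG actor update.
        return 0.5 * tf.reduce_sum(tf.square(target_a - a_max), axis=-1)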

    opened by AvisekNaug 1
  • Fix legal_actions_mask bug in epsilon_greedy().

    This PR addresses Issue https://github.com/deepmind/trfl/issues/27

    Note that the bug must be addressed in two places. First, when selecting max_value - it must only be selected from legal actions. Second, when computing greedy_probs - there could be multiple action values achieving the max, but not all of them legal.
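
    A minimal sketch of that two-place masking idea (an illustrative stand-alone re-implementation under the obvious assumptions, not the PR's actual diff; it assumes at least one legal action per row):

    import tensorflow as tf

    def masked_epsilon_greedy_probs(action_values, legal_actions_mask, epsilon):
        values = tf.cast(action_values, tf.float32)
        mask = tf.cast(legal_actions_mask, tf.float32)
        # (1) Only legal actions may be the greedy choice: push illegal values to -inf.
        neg_inf = tf.fill(tf.shape(values), float("-inf"))
        masked_values = tf.where(mask > 0, values, neg_inf)
        max_value = tf.reduce_max(masked_values, axis=-1, keepdims=True)
        # (2) Spread the greedy mass only over legal actions achieving the max.
        greedy = tf.cast(tf.equal(masked_values, max_value), tf.float32)
        greedy_probs = greedy / tf.reduce_sum(greedy, axis=-1, keepdims=True)
        # Spread the epsilon mass only over legal actions.
        explore_probs = mask / tf.reduce_sum(mask, axis=-1, keepdims=True)
        return epsilon * explore_probs + (1.0 - epsilon) * greedy_probs

    For the example in the related issue below (action_values [2.0, 1.0, 1.0] with mask [0., 1., 1.] and epsilon 0.1), this yields [0., 0.5, 0.5] instead of assigning probability to the masked action.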

    Also, I added legal_actions_mask to the list of values in the tf.name_scope context manager.

    Aside from that, when there is no legal actions mask the epsilon_greedy function should execute exactly the same as before.

    Happy to implement any small changes and if there’s a better fix altogether feel free to close this and implement it internally. Just thought I’d offer up a solution :)

    opened by jhtschultz 0
  • Legal actions mask bug

    Found a bug in epsilon_greedy() in policy_ops.py when applying legal_actions_mask. It fails when masking the action with the highest action value.

    For example:

    action_values = [2.0, 1.0, 1.0]
    legal_actions_mask = [0., 1., 1.]
    epsilon = 0.1
    result = policy_ops.epsilon_greedy(action_values, epsilon, legal_actions_mask).probs
    

    Outputs: [0.9 0.05 0.05]

    opened by jhtschultz 0
  • Retrace Ops: documented return shapes

    Hi, it seems like the documented returns shapes for the following functions might be off:

    1. retrace_ops.retrace(...)
    2. retrace_ops.retrace_core(...)
    3. retrace_ops._general_off_policy_corrected_multistep_target(...)

    The first two are documented to return shape [B] and the third shape [T, B, num_actions], while they all appear to return [T, B].

    Some test code to check.

    import numpy as np
    import tensorflow as tf
    
    from trfl import retrace_ops, indexing_ops
    
    
    ### Example input data: 
    # https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops_test.py#L41
    
    lambda_ = 0.9
    qs = [
        [[2.2, 3.2, 4.2],
         [5.2, 6.2, 7.2]],
        [[7.2, 6.2, 5.2],
         [4.2, 3.2, 2.2]],
        [[3.2, 5.2, 7.2],
         [4.2, 6.2, 9.2]],
        [[2.2, 8.2, 4.2],
         [9.2, 1.2, 8.2]]
         ]
    targnet_qs = [
        [[2., 3., 4.],
         [5., 6., 7.]],
        [[7., 6., 5.],
         [4., 3., 2.]],
        [[3., 5., 7.],
         [4., 6., 9.]],
        [[2., 8., 4.],
         [9., 1., 8.]]
         ]
    actions = [
        [2, 0], 
        [1, 2], 
        [0, 1], 
        [2, 0]
        ]
    rewards = [
        [1.9, 2.9], 
        [3.9, 4.9], 
        [5.9, 6.9], 
        [np.nan, np.nan]  # nan marks entries we should never use.
        ]
    pcontinues = [
        [0.8, 0.9], 
        [0.7, 0.8], 
        [0.6, 0.5], 
        [np.nan, np.nan]
        ]
    target_policy_probs = [
        [[np.nan] * 3,
         [np.nan] * 3],
        [[0.41, 0.28, 0.31],
         [0.19, 0.77, 0.04]],
        [[0.22, 0.44, 0.34],
         [0.14, 0.25, 0.61]],
        [[0.16, 0.72, 0.12],
         [0.33, 0.30, 0.37]]
         ]
    behaviour_policy_probs = [
        [np.nan, np.nan], 
        [0.85, 0.86], 
        [0.87, 0.88], 
        [0.89, 0.84]
        ]
    
    ### Retrace Test: ###
    retrace = retrace_ops.retrace(
            lambda_, qs, targnet_qs, actions, rewards,
            pcontinues, target_policy_probs, behaviour_policy_probs)
    
    # qs: shape [(T+1), B, num_actions] 
    # https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L85
    T = len(qs) - 1  # sequence length
    B = len(qs[0])  # batch dimension
    N = len(qs[0][0])  # number of actions
    
    # loss: documented shape [B] 
    # https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L121
    tf.debugging.assert_equal(retrace.loss.shape, [T, B])  # succeeds
    
    ### Multi-step target Test: ###
    timesteps = tf.shape(qs)[0] # Batch size is qs_shape[1].
    timestep_indices_tm1 = tf.range(0, timesteps - 1)
    timestep_indices_t = tf.range(1, timesteps)
    
    target_policy_t = tf.gather(target_policy_probs, timestep_indices_t)
    behaviour_policy_t = tf.gather(behaviour_policy_probs, timestep_indices_t)
    a_t = tf.gather(actions, timestep_indices_t)
    r_t = tf.gather(rewards, timestep_indices_tm1)
    pcont_t = tf.gather(pcontinues, timestep_indices_tm1)
    targnet_q_t = tf.gather(targnet_qs, timestep_indices_t)
    
    c_t = retrace_ops._retrace_weights(
            indexing_ops.batched_index(target_policy_t, a_t),
            behaviour_policy_t) * lambda_
    
    target = retrace_ops._general_off_policy_corrected_multistep_target(
      r_t, pcont_t, target_policy_t, c_t, targnet_q_t, a_t
    )
    
    # target: documented shape [T, B, N] 
    # https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L241
    tf.debugging.assert_equal(target.shape, [T, B])  # succeeds
    

    opened by tseyde 0
  • Pre-built python 3.7 packages

    Pre-built wheel packages do not have python 3.7 --- https://pypi.org/project/trfl/#files.

    Since this library does not depend on any old Python 3 behaviors (if I am not wrong), it would be great to upload py37 packages to PyPI.

    opened by wookayin 0