This repository is no longer maintained. Please use our new Softlearning package instead.

Soft Actor-Critic

Soft actor-critic is a deep reinforcement learning framework for training maximum entropy policies in continuous domains. The algorithm is based on the paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor presented at ICML 2018.

This implementation uses TensorFlow. For a PyTorch implementation of soft actor-critic, take a look at rlkit by Vitchyr Pong.

See the DIAYN documentation for using SAC for learning diverse skills.

Getting Started

Soft Actor-Critic can be run either locally or through Docker.

Prerequisites

You will need to have Docker and Docker Compose installed unless you want to run the environment locally.

Most of the models require a Mujoco license.

Docker installation

If you want to run the Mujoco environments, the Docker environment needs to know where to find your Mujoco license key (mjkey.txt). You can either copy your key into <path_to_this_repository>/.mujoco/mjkey.txt, or you can specify the path to the key in your environment variables:

export MUJOCO_LICENSE_PATH=<path_to_mjkey>/mjkey.txt
Once that's done, you can run the Docker container with

docker-compose up

Docker Compose creates a Docker container named soft-actor-critic and automatically sets the needed environment variables and volumes.

You can access the container with the typical docker exec command, for example

docker exec -it soft-actor-critic bash

See the Examples section below for how to train and simulate the agents.

To clean up the setup:

docker-compose down

Local installation

To get the environment installed correctly, you will first need to clone rllab and add its path to your PYTHONPATH environment variable.

  1. Clone rllab

cd <installation_path_of_your_choice>
git clone https://github.com/rll/rllab.git
cd rllab
git checkout b3a28992eca103cab3cb58363dd7a4bb07f250a0
export PYTHONPATH=$(pwd):${PYTHONPATH}

  2. Download and copy the Mujoco files to the rllab path. If you're running on OSX, download https://www.roboti.us/download/mjpro131_osx.zip instead, and copy the .dylib files instead of the .so files.

mkdir -p /tmp/mujoco_tmp && cd /tmp/mujoco_tmp
wget -P . https://www.roboti.us/download/mjpro131_linux.zip
unzip mjpro131_linux.zip
mkdir <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/libmujoco131.so <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/libglfw.so.3 <installation_path_of_your_choice>/rllab/vendor/mujoco
cd ..
rm -rf /tmp/mujoco_tmp

  3. Copy your Mujoco license key (mjkey.txt) to the rllab path:

cp <mujoco_key_folder>/mjkey.txt <installation_path_of_your_choice>/rllab/vendor/mujoco

  4. Clone sac

cd <installation_path_of_your_choice>
git clone https://github.com/haarnoja/sac.git
cd sac

  5. Create and activate the conda environment

cd sac
conda env create -f environment.yml
source activate sac

The environment should now be ready to run. See the Examples section below for how to train and simulate the agents.
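
If you want a quick sanity check that the conda environment and PYTHONPATH are wired up as described above, a minimal sketch like the following can be run from inside the activated sac environment (the key location below assumes the rllab checkout from step 3 and should be adjusted to your paths):

import importlib
import os

# Verify that the main packages used by the examples can be imported.
for module in ("tensorflow", "rllab", "sac"):
    try:
        importlib.import_module(module)
        print("ok: imported", module)
    except ImportError as exc:
        print("missing:", module, "-", exc)

# Step 3 copies the license key under rllab/vendor/mujoco; adjust the prefix
# to wherever you cloned rllab.
key_path = os.path.expanduser("~/rllab/vendor/mujoco/mjkey.txt")
print("mjkey.txt found" if os.path.exists(key_path) else "mjkey.txt not found at " + key_path)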

Finally, to deactivate and remove the conda environment:

source deactivate
conda remove --name sac --all

Examples

Training and simulating an agent

  1. To train the agent
python ./examples/mujoco_all_sac.py --env=swimmer --log_dir="/root/sac/data/swimmer-experiment"
  2. To simulate the agent (NOTE: this step currently fails with the Docker installation due to a missing display.)
python ./scripts/sim_policy.py /root/sac/data/swimmer-experiment/itr_<iteration>.pkl

mujoco_all_sac.py contains several different environments, and there are more example scripts available in the examples folder. For more information about the agents and configurations, run the scripts with the --help flag. For example:

python ./examples/mujoco_all_sac.py --help
usage: mujoco_all_sac.py [-h]
                         [--env {ant,walker,swimmer,half-cheetah,humanoid,hopper}]
                         [--exp_name EXP_NAME] [--mode MODE]
                         [--log_dir LOG_DIR]
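
For reference, the simulation step above essentially loads a saved snapshot and rolls the saved policy out in its stored environment. The sketch below approximates what ./scripts/sim_policy.py does; the snapshot keys ('policy', 'env'), the get_action() interface, and the horizon are assumptions based on typical rllab-style snapshots, not a verbatim copy of the script:

import joblib
import tensorflow as tf

snapshot_path = "data/swimmer-experiment/itr_1000.pkl"  # any saved iteration

with tf.Session():
    data = joblib.load(snapshot_path)      # restores the pickled policy and env
    policy, env = data["policy"], data["env"]

    observation = env.reset()
    total_return = 0.0
    for _ in range(1000):                  # assumed horizon
        action, _ = policy.get_action(observation)
        observation, reward, done, _ = env.step(action)
        total_return += reward
        env.render()
        if done:
            break
    print("return:", total_return)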


Benchmark Results

Benchmark results for some of the OpenAI Gym v2 environments can be found here.

Credits

The soft actor-critic algorithm was developed by Tuomas Haarnoja under the supervision of Prof. Sergey Levine and Prof. Pieter Abbeel at UC Berkeley. Special thanks to Vitchyr Pong, who wrote some parts of the code, and to Kristian Hartikainen, who helped test, document, and polish the code and streamline the installation process. The work was supported by Berkeley Deep Drive.

Reference

@inproceedings{haarnoja2017soft,
  title={Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
  author={Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey},
  booktitle={Deep Reinforcement Learning Symposium},
  year={2017}
}
Comments
  • NNDiscriminatorFunction error


    Hi,

    I was able to install and run the sample SAC code. However, while executing python examples/mujoco_all_diayn.py --env=half-cheetah --log_dir=data/demo, I got the following errors:

    value_function.py", line 50, in __init__
        Parameterized.__init__(self)
    NameError: name 'Parameterized' is not defined

    This was resolved by adding this import to value_function.py: from sandbox.rocky.tf.core.parameterized import Parameterized. However, I'm getting another error at this point:

      File "/private/home/sramakri/Projects/diayn/rllab/scripts/run_experiment_lite.py", line 137, in <module>
        run_experiment(sys.argv)
      File "/private/home/sramakri/Projects/diayn/rllab/scripts/run_experiment_lite.py", line 121, in run_experiment
        method_call(variant_data)
      File "examples/mujoco_all_diayn.py", line 221, in run_experiment
        num_skills=variant['num_skills'],
      File "/private/home/sramakri/Projects/diayn/sac/sac/value_functions/value_function.py", line 69, in __init__
        self._output_t = self.get_output_for(*self._input_pls)
      File "/private/home/sramakri/Projects/diayn/sac/sac/misc/mlp.py", line 179, in get_output_for
        output_nonlinearity=self._output_nonlinearity,
    AttributeError: 'NNDiscriminatorFunction' object has no attribute '_output_nonlinearity'
    

    I'm not sure how to resolve this error, because self._output_nonlinearity is defined for the parent class MLPFunction but not for the child class NNDiscriminatorFunction, where get_output_for is called.
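
    To see the failure mode in isolation: the parent sets the attribute in its own __init__ and reads it in a shared method, so a subclass that skips that setup but still inherits the method fails. A minimal illustration (the names mirror the issue; this is not the repository's code):

        # Parent sets _output_nonlinearity in __init__ and reads it in a shared
        # method; a child that never runs that setup hits AttributeError.
        class MLPFunctionLike:
            def __init__(self, output_nonlinearity=None):
                self._output_nonlinearity = output_nonlinearity

            def get_output_for(self):
                return self._output_nonlinearity  # fails if a subclass never set it

        class NNDiscriminatorFunctionLike(MLPFunctionLike):
            def __init__(self):
                # Does NOT call super().__init__(), so _output_nonlinearity is never set.
                pass

        try:
            NNDiscriminatorFunctionLike().get_output_for()
        except AttributeError as exc:
            print(exc)  # ... object has no attribute '_output_nonlinearity'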

    opened by srama2512 6
  • Double Q for DIAYN


    Hi, forgive me if this is already explained or implemented (part-time grad student, pretty new to this):

    On reading through the DIAYN code and an initial read of the paper, DIAYN seems not to use the double Q that is present in SAC. What is the reason for this?

    I was also surprised that it seems that DIAYN completely overrides the actor/critic training functions of SAC as opposed to extending them.
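
    For context, the "double Q" referred to above is SAC's clipped double-Q trick: two independently parameterized Q-functions are trained, and the minimum of their target estimates is used when bootstrapping, which counteracts overestimation. The exact plumbing in this repository differs (it also maintains a state value function), but a generic NumPy sketch of the core idea is:

        import numpy as np

        def soft_q_target(reward, done, q1_next, q2_next, log_pi_next,
                          gamma=0.99, alpha=0.2):
            # Clipped double-Q: bootstrap from the smaller of the two estimates,
            # with the entropy (log-probability) correction of soft Q-learning.
            min_q_next = np.minimum(q1_next, q2_next)
            soft_value_next = min_q_next - alpha * log_pi_next
            return reward + gamma * (1.0 - done) * soft_value_next

        target = soft_q_target(reward=np.array([1.0, 0.5]),
                               done=np.array([0.0, 1.0]),
                               q1_next=np.array([10.0, 3.0]),
                               q2_next=np.array([9.0, 4.0]),
                               log_pi_next=np.array([-1.0, -2.0]))
        print(target)  # bootstraps from min(q1, q2) except where done == 1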

    opened by josiahls 5
  • for discrete env


    I just read the DIAYN paper, and I can't understand how to train DIAYN in an environment with discrete actions, since SAC is for continuous environments. Yet in the paper, some experiments are based on MountainCar and InvertedPendulum. Thank you.

    opened by ccplxx 3
  • TypeError: __init__() got an unexpected keyword argument 'event_ndims'


    I followed the installation instructions, ran the example command below, and got a TypeError.

    python ./examples/mujoco_all_sac.py --env=swimmer

    I had to change import .variants to import variants in mujoco_all_sac.py. I think this is fine because I still get variants.__file__ = '[mypath]/sac/examples/variants.py'.

    Then, I got this type error:

    2018-07-05 17:46:10.885203 PDT | Setting seed to 5 using seed 5
    WARNING:tensorflow:Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
    [2018-07-05 17:46:14,736] Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
    Traceback (most recent call last):
      File "/home/coline/Research2018/affordances/rllab/scripts/run_experiment_lite.py", line 137, in <module>
        run_experiment(sys.argv)
      File "/home/coline/Research2018/affordances/rllab/scripts/run_experiment_lite.py", line 121, in run_experiment
        method_call(variant_data)
      File "./examples/mujoco_all_sac.py", line 137, in run_experiment
        observations_preprocessor=observations_preprocessor)
      File "/home/coline/Research2018/affordances/sac/sac/policies/latent_space_policy.py", line 58, in __init__
        self.build()
      File "/home/coline/Research2018/affordances/sac/sac/policies/latent_space_policy.py", line 122, in build
        event_ndims=self._Da)
      File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 280, in __init__
        self.build()
      File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 311, in build
        for i in range(1, num_coupling_layers + 1)
      File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 311, in <listcomp>
        for i in range(1, num_coupling_layers + 1)
      File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 96, in __init__
        name=name)
    TypeError: __init__() got an unexpected keyword argument 'event_ndims'

    opened by cdevin 3
  • Sparse Reward Environments


    Did you happen to see SAC's performance on sparse-reward environments?

    I know the DIAYN paper trained on sparse rewards, but I was wondering if vanilla SAC (in your expts) had any luck solving things like Continuous MountainCar.

    opened by bhairavmehta95 3
  • How to run DIAYN on softlearning repo ?


    The README of this project says that sac is no longer maintained. However, softlearning is not compatible with examples/mujoco_all_diayn.py, nor does similar DIAYN training code exist in softlearning.

    What can I do if I want to run DIAYN on softlearning?

    opened by ZhuFengdaaa 2
  • Hyperparameter Advice


    Hi Tuomas. I'm trying out your SAC implementation on some of the continuous gym environments and I'm curious if you have any recommendations for how to best tune the hyperparameters. Using the defaults and a temperature of 1, for instance, leads to some wildly oscillating policy performance on LunarLanderContinuous or InvertedPendulum. The policy may generate very good returns, then suddenly in the next entry in progress.csv terrible returns, and oscillates up and down without stabilizing. Does that suggest the temperature parameter needs to be tuned, or are some of the other default hyperparameters not ideal for these sorts of tasks?

    An example of the episode return for LunarLander against samples is attached (image: lunarlandersac).

    Thanks!

    opened by Random-Word 2
  • potential recursive call in get_actions(self, observations) from sac/policies/gmm.py


    Is calling super(GMMPolicy, self).get_actions(observations) (line 158) the expected behavior here when self._is_deterministic is false as it seems to call the function itself?
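
    For what it's worth, super(GMMPolicy, self).get_actions(observations) runs the parent class's implementation, so by itself it is not a recursive call; recursion could only arise if the parent's code dispatched back through self into a method the subclass overrides to call get_actions again. A generic illustration of that distinction (illustrative classes, not the repository's code):

        class BasePolicy:
            def get_actions(self, observations):
                # Parent implementation: delegates to per-observation get_action().
                return [self.get_action(obs) for obs in observations]

            def get_action(self, observation):
                return 0.0  # placeholder action

        class ChildPolicy(BasePolicy):
            def get_actions(self, observations):
                # Runs BasePolicy.get_actions, not ChildPolicy.get_actions.
                return super().get_actions(observations)

            def get_action(self, observation):
                # If this override called self.get_actions([observation])[0]
                # instead, the two methods would call each other forever.
                return 1.0

        print(ChildPolicy().get_actions([None, None]))  # [1.0, 1.0], no recursion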

    opened by fangqyi 1
  • The comprehension of the policy limitations in SAC


    I greatly admire the SAC you created. I have a guess about SAC's policy, and I would like your confirmation:

    • We assume that the Q function induces a Boltzmann distribution, but it seems difficult to implement a policy that follows a Boltzmann distribution, so in practice the policy is coded as the commonly used Gaussian distribution. However, when there is more than one good action in the same state, the Q function is multimodal, and the Gaussian distribution tends to become flat and therefore weak, as described in https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/

    Is my understanding correct that "it is difficult to implement a policy that follows a Boltzmann distribution"? Are there distributions that have performed better than the Gaussian? I would like to hear your opinion.
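
    To make the multimodality point concrete, a toy numeric sketch (all numbers invented): with a bimodal Q over a one-dimensional action, the Boltzmann distribution exp(Q/alpha) keeps both modes, while a single Gaussian matched to the same distribution ends up broad and centered between them.

        import numpy as np

        actions = np.linspace(-2.0, 2.0, 401)
        # Bimodal Q: two good actions, near -1 and +1.
        q_values = (np.exp(-(actions - 1.0) ** 2 / 0.05)
                    + np.exp(-(actions + 1.0) ** 2 / 0.05))

        alpha = 0.5
        boltzmann = np.exp(q_values / alpha)
        boltzmann /= boltzmann.sum()

        mean = (actions * boltzmann).sum()
        std = np.sqrt(((actions - mean) ** 2 * boltzmann).sum())
        print("Boltzmann mode near:", actions[boltzmann.argmax()])
        print("moment-matched Gaussian: mean ~ %.2f, std ~ %.2f" % (mean, std))
        # The Gaussian's mean lands near 0, between the two good actions, with a
        # large std -- the "flat and weak" behavior described above.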

    Looking forward to your reply!

    opened by GyChou 1
  • Inquiries about the benchmark results


    I checked the benchmark results provided on GitHub and tried to plot them. However, I noticed that the results are quite different from the results in the paper. Why is that?

    For example, the results in the paper for the Ant environment show a final performance of almost 6000, whereas the raw data given in the benchmark shows only around 4000.

    Thank you in advance for your time

    opened by anahrendra 1
  • Reward scale


    Some reward scaling factors can generate instabilities, as described in #9.

    To alleviate this issue, wouldn't it be a good idea to divide log_prob by reward_scale instead of multiplying the reward by it? Algorithmically speaking, I think this would have the same effect.
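
    For what it's worth, the suggested equivalence can be read off the maximum-entropy objective directly (generic notation, not tied to this repository's exact code):

        J(\pi) = \mathbb{E}\Big[\sum_t c\, r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]
               = c\, \mathbb{E}\Big[\sum_t r_t + \tfrac{\alpha}{c}\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big],

    so multiplying the reward by a scale c is the same as dividing the entropy (log-probability) term by c, up to an overall factor that rescales gradient magnitudes but not the objective's optimum.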

    opened by lgvaz 1
  • TD3 vs SAC


    Hi, first of all, thanks for sharing the repo. I am really confused by the performance comparison between SAC and TD3. In TD3's results, TD3 beats SAC in every environment evaluated by max average return after 1M timesteps (Table 1). However, in your SAC paper (Fig. 1) it can be observed that TD3 beats SAC in almost no environment. Is this because of the different noise added in your experiments and theirs? Could you kindly provide some insight into this observation?

    opened by HYDesmondLiu 0
  • DIAYN result reproduction & additional charts


    Hi!

    I am currently trying to verify my DIAYN implementation, and I was wondering if there are any additional results available that are not provided in the original paper or on the website the paper links to. More specifically, I was wondering if there are equivalents of the Figure 2(c) training-dynamics plots (page 5 of the DIAYN paper) for HalfCheetah, Hopper, Ant, and other environments that are not InvertedPendulum or MountainCar?

    I know that verifying DIAYN goes way beyond just looking at Training Dynamics metrics as one must also determine if the learned skills are actually diverse, but I think having the previously mentioned charts would be a great first step when testing for reproducibility.

    opened by Dolokhow 0
  • About markovian environments


    Hi, thanks for the thorough implementation and making this code available, it really helps to understand the internal mechanisms of the SAC algorithm.

    I have a question regarding the code in sac/sac/envs/gym_env.py. At the file's header you comment: "Rllab implementation with a HACK. See comment in GymEnv.__init__().", and then in the __init__() method, you write:

    # HACK: Gets rid of the TimeLimit wrapper that sets 'done = True' when
    # the time limit specified for each environment has been passed and
    # therefore the environment is not Markovian (terminal condition depends
    # on time rather than state).
    

    I understand the point here, but I'm not sure I follow the implementation, as this seems to be internal Gym code and is not found in the SAC code in this repository.

    Can you explain exactly what are you doing with the TimeLimit wrapper? If you omit the done flag, do you still terminate the episode?

    Specifically - in Gym's registration.py file the env class is wrapped with:

    if env.spec.max_episode_steps is not None:
        from gym.wrappers.time_limit import TimeLimit
        env = TimeLimit(env, max_episode_steps=env.spec.max_episode_steps)
    

    Furthermore, in the time_limit.py file -

    def step(self, action):
        assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
        observation, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            info['TimeLimit.truncated'] = not done
            done = True
        return observation, reward, done, info
    

    If you omit these lines of code - how does the environment resets itself when the max_episode_steps flag is raised?
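
    For context, the usual pattern behind such a hack (a sketch under assumptions, not this repository's exact code) is to peel off gym's TimeLimit wrapper and let the sampler enforce its own horizon: rollouts are still cut off after a maximum path length, but the cutoff is treated as a truncation that the critic bootstraps through, rather than as a terminal done, which keeps the termination condition Markovian.

        import gym
        from gym.wrappers import TimeLimit

        def make_unwrapped_env(env_id):
            env = gym.make(env_id)
            # Peel off TimeLimit so `done` only reflects true terminal states.
            while isinstance(env, TimeLimit):
                env = env.env
            return env

        def rollout(env, policy, max_path_length=1000):
            observation = env.reset()
            path = []
            for _ in range(max_path_length):
                action = policy(observation)
                next_observation, reward, done, info = env.step(action)
                path.append((observation, action, reward, next_observation, done))
                observation = next_observation
                if done:  # genuine environment termination only
                    break
            # Running out of steps leaves done=False on the last transition, so
            # the critic still bootstraps from the final state; the sampler then
            # simply resets the environment and starts a new episode.
            return path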

    Thanks!

    Lior

    opened by shanlior 4
  • maximization bias


    Hello, I'm not sure whether this is an issue or not, but I've been looking at your implementation for half an hour and I think there might be a maximization bias in it. Specifically, you use the same batch of experience to update both Q functions, while the paper says that two independent Q functions will benefit training. I've tested my thought on a similar code base and the owner agreed with my view so far. I've opened a Stack Overflow question here. Could you say something about this? I think I'll test the implementation as well. Thanks in advance.

    opened by mikelty 1
  • a mathematical problem ..


    I derived Equation 12, but the result is not the same as Equation 13 in your paper. In my derivation, I didn't get the first term in Equation 13, and I don't know where I went wrong. Can you help me?

    opened by bofen97 1
  • what is "sandbox"

    Traceback (most recent call last):
      File "/home/xtq/sac/examples/mujoco_all_sac.py", line 15, in <module>
        from sac.algos import SAC
      File "/home/xtq/sac/sac/algos/__init__.py", line 2, in <module>
        from .diayn import DIAYN
      File "/home/xtq/sac/sac/algos/diayn.py", line 10, in <module>
        from sac.policies.hierarchical_policy import FixedOptionPolicy
      File "/home/xtq/sac/sac/policies/__init__.py", line 1, in <module>
        from .nn_policy import NNPolicy
      File "/home/xtq/sac/sac/policies/nn_policy.py", line 6, in <module>
        from sandbox.rocky.tf.policies.base import Policy
    ImportError: No module named 'sandbox'

    opened by zienn 2