Soft actor-critic is a deep reinforcement learning framework for training maximum entropy policies in continuous domains.

Tuomas Haarnoja

Last update: Jan 7, 2023

Overview

This repository is no longer maintained. Please use our new Softlearning package instead.

Soft Actor-Critic

Soft actor-critic is a deep reinforcement learning framework for training maximum entropy policies in continuous domains. The algorithm is based on the paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor presented at ICML 2018.
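For orientation, the maximum entropy objective that the paper optimizes can be written as below. The symbols are not defined in this README and follow common notation: \rho_\pi is the state-action distribution induced by the policy \pi, \mathcal{H} is the entropy, and \alpha is a temperature that weights the entropy term against the reward.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting \alpha = 0 recovers the standard expected-return objective.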

This implementation uses Tensorflow. For a PyTorch implementation of soft actor-critic, take a look at rlkit by Vitchyr Pong.

See the DIAYN documentation for using SAC for learning diverse skills.

Getting Started

Soft Actor-Critic can be run either locally or through Docker.

Prerequisites

You will need to have Docker and Docker Compose installed unless you want to run the environment locally.

Most of the models require a Mujoco license.

Docker installation

If you want to run the Mujoco environments, the docker environment needs to know where to find your Mujoco license key (mjkey.txt). You can either copy your key into /.mujoco/mjkey.txt, or you can specify the path to the key in your environment variables:

export MUJOCO_LICENSE_PATH=<path_to_key_directory>/mjkey.txt

Once that's done, you can run the Docker container with

docker-compose up

Docker Compose creates a Docker container named soft-actor-critic and automatically sets the needed environment variables and volumes.

You can access the container with the typical Docker exec-command, i.e.

docker exec -it soft-actor-critic bash

See the Examples section for how to train and simulate the agents.

To clean up the setup:

docker-compose down

Local installation

To get the environment installed correctly, you will first need to clone rllab, and have its path added to your PYTHONPATH environment variable.

Clone rllab

cd <installation_path_of_your_choice>
git clone https://github.com/rll/rllab.git
cd rllab
git checkout b3a28992eca103cab3cb58363dd7a4bb07f250a0
export PYTHONPATH=$(pwd):${PYTHONPATH}

Download and copy the Mujoco files to the rllab path. If you're running on OSX, download https://www.roboti.us/download/mjpro131_osx.zip instead of the Linux archive, and copy the .dylib files instead of the .so files.

mkdir -p /tmp/mujoco_tmp && cd /tmp/mujoco_tmp
wget -P . https://www.roboti.us/download/mjpro131_linux.zip
unzip mjpro131_linux.zip
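mkdir <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/libmujoco131.so <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/libglfw.so.3 <installation_path_of_your_choice>/rllab/vendor/mujoco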
cd ..
rm -rf /tmp/mujoco_tmp

Copy your Mujoco license key (mjkey.txt) to rllab path:

cp <mujoco_key_folder>/mjkey.txt <installation_path_of_your_choice>/rllab/vendor/mujoco

Clone sac

cd <installation_path_of_your_choice>
git clone https://github.com/haarnoja/sac.git
cd sac

Create and activate conda environment

conda env create -f environment.yml
source activate sac

The environment should be ready to run. See the Examples section for how to train and simulate the agents.
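A quick way to sanity-check the local setup is to confirm that Python can locate the rllab packages before launching an experiment. The sketch below is not part of the repository; the helper name missing_modules is hypothetical, and it only checks that the top-level packages can be found, not that they import without errors.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Both `rllab` and `sandbox` are top-level packages in the rllab checkout.
    missing = missing_modules(["rllab", "sandbox"])
    if missing:
        print("Not importable:", ", ".join(missing))
        print("Check that the rllab checkout is on PYTHONPATH.")
    else:
        print("rllab and sandbox can be located.")
```

If either package is reported as missing, re-run the PYTHONPATH export from the rllab directory in the same shell session.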

Finally, to deactivate and remove the conda environment:

source deactivate
conda remove --name sac --all

Examples

Training and simulating an agent

To train the agent

python ./examples/mujoco_all_sac.py --env=swimmer --log_dir="/root/sac/data/swimmer-experiment"

To simulate the agent (NOTE: This step currently fails with the Docker installation, due to missing display.)

python ./scripts/sim_policy.py /root/sac/data/swimmer-experiment/itr_<iteration>.pkl
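To pick a snapshot for the simulation command, it can help to look up the most recent one programmatically. This is a minimal sketch, assuming the snapshots are written directly into the log directory and are named itr_ followed by an integer and the .pkl extension, as the command above suggests. The helper latest_snapshot is hypothetical and not part of the repository.

```python
import re
from pathlib import Path


def latest_snapshot(log_dir):
    """Return the itr_<N>.pkl file with the largest N in log_dir, or None."""
    pattern = re.compile(r"itr_(\d+)\.pkl$")
    best = None
    for path in Path(log_dir).glob("itr_*.pkl"):
        match = pattern.match(path.name)
        # Compare iteration numbers numerically so itr_10 beats itr_2.
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return best[1] if best else None


if __name__ == "__main__":
    print(latest_snapshot("/root/sac/data/swimmer-experiment"))
```

The numeric comparison matters because a plain alphabetical sort would rank itr_2.pkl after itr_10.pkl.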

mujoco_all_sac.py contains several different environments, and more example scripts are available in the /examples folder. For more information about the agents and configurations, run the scripts with the --help flag. For example:

python ./examples/mujoco_all_sac.py --help
usage: mujoco_all_sac.py [-h]
                         [--env {ant,walker,swimmer,half-cheetah,humanoid,hopper}]
                         [--exp_name EXP_NAME] [--mode MODE]
                         [--log_dir LOG_DIR]
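The usage message above corresponds to a small command-line interface. The sketch below shows one way such an interface could be declared with argparse. It is an illustration only: the actual script's defaults, argument types, and help strings are not shown in this README, so they are left unset here.

```python
import argparse

# Environment names copied from the usage message above.
ENVIRONMENTS = ["ant", "walker", "swimmer", "half-cheetah", "humanoid", "hopper"]


def build_parser():
    """Build a parser that accepts the four options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="mujoco_all_sac.py")
    parser.add_argument("--env", choices=ENVIRONMENTS)
    parser.add_argument("--exp_name")
    parser.add_argument("--mode")
    parser.add_argument("--log_dir")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--env=swimmer", "--log_dir=data/demo"])
    print(args.env, args.log_dir)
```

Passing an environment name outside the listed choices makes argparse exit with an error, which is why a typo in --env fails before any training starts.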

Benchmark Results

Benchmark results for some of the OpenAI Gym v2 environments can be found here.

Credits

The soft actor-critic algorithm was developed by Tuomas Haarnoja under the supervision of Prof. Sergey Levine and Prof. Pieter Abbeel at UC Berkeley. Special thanks to Vitchyr Pong, who wrote some parts of the code, and Kristian Hartikainen, who helped with testing, documenting, and polishing the code and with streamlining the installation process. The work was supported by Berkeley Deep Drive.

Reference

@inproceedings{haarnoja2017soft,
  title={Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
  author={Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey},
  booktitle={Deep Reinforcement Learning Symposium},
  year={2017}
}

Comments

NNDiscriminatorFunction error

Hi,

I was able to install and run the sample SAC code. However, while executing python examples/mujoco_all_diayn.py --env=half-cheetah --log_dir=data/demo, I got the following errors:

value_function.py", line 50, in __init__
    Parameterized.__init__(self)
NameError: name 'Parameterized' is not defined

This was resolved by adding this import to value_function.py: from sandbox.rocky.tf.core.parameterized import Parameterized. However, I'm getting another error at this point:

  File "/private/home/sramakri/Projects/diayn/rllab/scripts/run_experiment_lite.py", line 137, in <module>
    run_experiment(sys.argv)
  File "/private/home/sramakri/Projects/diayn/rllab/scripts/run_experiment_lite.py", line 121, in run_experiment
    method_call(variant_data)
  File "examples/mujoco_all_diayn.py", line 221, in run_experiment
    num_skills=variant['num_skills'],
  File "/private/home/sramakri/Projects/diayn/sac/sac/value_functions/value_function.py", line 69, in __init__
    self._output_t = self.get_output_for(*self._input_pls)
  File "/private/home/sramakri/Projects/diayn/sac/sac/misc/mlp.py", line 179, in get_output_for
    output_nonlinearity=self._output_nonlinearity,
AttributeError: 'NNDiscriminatorFunction' object has no attribute '_output_nonlinearity'

I'm not sure how to resolve this error because self._output_nonlinearity is defined for the parent class MLPFunction but not the child class NNDiscriminatorFunction, where get_output_for is called.

opened by srama2512 6

Double Q for DIAYN

Hi, Forgive me if this is already explained/implemented (part time grad student, pretty new to this):

On reading through the DIAYN code/initial reading of the paper, it seems to not use the double q that is present in SAC. What is the reason for this?

I was also surprised that it seems that DIAYN completely overrides the actor/critic training functions of SAC as opposed to extending them.

opened by josiahls 5

for discrete env

I just read the DIAYN paper and can't understand how to train DIAYN in an environment with discrete actions, given that SAC is designed for continuous environments. Yet some of the experiments in the paper are based on mountain car and inverted pendulum. Thank you.

opened by ccplxx 3

TypeError: __init__() got an unexpected keyword argument 'event_ndims'

I followed the installation instructions, ran the example command below, and got a TypeError.

python ./examples/mujoco_all_sac.py --env=swimmer

I had to change import .variants to import variants in mujoco_all_sac.py. I think this is fine because I still get variants.__file__ = '[mypath]/sac/examples/variants.py'

Then, I got this type error:

2018-07-05 17:46:10.885203 PDT | Setting seed to 5
using seed 5
WARNING:tensorflow:Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
[2018-07-05 17:46:14,736] Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
Traceback (most recent call last):
  File "/home/coline/Research2018/affordances/rllab/scripts/run_experiment_lite.py", line 137, in <module>
    run_experiment(sys.argv)
  File "/home/coline/Research2018/affordances/rllab/scripts/run_experiment_lite.py", line 121, in run_experiment
    method_call(variant_data)
  File "./examples/mujoco_all_sac.py", line 137, in run_experiment
    observations_preprocessor=observations_preprocessor)
  File "/home/coline/Research2018/affordances/sac/sac/policies/latent_space_policy.py", line 58, in __init__
    self.build()
  File "/home/coline/Research2018/affordances/sac/sac/policies/latent_space_policy.py", line 122, in build
    event_ndims=self._Da)
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 280, in __init__
    self.build()
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 311, in build
    for i in range(1, num_coupling_layers + 1)
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 311, in <listcomp>
    for i in range(1, num_coupling_layers + 1)
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 96, in __init__
    name=name)
TypeError: __init__() got an unexpected keyword argument 'event_ndims'

opened by cdevin 3

Sparse Reward Environments

Did you happen to see SAC's performance on sparse-reward environments?

I know the DIAYN paper trained on sparse rewards, but I was wondering if vanilla SAC (in your expts) had any luck solving things like Continuous MountainCar.

opened by bhairavmehta95 3

How to run DIAYN on softlearning repo?

The README of this project says that sac is no longer updated. However, softlearning is not compatible with examples/mujoco_all_diayn.py, nor does similar DIAYN training code exist in softlearning.

What can I do if I want to run DIAYN on softlearning?

opened by ZhuFengdaaa 2

Hyperparameter Advice

Hi Tuomas. I'm trying out your SAC implementation on some of the continuous gym environments and I'm curious if you have any recommendations for how to best tune the hyperparameters. Using the defaults and a temperature of 1, for instance, leads to some wildly oscillating policy performance on LunarLanderContinuous or InvertedPendulum. The policy may generate very good returns, then suddenly in the next entry in progress.csv terrible returns, and oscillates up and down without stabilizing. Does that suggest the temperature parameter needs to be tuned, or are some of the other default hyperparameters not ideal for these sorts of tasks?

An example of the episode return for lunar lander against samples:

Thanks!

opened by Random-Word 2

potential recursive call in get_actions(self, observations) from sac/policies/gmm.py

Is calling super(GMMPolicy, self).get_actions(observations) (line 158) the expected behavior here when self._is_deterministic is false as it seems to call the function itself?
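The question above comes down to how super() dispatches. The sketch below uses stand-in classes (BasePolicy and DerivedPolicy are hypothetical names, not the repository's classes) and assumes the parent class defines its own get_actions. Under that assumption, the super() call resolves to the parent's method, so it is delegation rather than recursion.

```python
class BasePolicy:
    """Stand-in parent class with its own get_actions implementation."""

    def get_actions(self, observations):
        return ["base action for {}".format(obs) for obs in observations]


class DerivedPolicy(BasePolicy):
    """Stand-in child class that overrides get_actions."""

    def __init__(self, deterministic):
        self._is_deterministic = deterministic

    def get_actions(self, observations):
        if self._is_deterministic:
            return ["deterministic action for {}".format(obs)
                    for obs in observations]
        # Resolves to BasePolicy.get_actions, not to this method again.
        return super(DerivedPolicy, self).get_actions(observations)


if __name__ == "__main__":
    print(DerivedPolicy(deterministic=False).get_actions([1, 2]))
```

If the parent class did not define get_actions at all, the same call would raise an AttributeError instead of recursing.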

opened by fangqyi 1

The comprehension of the policy limitations in SAC

I really admire the SAC algorithm you created. I have a guess about SAC's policy that I would like you to confirm:

We assume that the Q-function induces a Boltzmann distribution, but it seems difficult to implement a policy that follows a Boltzmann distribution. Therefore, in practice the policy is implemented as the most commonly used Gaussian distribution. However, when there is more than one good action in the same state, the Q-function is multimodal, and the Gaussian distribution tends to become flat and therefore weak, as described in the article https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/

Is it correct that "it is difficult to implement a policy that follows a Boltzmann distribution"? Which distributions have performed better than the Gaussian distribution? I would like to ask for your opinion.

Looking forward to your reply!

opened by GyChou 1

Inquiries about the benchmark results

I checked the benchmark results provided on GitHub and tried to plot them. However, I noticed that they are quite different from the results in the paper. Why is that?

For example, the results in the paper for the Ant environment show a final performance of almost 6000. However, the raw data given in the benchmark show only around 4000.

Thank you in advance for your time

opened by anahrendra 1

Reward scale

Some reward scaling factors can generate instabilities, as described in #9.

For alleviating this issue wouldn't it be a good idea to divide log_prob by reward_scale instead of multiplying the reward by it? Algorithmically speaking I think this would have the same effect.
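The equivalence suggested above can be checked on a toy problem. The sketch below is an illustration, not the repository's code: it assumes a single decision over three discrete actions, where the maximizer of expected reward plus temperature-weighted entropy is a Boltzmann distribution. The names boltzmann, objective, and reward_scale are chosen for this example.

```python
import math


def boltzmann(rewards, temperature):
    """Policy maximizing E[r] + temperature * entropy over a finite action set."""
    logits = [r / temperature for r in rewards]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def objective(policy, rewards, temperature):
    """Expected reward plus temperature-weighted entropy of the policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected_reward + temperature * entropy


rewards = [1.0, 0.5, -0.2]
reward_scale = 10.0

# Variant A: multiply rewards by reward_scale, keep the log-prob weight at 1.
scaled_rewards = [reward_scale * r for r in rewards]
policy_a = boltzmann(scaled_rewards, temperature=1.0)

# Variant B: keep rewards, divide the log-prob (entropy) term by reward_scale.
policy_b = boltzmann(rewards, temperature=1.0 / reward_scale)

value_a = objective(policy_a, scaled_rewards, 1.0)
value_b = objective(policy_b, rewards, 1.0 / reward_scale)

if __name__ == "__main__":
    print(policy_a, policy_b)
    print(value_a, reward_scale * value_b)
```

In this toy setting both variants give the same optimal policy, and the two objective values differ only by the factor reward_scale. That constant factor also rescales gradients, so with a fixed learning rate the training dynamics of the two variants are not necessarily identical.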

opened by lgvaz 1

TD3 vs SAC

Hi, first, thanks for sharing the repo. I am really confused by the performance comparison between SAC and TD3. In TD3's results, TD3 beats SAC in every environment evaluated with max avg. return after 1M timesteps (Table 1). However, in your SAC paper (Fig. 1) it can be observed that TD3 beats SAC in almost no environment. Is this because of different noise added in your and their experiments? Could you kindly provide some insights into this observation?

opened by HYDesmondLiu 0

DIAYN result reproduction & additional charts

Hi!

I am currently trying to verify my DIAYN implementation and I was wondering if there are any additional results available that are not provided within the original paper or the website the paper links to? More specifically, I was wondering if there are equivalents of the Figure 2 (c) training dynamics plot (page 5 of the DIAYN paper) for HalfCheetah, Hopper, Ant and other environments that are not InvertedPendulum or MountainCar?

I know that verifying DIAYN goes way beyond just looking at Training Dynamics metrics as one must also determine if the learned skills are actually diverse, but I think having the previously mentioned charts would be a great first step when testing for reproducibility.

opened by Dolokhow 0

About markovian environments

Hi, thanks for the thorough implementation and for making this code available; it really helps to understand the internal mechanisms of the SAC algorithm.

I have a question regarding the code in sac/sac/envs/gym_env.py. In the file's header, you comment: "Rllab implementation with a HACK. See comment in GymEnv.__init__().", and then in the __init__() method, you write:

# HACK: Gets rid of the TimeLimit wrapper that sets 'done = True' when
# the time limit specified for each environment has been passed and
# therefore the environment is not Markovian (terminal condition depends
# on time rather than state).

I understand the point here, but I'm not sure I followed the implementation, as it seems to be internal Gym code and is not found in the SAC code in this repository.

Can you explain exactly what you are doing with the TimeLimit wrapper? If you omit the done flag, do you still terminate the episode?

Specifically, in Gym's registration.py file the env class is wrapped with:

if env.spec.max_episode_steps is not None:
    from gym.wrappers.time_limit import TimeLimit
    env = TimeLimit(env, max_episode_steps=env.spec.max_episode_steps)

Furthermore, in the time_limit.py file:

def step(self, action):
    assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
    observation, reward, done, info = self.env.step(action)
    self._elapsed_steps += 1
    if self._elapsed_steps >= self._max_episode_steps:
        info['TimeLimit.truncated'] = not done
        done = True
    return observation, reward, done, info

If you omit these lines of code, how does the environment reset itself when the max_episode_steps limit is reached?
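One possible answer to the question above is that the code collecting rollouts enforces the episode length itself. The sketch below illustrates that idea with stand-in classes and does not use Gym. CountingEnv, the simplified TimeLimit, rollout, and the max_path_length parameter are all hypothetical, and the text here does not confirm that the repository handles termination this way.

```python
class CountingEnv:
    """Stand-in environment whose state never reaches a terminal condition."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        return self.state, 1.0, False, {}


class TimeLimit:
    """Simplified stand-in wrapper: forces done=True after max_episode_steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = None

    def reset(self):
        self._elapsed_steps = 0
        return self.env.reset()

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            done = True
        return observation, reward, done, info


def rollout(env, max_path_length):
    """Collect one path; the loop bound caps its length even if done stays False."""
    env.reset()
    dones = []
    for _ in range(max_path_length):
        _, _, done, _ = env.step(action=None)
        dones.append(done)
        if done:
            break
    return dones


wrapped = TimeLimit(CountingEnv(), max_episode_steps=5)
unwrapped = wrapped.env  # drop the wrapper, keep the inner environment

dones_wrapped = rollout(wrapped, max_path_length=5)
dones_unwrapped = rollout(unwrapped, max_path_length=5)
```

Both rollouts stop after five steps, but only the wrapped environment reports done=True on the last one. Without the wrapper, the final transition is not marked as terminal, so a learner can still bootstrap from the next state's value.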

Thanks!

Lior

opened by shanlior 4

maximization bias

Hello, I'm not sure whether this is an issue or not but I've been looking at your implementation for half an hour, and I think there might be a maximization bias in the implementation. Specifically, you used the same set of experience to update two q-tables. The paper says two independent q-tables will benefit training. I've tested my thought out on a similar code base and the owner agreed with my view so far. I've opened a stack overflow question here. Could you say something about this? I think I'll test the implementation as well. Thanks in advance.
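The concern above can be illustrated with a small simulation. This is a toy sketch under strong assumptions and is not the repository's update rule: the true value of every action is zero, each Q estimate is the true value plus Gaussian noise, and the target takes a maximum over a handful of discrete actions. It compares a single estimate, the minimum of two independent estimates, and the minimum of two identical estimates.

```python
import random


def clipped_max_value(estimates_1, estimates_2):
    """Max over actions of the element-wise minimum of two Q estimates."""
    return max(min(a, b) for a, b in zip(estimates_1, estimates_2))


def average_bias(num_actions=5, num_trials=2000, seed=0):
    """True Q is 0 for every action, so any positive average is overestimation."""
    rng = random.Random(seed)
    single = clipped_independent = clipped_identical = 0.0
    for _ in range(num_trials):
        q1 = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        single += max(q1)
        clipped_independent += clipped_max_value(q1, q2)
        # Fully correlated errors: the second estimate equals the first.
        clipped_identical += clipped_max_value(q1, q1)
    return (single / num_trials,
            clipped_independent / num_trials,
            clipped_identical / num_trials)


single_bias, independent_bias, identical_bias = average_bias()

if __name__ == "__main__":
    print(single_bias, independent_bias, identical_bias)
```

These are only the two extremes. Two networks trained on the same batches but from different initializations typically have errors that are correlated without being identical, so the reduction in bias would be expected to fall somewhere in between.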

opened by mikelty 1

a mathematical problem ..

I derived Equation 12, but the result is not the same as Equation 13 in your paper. In my derivation, I didn't get the first term in Equation 13, and I don't know where it went wrong. Can you help me?

opened by bofen97 1

what is "sandbox"

Traceback (most recent call last):
  File "/home/xtq/sac/examples/mujoco_all_sac.py", line 15, in <module>
    from sac.algos import SAC
  File "/home/xtq/sac/sac/algos/__init__.py", line 2, in <module>
    from .diayn import DIAYN
  File "/home/xtq/sac/sac/algos/diayn.py", line 10, in <module>
    from sac.policies.hierarchical_policy import FixedOptionPolicy
  File "/home/xtq/sac/sac/policies/__init__.py", line 1, in <module>
    from .nn_policy import NNPolicy
  File "/home/xtq/sac/sac/policies/nn_policy.py", line 6, in <module>
    from sandbox.rocky.tf.policies.base import Policy
ImportError: No module named 'sandbox'

opened by zienn 2
