REDQ source code
Author's PyTorch implementation of Randomized Ensembled Double Q-Learning (REDQ) algorithm. Paper link: https://arxiv.org/abs/2101.05982
Mar 23, 2021: We have reorganized the code to make it cleaner and more readable, and the first version is now released!
Mar 29, 2021: We tested the installation process and ran the code, and everything seems to be working correctly. We are now working on the implementation video tutorial, which will be released soon.
May 3, 2021: We uploaded a video tutorial (shared via Google Drive); please see the link below. Hope it helps!
Code for REDQ-OFE is still being cleaned up and will be released soon (essentially the same code but with additional input from an OFENet).
Code structure explained
The code structure is pretty simple and should be easy to follow.
In experiments/train_redq_sac.py you will find the main training loop. Here we set up the environment, initialize an instance of the REDQSACAgent class with all the hyperparameters, and train the agent. You can run this file to train a REDQ agent.
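For orientation, here is a minimal, self-contained sketch of the overall shape of such a loop. The RandomAgent class below is a stand-in invented for this example so the snippet runs on its own; the real script uses REDQSACAgent, whose actual method names may differ.

import gym

class RandomAgent:
    """Stand-in for REDQSACAgent, only so this sketch is self-contained."""
    def __init__(self, env):
        self.env = env
    def get_action(self, obs):
        return self.env.action_space.sample()
    def store_data(self, *transition):
        pass  # the real agent would push the transition into its replay buffer
    def train(self):
        pass  # the real agent would perform REDQ gradient updates here

env = gym.make('Hopper-v2')
agent = RandomAgent(env)
obs = env.reset()
for t in range(1000):
    act = agent.get_action(obs)
    next_obs, rew, done, _ = env.step(act)
    agent.store_data(obs, act, rew, next_obs, done)
    agent.train()
    obs = env.reset() if done else next_obs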
In redq/algos/redq_sac.py we provide the code for the REDQSACAgent class. If you want to see how the core components of REDQ are implemented, the most important function is train().
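The heart of that function is REDQ's in-target minimization: sample M critics out of an ensemble of N, take the elementwise minimum of their target values to form a single shared backup, and regress all N critics toward it. Below is a minimal sketch of just that computation, with dummy tensors standing in for a replay-buffer sample and for the policy's outputs; the real train() additionally handles target networks with polyak averaging, entropy-coefficient tuning, and the policy update (which uses the average of all N critics).

import random
import torch
import torch.nn as nn

N, M = 10, 2              # ensemble size and subset size (paper defaults)
gamma, alpha = 0.99, 0.2  # discount and entropy coefficient
obs_dim, act_dim, batch = 11, 3, 256

def make_q():
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

q_nets = [make_q() for _ in range(N)]
q_targets = [make_q() for _ in range(N)]  # in practice: polyak-averaged copies

# Dummy minibatch standing in for a replay-buffer sample.
o = torch.randn(batch, obs_dim); a = torch.randn(batch, act_dim)
r = torch.randn(batch, 1); o2 = torch.randn(batch, obs_dim)
d = torch.zeros(batch, 1)
a2 = torch.randn(batch, act_dim)  # would come from the current policy
logp_a2 = torch.randn(batch, 1)   # log-prob of a2 under the current policy

# Sample M of the N target critics and take the elementwise min of their values.
idxs = random.sample(range(N), M)
with torch.no_grad():
    qs = torch.cat([q_targets[i](torch.cat([o2, a2], 1)) for i in idxs], 1)
    min_q = qs.min(dim=1, keepdim=True).values
    backup = r + gamma * (1 - d) * (min_q - alpha * logp_a2)

# Every one of the N critics regresses toward the same shared target.
critic_loss = sum(((q(torch.cat([o, a], 1)) - backup) ** 2).mean() for q in q_nets)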
In redq/algos/core.py we provide code for some basic classes (Q network, policy network, replay buffer) and some helper functions. These classes and functions are used by the REDQ agent class.
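As an illustration of what lives there, below is a minimal replay buffer in the style of OpenAI Spinning Up (which this code is partly based on; see the acknowledgement). The actual class in redq/algos/core.py may differ in its details.

import numpy as np

class ReplayBuffer:
    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, o, a, r, o2, d):
        # Overwrite the oldest entry once the buffer is full.
        self.obs[self.ptr], self.acts[self.ptr] = o, a
        self.rews[self.ptr], self.next_obs[self.ptr], self.done[self.ptr] = r, o2, d
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size):
        idxs = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs[idxs], acts=self.acts[idxs], rews=self.rews[idxs],
                    next_obs=self.next_obs[idxs], done=self.done[idxs])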
In redq/utils there are some utility classes (such as a logger) and helper functions that mostly have nothing to do with REDQ's core components.
Implementation video tutorial
Here is the link to a video tutorial we created that explains the REDQ implementation in detail:
REDQ code explained video tutorial (Google Drive Link)
Environment setup
Note: you don't need to follow this guide exactly if you are familiar with installing Python packages.
First create a conda environment and activate it:
conda create -n redq python=3.6
conda activate redq
Install PyTorch (or follow the tutorial on the official PyTorch website). On Ubuntu (this might also work on Windows but has not been fully tested):
conda install pytorch==1.3.1 torchvision==0.4.2 cudatoolkit=10.1 -c pytorch
On OSX:
conda install pytorch==1.3.1 torchvision==0.4.2 -c pytorch
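To quickly confirm that the intended version was installed, run the following in Python:

import torch
print(torch.__version__)          # should print 1.3.1
print(torch.cuda.is_available())  # True on the CUDA build if a GPU is visible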
Install gym (0.17.2):
git clone https://github.com/openai/gym.git
cd gym
git checkout b2727d6
pip install -e .
cd ..
Install mujoco_py (2.0.2.1):
git clone https://github.com/openai/mujoco-py
cd mujoco-py
git checkout 379bb19
pip install -e . --no-cache
cd ..
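As a quick sanity check that both packages work (this requires the MuJoCo files and license mentioned below, and triggers mujoco_py's one-time compilation):

import gym
print(gym.__version__)       # should print 0.17.2
env = gym.make('Hopper-v2')  # needs MuJoCo binaries and a valid license
print(env.reset().shape)     # (11,) for Hopper-v2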
For gym and mujoco_py, depending on your system, you might need to install some other packages; if you run into such problems, please refer to their official sites for guidance. If you want to test on MuJoCo environments, you will also need to obtain the MuJoCo files and a license from the MuJoCo website; please refer to that site for how to set them up correctly.
Clone and install this repository (although you might still be able to use the code even without installing it):
git clone https://github.com/watchernyu/REDQ.git
cd REDQ
pip install -e .
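If the editable install succeeded, the agent class described above should now be importable from anywhere:

from redq.algos.redq_sac import REDQSACAgent
print(REDQSACAgent)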
Train a REDQ agent
To train a REDQ agent, run:
python experiments/train_redq_sac.py
On a 2080 Ti GPU, running Hopper to 125K environment interactions takes approximately 10-12 hours. Running Humanoid to 300K takes approximately 26 hours.
Implement REDQ
If you intend to implement REDQ in your own codebase, please refer to the paper and the video tutorial above for guidance. In particular, in Appendix B of the paper, we discuss hyperparameters and some additional implementation details. One important detail: at the beginning of training, for the first 5000 data points, we sample random actions from the action space and do not perform any updates. Performing a large number of updates with a very small amount of data can lead to severe bias accumulation and can negatively affect performance.
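Here is a toy, runnable illustration of that warm-up rule; policy and update are deliberately trivial stand-ins for the real agent's methods, and the update-to-data ratio G = 20 is the paper's default for the MuJoCo experiments.

import gym

start_steps = 5000  # sample random actions and do no updates before this
utd_ratio = 20      # G: number of updates per environment step (paper default)

env = gym.make('Hopper-v2')
policy = lambda o: env.action_space.sample()  # stand-in for the learned policy
update = lambda: None                         # stand-in for agent.train()

obs = env.reset()
for t in range(start_steps + 1000):
    act = env.action_space.sample() if t < start_steps else policy(obs)
    obs, rew, done, _ = env.step(act)
    if done:
        obs = env.reset()
    if t >= start_steps:  # only update once enough data has been collected
        for _ in range(utd_ratio):
            update()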
For REDQ-OFE, as mentioned in the paper, adding PyTorch batch norm to OFENet leads to divergence for reasons that are unclear, so in the end we did not use batch norm in our code.
Reproduce the results
If you use a different PyTorch version, the code might still work, but it is better to keep your version close to the ones we used. We have found, for example, that on the Ant environment PyTorch 1.3 and 1.2 give quite different results; the reason is not entirely clear.
Other factors, such as the versions of other packages (for example numpy), the environment versions (mujoco/gym), or even the type of hardware (CPU/GPU), can also affect the final results, so reproducing exactly the same numbers can be difficult. However, if the package versions are the same, the overall performance, averaged over a large number of random seeds, should be similar to that reported in the paper.
As of Mar. 29, 2021, we have used the installation guide on this page to set up a fresh conda environment and run the code hosted in this repo, and the reproduced results are similar to those in the paper (though not exactly the same: in some environments performance is a bit stronger, in others a bit weaker).
Please open an issue if you find any problems in the code, thanks!
Acknowledgement
Our code for REDQ-SAC is partly based on the SAC implementation in OpenAI Spinup (https://github.com/openai/spinningup). The current code structure is inspired by the super clean TD3 source code by Scott Fujimoto (https://github.com/sfujim/TD3).