Implementation of the Deep Deterministic Policy Gradient Algorithm in TensorFlow

Steven Spielberg P

Last update: Dec 7, 2022

Overview

ddpg-aigym

Deep Deterministic Policy Gradient

Implementation of the Deep Deterministic Policy Gradient algorithm (Lillicrap et al., arXiv:1509.02971) in TensorFlow

How to use

git clone https://github.com/stevenpjg/ddpg-aigym.git
cd ddpg-aigym
python main.py

During training

Once trained

Learning Curve

The learning curve for InvertedPendulum-v1 environment.

Dependencies

TensorFlow (developed with TensorFlow version 0.11.0rc0 [CPU version] [GPU version])
OpenAI Gym
MuJoCo

Features

Batch Normalization (improvement in learning speed)
Grad-inverter (described in arXiv:1511.04143)
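
As a rough illustration of the grad-inverter idea from arXiv:1511.04143, the sketch below rescales the critic's action gradient so that actions near a bound are pushed less, and actions outside the bound are pushed back. It is a minimal NumPy sketch, not the code from this repository: the function name invert_gradients and the bounds low and high are assumptions made for the example, and the gradient is assumed to be the ascent direction dQ/da.

```python
import numpy as np

def invert_gradients(grads, actions, low, high):
    """Scale dQ/da by the room left before the action hits a bound.

    Hypothetical helper: `grads` is assumed to be the ascent direction dQ/da,
    and `low`/`high` are the action bounds.
    """
    grads = np.asarray(grads, dtype=float)
    actions = np.asarray(actions, dtype=float)
    width = high - low
    # Gradient asks to increase the action: scale by the headroom above.
    up = (high - actions) / width
    # Gradient asks to decrease the action: scale by the room below.
    down = (actions - low) / width
    return np.where(grads > 0, grads * up, grads * down)

# Third action (1.5) is already above the bound, so its gradient flips sign.
inverted = invert_gradients([1.0, -1.0, 1.0], [0.5, 0.5, 1.5], low=-1.0, high=1.0)
```

With this scaling the actor does not need a squashing output layer to stay inside the action range, which is the motivation given in the referenced paper.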

Note

To use a different environment

experiment = 'InvertedPendulum-v1'  # specify the environment here

To use batch normalization

is_batch_norm = True  # batch normalization switch
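
For readers unfamiliar with what this switch enables, the following is a minimal NumPy sketch of the training-time batch normalization computation. It is not the layer used in this repository (which may, for example, keep running averages for use at inference); the function name batch_norm and its parameters are assumptions made for the example.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two state features on very different scales end up comparable.
states = np.array([[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
normalized = batch_norm(states)
```

Putting features on a common scale in this way is one plausible reason for the faster learning mentioned under Features.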

Let me know if you run into any issues or need clarification on hyperparameter tuning.

Comments

A question about the running speed of your code

Hello! I have run your code and noticed a problem. The update step that uses tf.assign seems to get slower the longer the code runs, and it becomes the bottleneck for running speed. Have you come across the same problem? If so, I look forward to a solution. Thanks a lot!

opened by zhuyifengzju 7

Error with GLEW initialization

This is the output that I got:
Creating window glfw
ERROR: GLEW initalization error: Missing GL version

My setup: Python 3.5, Ubuntu 16.04, gym from the official OpenAI GitHub repository.

opened by williamissirius 2

A question on action_gradients in critic_net_bn.py

Hi,

I just read through your DDPG implementation, and it looks awesome. Thanks for sharing!

I am confused by the following code in critic_net_bn.py: self.action_gradients = [self.act_grad_v[0]/tf.to_float(tf.shape(self.act_grad_v[0])[0])]

Why is [0] applied to self.act_grad_v, given that a batch of actions is used to compute the gradients? What is "[0]" for?

Thank you so much!

opened by pxlong 1
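
One possible reading of the quoted line, not confirmed by the author: tf.gradients returns a Python list with one tensor per input it differentiates with respect to, so if self.act_grad_v is the result of such a call (an assumption), [0] would select the single gradient tensor for the action input rather than the first sample of the batch. The division by tf.shape(...)[0] would then scale that tensor by the batch size. The NumPy sketch below mimics this with a hypothetical linear critic; critic_action_gradients and w_a are names invented for the example.

```python
import numpy as np

def critic_action_gradients(actions, w_a):
    """Mimic the list-shaped result of differentiating Q w.r.t. [actions].

    For the hypothetical linear critic Q(s, a) = s @ w_s + a @ w_a,
    the gradient dQ_i/da_i equals w_a for every sample i.
    """
    per_sample = np.tile(w_a, (actions.shape[0], 1))
    return [per_sample]  # one array per differentiated input

actions = np.zeros((4, 2))  # batch of 4 actions, action_dim = 2
w_a = np.array([0.5, -1.0])
act_grad_v = critic_action_gradients(actions, w_a)

grad = act_grad_v[0]                       # unwrap the single list entry
action_gradients = [grad / grad.shape[0]]  # divide by the batch size
```

Under this reading the whole batch is still used: act_grad_v[0] has shape (batch_size, action_dim).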

Run the code on the "Reacher" task

Hi Steven! I recently downloaded your code and tested it on the "Reacher" task. However, I found that with GPU-based TensorFlow it runs 200 episodes per day, which seems a bit slow. Is there anything I need to adjust to speed up the process? (GPU usage is low, around 3%~10%, so maybe the GPU is not being used sufficiently.) Also, you said that we could use one more wrapper to scale the reward. Can you explain that more specifically? Thanks a lot!

opened by cardwing 1
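
The reward-scaling wrapper asked about above could look roughly like the sketch below. This is an illustration only, not the wrapper the author had in mind: the class names ScaledRewardEnv and ConstantRewardEnv and the scale value are assumptions, and the environment is assumed to follow the classic Gym interface in which step returns (observation, reward, done, info).

```python
class ScaledRewardEnv:
    """Wrap an environment and multiply every reward by a constant factor."""

    def __init__(self, env, scale):
        self.env = env
        self.scale = scale

    def reset(self):
        return self.env.reset()

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        return observation, reward * self.scale, done, info


class ConstantRewardEnv:
    """Stand-in environment used only to demonstrate the wrapper."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, 10.0, False, {}


env = ScaledRewardEnv(ConstantRewardEnv(), scale=0.1)
env.reset()
_, scaled_reward, _, _ = env.step(action=None)
```

Because the wrapper exposes the same reset and step methods, the training loop can use it in place of the original environment without other changes.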

How to visualize the result with "episode_reward"

Hi Steven, my system does not have MuJoCo, so I combined your code with nrod80's code (https://github.com/nrod80/ddpg-for-openai) to build a new version. However, I could not visualize the result. Could you tell me where the visualization API is?

opened by 937552416 1

Question on Loss function of Critic Network training

Hello,

I just read through your DDPG implementation, and it looks awesome :) I have a question: what does the curve of the Q loss look like with respect to training time when you train Inverted Pendulum with DDPG? I also implemented DDPG myself, and I noticed that Inverted Pendulum did learn something, but the Q loss diverged. I wonder if you have the same issue with your implementation.

Thank you so much!

opened by RuofanKong 1

It is very slow for Pendulum-v0 of the classic control environments

I ran this code on the Pendulum-v0 environment, and it is far too slow on this particular environment, but it is considerably faster on InvertedPendulum-v1. Do you have any idea why that is?

opened by sarvghotra 1

Error when running other environments such as Reacher-v1

I tried to run this code for Reacher-v1 and Swimmer-v1, but it threw an error due to this line. ValueError: total size of new array must be unchanged

Could you please also explain why this step is needed at all for InvertedPendulum?

opened by sarvghotra 1

Need help understanding how grad-inv accelerates the learning process

I hope I am not troubling you too much by asking questions.

Could you please help me understand the idea behind the recent changes made to accelerate learning? By the way, is it converging on Reacher-v1? Could you also mention the time taken to learn and your system configuration? Also, take a look at this paper on reward scaling; in case it is not converging, that could be a reason for the divergence.

opened by sarvghotra 1

Error spotted

This line of code looks wrong (https://github.com/stevenpjg/ddpg-aigym/blob/master/critic_net.py#L84). It should use the critic model for the prediction, not the actor model.

opened by sarvghotra 1

Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm, Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP (Traveling Salesman)

Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

ppo_pytorch_cpp - an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

An implementation of the proximal policy optimization algorithm

MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification

Code for Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty

This tool converts a Nondeterministic Finite Automata (NFA) into a Deterministic Finite Automata (DFA)

RL algorithm PPO and IRL algorithm AIRL written with Tensorflow.

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

DRLib: A concise deep reinforcement learning library, integrating HER and PER for almost all off-policy RL algorithms.

This project provides a stock market environment using OpenGym with Deep Q-learning and Policy Gradient.

PGPortfolio: Policy Gradient Portfolio, the source code of "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem" (https://arxiv.org/pdf/1706.10059.pdf).

A mini library for Policy Gradients with Parameter-based Exploration, with reference implementation of the ClipUp optimizer from NNAISENSE.

PyTorch implementation of Trust Region Policy Optimization

Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286

Official Implementation of 'UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers', ICLR 2021 (spotlight)

PyTorch implementation of Constrained Policy Optimization