A set of Deep Reinforcement Learning Agents implemented in TensorFlow.

Overview

Deep Reinforcement Learning Agents

This repository contains a collection of reinforcement learning algorithms written in TensorFlow. The IPython notebooks here were written to accompany a still-underway tutorial series I have been publishing on Medium. If you are new to reinforcement learning, I recommend reading the accompanying post for each algorithm.

The repository currently contains the following algorithms:

  • Q-Table - An implementation of Q-learning using tables to solve a stochastic environment problem (a minimal sketch of the tabular update follows this list).
  • Q-Network - A neural-network implementation of Q-learning that solves the same environment as Q-Table.
  • Simple-Policy - An implementation of a policy-gradient method for stateless environments such as n-armed bandit problems.
  • Contextual-Policy - An implementation of a policy-gradient method for stateful environments such as contextual bandit problems.
  • Policy-Network - An implementation of a neural-network policy-gradient agent that solves full RL problems with states, delayed rewards, and two opposite actions (e.g. CartPole or Pong).
  • Vanilla-Policy - An implementation of a neural-network vanilla policy-gradient agent that solves full RL problems with states, delayed rewards, and an arbitrary number of actions.
  • Model-Network - An addition to the Policy-Network algorithm that includes a separate network which models the environment dynamics.
  • Double-Dueling-DQN - An implementation of a Deep Q-Network with the Double DQN and Dueling DQN additions to improve stability and performance.
  • Deep-Recurrent-Q-Network - An implementation of a Deep Recurrent Q-Network, which can solve reinforcement learning problems involving partial observability.
  • Q-Exploration - An implementation of DQN containing multiple action-selection strategies for exploration. Strategies include: greedy, random, e-greedy, Boltzmann, and Bayesian Dropout.
  • A3C-Doom - An implementation of the Asynchronous Advantage Actor-Critic (A3C) algorithm. It utilizes multiple agents to collectively improve a policy. This implementation can solve RL problems in 3D environments such as VizDoom challenges.
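
A minimal sketch of the tabular Q-learning update used in the Q-Table notebook (assuming the older gym API, where FrozenLake-v0 exists and env.reset() returns just a state; the learning rate, discount, and episode count are illustrative):

    import gym
    import numpy as np

    env = gym.make('FrozenLake-v0')
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    lr, gamma = 0.8, 0.95

    for episode in range(2000):
        s = env.reset()
        done = False
        while not done:
            # Greedy action with decaying random noise for exploration.
            a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1.0 / (episode + 1)))
            s1, r, done, _ = env.step(a)
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s1, a').
            Q[s, a] += lr * (r + gamma * np.max(Q[s1, :]) - Q[s, a])
            s = s1
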
Comments
  • Setting networks to be equal

    Setting networks to be equal

    The statement `updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.` uses the op list `targetOps`, which is built with tau=0.001.

    Therefore the networks are not equal after the op is executed.

    Can you confirm this issue? Does it have an impact on the results?

    Another thing: this idea of getting, after the update, a network whose weights are a convex combination of the weights of the target and the main networks seems a bit weird to me, even if it comes from a DeepMind paper. Starting from the same initialization and using very small update weights it might work, but in general it should not work at all. Interpolating between the network weights of different runs can potentially disrupt performance. (I will have a look at the paper, though I would love to hear your comments as an expert in this field.)
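
    For reference, a minimal sketch of the soft ("Polyak") target update being discussed, assuming the Double-Dueling-DQN convention that the primary-network variables come before the target-network variables in the trainable-variable list. With tau=1.0 the op is a full copy; with tau=0.001 it only nudges the target toward the primary network, so the two are indeed not equal immediately after the op runs.

    import tensorflow as tf  # TF 1.x, as used in the notebooks

    def updateTargetGraph(tfVars, tau):
        # Assumes the first half of tfVars belongs to the primary network,
        # the second half to the target network.
        total_vars = len(tfVars)
        op_holder = []
        for idx, var in enumerate(tfVars[0:total_vars // 2]):
            target_var = tfVars[idx + total_vars // 2]
            # target <- tau * primary + (1 - tau) * target
            op_holder.append(target_var.assign(tau * var.value() + (1 - tau) * target_var.value()))
        return op_holder

    def updateTarget(op_holder, sess):
        for op in op_holder:
            sess.run(op)
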

    opened by faustomilletari 6
  • Segmentation fault (core dumped)

    Segmentation fault (core dumped)

    Hi, I have read your blogs about RL and like them very much. When I tried to run A3C-Doom, I came across a Segmentation fault (core dumped) error after the terminal output "starting workers". My computer has 32 GB of memory and two E5 CPUs. The error troubles me a lot and I wonder if you can give me some advice. Thanks.

    opened by nanxintin 5
  • Something wrong with Contextual-Policy.ipynb

    Something wrong with Contextual-Policy.ipynb

    Dear Arthur,

    I am following your reinforcement learning tutorials; they are very helpful. However, when I try to run "Contextual-Policy.ipynb", I encounter some problems. Could you tell me how to solve them?


    TypeError                                 Traceback (most recent call last)
    in ()
          2
          3 cBandit = contextual_bandit() #Load the bandits.
    ----> 4 myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.
          5 weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.
          6

    in __init__(self, lr, s_size, a_size)
          4 self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
          5 state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
    ----> 6 output = slim.fully_connected(state_in_OH,a_size, biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones)
          7 self.output = tf.reshape(output,[-1])
          8 self.chosen_action = tf.argmax(self.output,0)

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
        175     current_args = current_scope[key_func].copy()
        176     current_args.update(kwargs)
    --> 177     return func(*args, **current_args)
        178   _add_op(func)
        179   setattr(func_with_args, '_key_op', _key_op(func))

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.pyc in fully_connected(inputs, num_outputs, activation_fn, normalizer_fn, normalizer_params, weights_initializer, weights_regularizer, biases_initializer, biases_regularizer, reuse, variables_collections, outputs_collections, trainable, scope)
        841         regularizer=weights_regularizer,
        842         collections=weights_collections,
    --> 843         trainable=trainable)
        844   if len(static_shape) > 2:
        845     # Reshape inputs

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
        175     current_args = current_scope[key_func].copy()
        176     current_args.update(kwargs)
    --> 177     return func(*args, **current_args)
        178   _add_op(func)
        179   setattr(func_with_args, '_key_op', _key_op(func))

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.pyc in model_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, device)
        267                         initializer=initializer, regularizer=regularizer,
        268                         trainable=trainable, collections=collections,
    --> 269                         caching_device=caching_device, device=device)
        270
        271
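
    The failing call passes tf.ones (an op that builds a tensor) as weights_initializer, which newer TensorFlow 1.x releases reject. A hedged fix, assuming that is the cause, is to use an initializer object instead (the sizes below are illustrative):

    import tensorflow as tf
    import tensorflow.contrib.slim as slim

    s_size, a_size = 3, 4  # example sizes for the contextual bandit
    state_in = tf.placeholder(shape=[1], dtype=tf.int32)
    state_in_OH = slim.one_hot_encoding(state_in, s_size)
    output = slim.fully_connected(state_in_OH, a_size,
                                  biases_initializer=None,
                                  activation_fn=tf.nn.sigmoid,
                                  weights_initializer=tf.ones_initializer())
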

    opened by liruihao 4
  • How to set hyper-parameters?

    How to set hyper-parameters? "The right recipe!"

    Hi

    @DMTSource @awjuliani

    Is there a way to set these hyper-parameters?

    • Reward value
    • Parameter Initialization method
    • LSTM length
    • Learning rate
    • Optimizer (Adam or RMSProp)
    • Gradient Clipping value

    After tens of experiments, I found that any tiny change in one of these affects the whole training dramatically, usually in a bad way.

    It is also not practical to conduct a grid search over the different parameters, because a single experiment may take hours or days and cost a lot of money.

    One trick I usually use is a large network with dropout to reduce or eliminate overfitting, but what about all of the above?

    Another trick: adjust the learning rate so that learning rate * gradient is about 1e-3 * parameter (in other words, make each parameter update around 1/1000 of the parameter value, to prevent updates that are too large or too small); see the sketch after this question.

    What do you recommend?
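
    A minimal sketch of the update-to-weight-ratio heuristic mentioned above (plain NumPy, with hypothetical variable names): monitor how large each step is relative to the current parameter values and tune the learning rate until the ratio sits near 1e-3.

    import numpy as np

    def update_to_weight_ratio(weights, gradients, learning_rate):
        """Return |lr * grad| / |w|, a rough diagnostic for choosing the learning rate."""
        update_norm = np.linalg.norm(learning_rate * gradients)
        weight_norm = np.linalg.norm(weights) + 1e-8
        return update_norm / weight_norm

    # Example: aim for a ratio around 1e-3, as suggested above.
    w = np.random.randn(256, 4)
    g = np.random.randn(256, 4) * 0.1
    print(update_to_weight_ratio(w, g, learning_rate=1e-3))
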

    opened by IbrahimSobh 3
  • # For Policy Network Problem

    # For Policy Network Problem

    Thanks for your code, but I have a question: if the rewards are negative, does the code still work? If not, how can I fix it or ensure the loss stays positive?

    opened by cumttang 3
  • Why can apply_gradients be fed entirely externally?

    Why can apply_gradients be fed entirely externally?

    I want to ask something about this part in the Policy-Network example:

    loss = -tf.reduce_mean((tf.log(input_y - probability)) * advantages) 
    newGrads = tf.gradients(loss,tvars)
    
    adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
    W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
    W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
    batchGrad = [W1Grad,W2Grad]
    updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
    

    where newGrads goes out of the graph, and after post-processing the gradients are fed back into the graph through W1Grad and W2Grad.

    I am wondering how TensorFlow knows that the variables' gradients will be supplied through batchGrad?

    I mean, if we just call apply_gradients on arbitrary placeholders that are not derived from some trainable tf.Variable, it should raise the error "No gradients provided for any variable".

    Your code works, but I am still trying to figure out how it works.
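
    For what it's worth, a minimal sketch of why this works, under the assumption that apply_gradients only needs (gradient, variable) pairs and does not care where the gradient tensors come from: feeding placeholders in place of freshly computed gradients is legal, and the "No gradients provided for any variable" error appears only when a gradient entry is None.

    import numpy as np
    import tensorflow as tf  # TF 1.x, as in the notebooks

    # One trainable variable and a placeholder standing in for its gradient.
    w = tf.Variable(tf.ones([3]), name="w")
    w_grad = tf.placeholder(tf.float32, shape=[3], name="w_grad")

    adam = tf.train.AdamOptimizer(learning_rate=0.1)
    # apply_gradients pairs each gradient tensor with its variable; the gradient can be
    # any tensor, including a placeholder fed with externally accumulated gradients.
    update = adam.apply_gradients([(w_grad, w)])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(update, feed_dict={w_grad: np.array([1.0, 2.0, 3.0], dtype=np.float32)})
        print(sess.run(w))  # w has been moved by Adam using the fed "gradient"
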

    opened by Marsan-Ma-zz 3
  • What is `grad_norms` in AC_Network?

    What is `grad_norms` in AC_Network?

    Hi,

    I came across your A3C implementation and found the following two lines in AC_network.py:

    self.var_norms = tf.global_norm(local_vars)
    grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
    

    I wonder what grad_norms is for? It seems to me that it is not used.

    Thanks!
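
    A small sketch of what tf.clip_by_global_norm returns may help: the second return value is the global norm of the gradients before clipping (in the A3C notebook it appears to be used only for TensorBoard summaries, not for training).

    import tensorflow as tf  # TF 1.x

    grads = [tf.constant([3.0, 4.0]), tf.constant([0.0, 12.0])]
    clipped, grad_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)

    with tf.Session() as sess:
        print(sess.run(grad_norm))  # 13.0 = sqrt(3^2 + 4^2 + 12^2), the norm before clipping
        print(sess.run(clipped))    # each tensor scaled by 5/13
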

    opened by yrlu 2
  • slim?

    slim?

    Vanilla-Policy.ipynb

    I am getting this when trying to run it. I guess I am missing a dependency?

    Traceback (most recent call last):
      File "agent2.py", line 68, in <module>
        myAgent = agent(lr=1e-2,s_size=5,a_size=3,h_size=10) #Load the agent.
      File "agent2.py", line 41, in __init__
        hidden = slim.fully_connected(self.state_in,h_size,biases_initializer=None,activation_fn=tf.nn.relu)
    NameError: global name 'slim' is not defined
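
    The NameError suggests the standalone script is missing the slim alias that the notebook sets up. A hedged fix for TF 1.x is to add the import at the top of agent2.py:

    import tensorflow as tf
    import tensorflow.contrib.slim as slim  # provides slim.fully_connected in TF 1.x
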

    opened by MarkFuini 2
  • Model-Network occasionally outputs unreasonably big mean reward

    Model-Network occasionally outputs unreasonably big mean reward

    I copied your code into a Python file and ran the simulation several times. Usually it works fine, but occasionally the mean reward becomes very large. Below is a copy of the output log:

     World Perf: Episode 247.000000. Reward 35.333333. action: 0.000000. mean reward 35.000038.
     World Perf: Episode 250.000000. Reward 29.333333. action: 1.000000. mean reward 34.979885.
     World Perf: Episode 253.000000. Reward 39.666667. action: 0.000000. mean reward 34.893707.
     World Perf: Episode 256.000000. Reward 21.000000. action: 1.000000. mean reward 34.590328.
     World Perf: Episode 259.000000. Reward 62.333333. action: 0.000000. mean reward 34.643253.
     World Perf: Episode 262.000000. Reward 40.666667. action: 1.000000. mean reward 34.418655.
     World Perf: Episode 265.000000. Reward 31.000000. action: 1.000000. mean reward 34.128536.
     World Perf: Episode 268.000000. Reward 25.000000. action: 1.000000. mean reward 3763953194369116274688.000000.
     World Perf: Episode 271.000000. Reward 50.333333. action: 0.000000. mean reward 3689050732741573738496.000000.
     World Perf: Episode 274.000000. Reward 20.333333. action: 0.000000. mean reward 3615638681115714125824.000000.
     World Perf: Episode 277.000000. Reward 26.666667. action: 1.000000. mean reward 3543687766093959528448.000000.
     World Perf: Episode 280.000000. Reward 44.000000. action: 0.000000. mean reward 3473168432803755327488.000000.
     World Perf: Episode 283.000000. Reward 19.000000. action: 1.000000. mean reward 3404052533747430457344.000000.
     World Perf: Episode 286.000000. Reward 59.666667. action: 1.000000. mean reward 3336311921427313852416.000000.
    

    It seems that the predicted reward of the model sometimes becomes too large. Do you know what the problem is? Is it just a case where the model failed to learn?
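
    One possible mitigation (a purely hypothetical sketch, not taken from the notebook): clip the model network's predicted rewards to a plausible range before they enter the discounted-return calculation, so a single divergent prediction cannot blow up the running mean.

    import numpy as np

    def clip_predicted_rewards(rewards, bound=50.0):
        """Clip model-predicted rewards to a plausible CartPole range (hypothetical guard)."""
        return np.clip(np.asarray(rewards, dtype=np.float32), -bound, bound)

    print(clip_predicted_rewards([12.0, 3.7e21, -4.0]))  # the exploded prediction is capped at 50.0
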

    opened by kkjh0723 2
  • Using xavier initialization on advantage/value weights improves model performance

    Using xavier initialization on advantage/value weights improves model performance

    I just found out in my tests that changing the weight initialization from random_normal to Xavier initialization improves the training process a lot.

    Using only the CPU, the original code takes about 3.5K episodes to reach a reward of ~22, which is around the maximum reward I was able to obtain reproducing the code.

    By using Xavier initialization, the code converges to the same result by episode 1K, taking < 30 minutes on my MacBook Pro using only the CPU.
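
    For reference, a hedged sketch of the kind of change being described, assuming the Double-Dueling-DQN advantage/value weight variables (names and sizes here are illustrative):

    import tensorflow as tf  # TF 1.x
    xavier_init = tf.contrib.layers.xavier_initializer()

    h_size, n_actions = 512, 4  # illustrative sizes
    # Before: AW = tf.Variable(tf.random_normal([h_size // 2, n_actions]))
    #         VW = tf.Variable(tf.random_normal([h_size // 2, 1]))
    AW = tf.Variable(xavier_init([h_size // 2, n_actions]))  # advantage-stream weights
    VW = tf.Variable(xavier_init([h_size // 2, 1]))          # value-stream weights
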

    opened by wmitsuda 1
  • A3C-Doom fixed one-hot def to work with a_size

    A3C-Doom fixed one-hot def to work with a_size

    Setting a_size to a new value would eventually create an error due to the hard-coded (line 230) one-hot array of actions. Replaced it with a NumPy identity matrix built from a_size, converted to a list to match the original format.
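
    A minimal sketch of the described change (variable name assumed to match the notebook):

    import numpy as np

    a_size = 3  # number of available actions (example value)
    # Build the one-hot action list from a_size instead of hard-coding it.
    actions = np.identity(a_size, dtype=bool).tolist()
    # e.g. [[True, False, False], [False, True, False], [False, False, True]]
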

    opened by DMTSource 1
  • Garbage code in Model-Network.ipynb

    Garbage code in Model-Network.ipynb

    In Model-Network.ipynb, the following code is unused and should be removed:

    input_data = tf.placeholder(tf.float32, [None, 5])
    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [mH, 50])
        softmax_b = tf.get_variable("softmax_b", [50])
    
    opened by hccho2 0
  • scipy.misc.imresize is removed in newer SciPy versions --> modified code

    scipy.misc.imresize is removed in newer SciPy versions --> modified code

    Before:

    b = scipy.misc.imresize(a[:,:,0],[84,84,1],interp='nearest')
    c = scipy.misc.imresize(a[:,:,1],[84,84,1],interp='nearest')
    d = scipy.misc.imresize(a[:,:,2],[84,84,1],interp='nearest')
    a = np.stack([b,c,d],axis=2)
    

    After:

    import skimage.transform  # new dependency replacing scipy.misc

    a = (skimage.transform.resize(a, [84, 84, 3], order=0) * 255).astype(np.uint8)  # order=0 ~ nearest-neighbour
    
    opened by hccho2 1
  • How to do two training calls on the same buffer

    How to do two training calls on the same buffer

    My problem is actually about creating my own buffer, but what stopped me boils down to not being able to run two training calls on the same buffer: the first one succeeds, the second one doesn't. For example:

    v_l,p_l,e_l,g_n,v_n = self.train(episode_buffer,sess,gamma,0.0)
    v_l,p_l,e_l,g_n,v_n = self.train(episode_buffer,sess,gamma,0.0)

    I get this error:

    Exception in thread Thread-144:
    Traceback (most recent call last):
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\threading.py", line 916, in _bootstrap_inner
        self.run()
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "", line 37, in <lambda>
        worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
      File "", line 314, in work
        self.train(fulllist[0],sess,gamma,0.0)
      File "", line 62, in train
        feed_dict=feed_dict)
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\site-packages\tensorflow\python\client\session.py", line 877, in run
        run_metadata_ptr)
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\site-packages\tensorflow\python\client\session.py", line 1076, in _run
        str(subfeed_t.get_shape())))
    ValueError: Cannot feed value of shape () for Tensor 'worker_0/Placeholder_1:0', which has shape '(1, 256)'

    Somehow the shape of the buffer changes, but it doesn't when I check .shape.

    opened by dark16sider 0
  • A3C Doom: Why should there be no more workers than there are threads on the CPU?

    A3C Doom: Why should there be no more workers than there are threads on the CPU?

    Hi there,

    The number of available CPU threads on my machine should be 16. However, I tested the number of workers in a CPU-only A3C-Doom run (CPU-only because I set the following before importing TensorFlow):

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    import tensorflow

    It turns out there can be more workers than 16. I'm confused about this. Do you have any idea why?
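
    For context, a hedged sketch of the usual sizing rule: the notebook (as far as I recall) derives the worker count from multiprocessing.cpu_count(). Launching more workers than hardware threads still runs, because the workers are ordinary Python threads that time-share the CPU; it just adds scheduling overhead without extra parallelism.

    import multiprocessing
    import threading

    num_workers = multiprocessing.cpu_count()  # one worker per hardware thread is the usual upper bound

    def worker_loop(worker_id):
        # Placeholder for Worker.work(...) from the A3C notebook.
        print("worker %d running" % worker_id)

    threads = [threading.Thread(target=worker_loop, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
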

    opened by ZhanPython 0
Owner
Arthur Juliani