A set of Deep Reinforcement Learning Agents implemented in TensorFlow.

Overview

Deep Reinforcement Learning Agents

This repository contains a collection of reinforcement learning algorithms written in TensorFlow. The IPython notebooks here were written to accompany a still-underway tutorial series I have been publishing on Medium. If you are new to reinforcement learning, I recommend reading the accompanying post for each algorithm.

The repository currently contains the following algorithms:

  • Q-Table - An implementation of Q-learning using tables to solve a stochastic environment problem (a minimal sketch of the tabular update follows this list).
  • Q-Network - A neural-network implementation of Q-learning that solves the same environment as Q-Table.
  • Simple-Policy - An implementation of a policy-gradient method for stateless environments such as n-armed bandit problems.
  • Contextual-Policy - An implementation of a policy-gradient method for stateful environments such as contextual bandit problems.
  • Policy-Network - An implementation of a neural-network policy-gradient agent that solves full RL problems with states, delayed rewards, and two opposite actions (e.g. CartPole or Pong).
  • Vanilla-Policy - An implementation of a neural-network vanilla policy-gradient agent that solves full RL problems with states, delayed rewards, and an arbitrary number of actions.
  • Model-Network - An addition to the Policy-Network algorithm that includes a separate network which models the environment dynamics.
  • Double-Dueling-DQN - An implementation of a Deep Q-Network with the Double DQN and Dueling DQN additions to improve stability and performance.
  • Deep-Recurrent-Q-Network - An implementation of a Deep Recurrent Q-Network, which can solve reinforcement learning problems involving partial observability.
  • Q-Exploration - An implementation of DQN containing multiple action-selection strategies for exploration. Strategies include: greedy, random, e-greedy, Boltzmann, and Bayesian Dropout.
  • A3C-Doom - An implementation of the Asynchronous Advantage Actor-Critic (A3C) algorithm. It utilizes multiple agents to collectively improve a policy. This implementation can solve RL problems in 3D environments such as VizDoom challenges.
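
A minimal sketch of the tabular Q-learning update used in the Q-Table notebook (assuming the older gym API, where FrozenLake-v0 exists and env.reset() returns just a state; the learning rate, discount, and episode count are illustrative):

    import gym
    import numpy as np

    env = gym.make('FrozenLake-v0')
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    lr, gamma = 0.8, 0.95

    for episode in range(2000):
        s = env.reset()
        done = False
        while not done:
            # Greedy action with decaying random noise for exploration.
            a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1.0 / (episode + 1)))
            s1, r, done, _ = env.step(a)
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s1, a').
            Q[s, a] += lr * (r + gamma * np.max(Q[s1, :]) - Q[s, a])
            s = s1
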
Comments
  • Setting networks to be equal

    Setting networks to be equal

    The statement `updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.` uses the op list `targetOps`, which is built with tau=0.001.

    Therefore the networks are not equal after the op is executed.

    Can you confirm this issue? Does it have an impact on the results?

    Another thing: this idea of getting, after the update, a network whose weights are a convex combination of the weights of the target and the main networks seems a bit weird to me, even if it comes from a DeepMind paper. Starting from the same initialization and using very small update weights it might work, but in general it should not work at all. Interpolating between the network weights of different runs can potentially disrupt performance. (I will have a look at the paper, though I would love to hear your comments as an expert in this field.)
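
    For reference, a minimal sketch of the soft ("Polyak") target update being discussed, assuming the Double-Dueling-DQN convention that the primary-network variables come before the target-network variables in the trainable-variable list. With tau=1.0 the op is a full copy; with tau=0.001 it only nudges the target toward the primary network, so the two are indeed not equal immediately after the op runs.

    import tensorflow as tf  # TF 1.x, as used in the notebooks

    def updateTargetGraph(tfVars, tau):
        # Assumes the first half of tfVars belongs to the primary network,
        # the second half to the target network.
        total_vars = len(tfVars)
        op_holder = []
        for idx, var in enumerate(tfVars[0:total_vars // 2]):
            target_var = tfVars[idx + total_vars // 2]
            # target <- tau * primary + (1 - tau) * target
            op_holder.append(target_var.assign(tau * var.value() + (1 - tau) * target_var.value()))
        return op_holder

    def updateTarget(op_holder, sess):
        for op in op_holder:
            sess.run(op)
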

    opened by faustomilletari 6
  • Segmentation fault (core dumped)

    Segmentation fault (core dumped)

    Hi, I have read your blogs about RL and like them very much. When I tried to run A3C-Doom, I came across a Segmentation fault (core dumped) error after the terminal output "starting workers". My computer has 32 GB of memory and two E5 CPUs. The error troubles me a lot and I wonder if you can give me some advice. Thanks.

    opened by nanxintin 5
  • Something wrong with Contextual-Policy.ipynb

    Something wrong with Contextual-Policy.ipynb

    Dear Arthur,

    I am following your reinforcement learning tutorials; they are very helpful. However, when I try to run "Contextual-Policy.ipynb", I encounter some problems. Could you tell me how to solve them?


    TypeError                                 Traceback (most recent call last)
    in ()
          2
          3 cBandit = contextual_bandit() #Load the bandits.
    ----> 4 myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.
          5 weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.
          6

    in __init__(self, lr, s_size, a_size)
          4 self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
          5 state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
    ----> 6 output = slim.fully_connected(state_in_OH,a_size, biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones)
          7 self.output = tf.reshape(output,[-1])
          8 self.chosen_action = tf.argmax(self.output,0)

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
        175     current_args = current_scope[key_func].copy()
        176     current_args.update(kwargs)
    --> 177     return func(*args, **current_args)
        178   _add_op(func)
        179   setattr(func_with_args, '_key_op', _key_op(func))

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.pyc in fully_connected(inputs, num_outputs, activation_fn, normalizer_fn, normalizer_params, weights_initializer, weights_regularizer, biases_initializer, biases_regularizer, reuse, variables_collections, outputs_collections, trainable, scope)
        841         regularizer=weights_regularizer,
        842         collections=weights_collections,
    --> 843         trainable=trainable)
        844   if len(static_shape) > 2:
        845     # Reshape inputs

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
        175     current_args = current_scope[key_func].copy()
        176     current_args.update(kwargs)
    --> 177     return func(*args, **current_args)
        178   _add_op(func)
        179   setattr(func_with_args, '_key_op', _key_op(func))

    /home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.pyc in model_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, device)
        267                         initializer=initializer, regularizer=regularizer,
        268                         trainable=trainable, collections=collections,
    --> 269                         caching_device=caching_device, device=device)
        270
        271
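
    The failing call passes tf.ones (an op that builds a tensor) as weights_initializer, which newer TensorFlow 1.x releases reject. A hedged fix, assuming that is the cause, is to use an initializer object instead (the sizes below are illustrative):

    import tensorflow as tf
    import tensorflow.contrib.slim as slim

    s_size, a_size = 3, 4  # example sizes for the contextual bandit
    state_in = tf.placeholder(shape=[1], dtype=tf.int32)
    state_in_OH = slim.one_hot_encoding(state_in, s_size)
    output = slim.fully_connected(state_in_OH, a_size,
                                  biases_initializer=None,
                                  activation_fn=tf.nn.sigmoid,
                                  weights_initializer=tf.ones_initializer())
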

    opened by liruihao 4
  • How to set hyper-parameters?

    How to set hyper-parameters? "The right recipe!"

    Hi

    @DMTSource @awjuliani

    Is there a way to set these hyper-parameters?

    • Reward value
    • Parameter Initialization method
    • LSTM length
    • Learning rate
    • Optimizer (Adam or RMSProp)
    • Gradient Clipping value

    After tens of experiments, I found that any tiny change in one of these affects the whole training dramatically, usually in a bad way.

    It is also not practical to conduct a grid search over the different parameters, because a single experiment may take hours or days and cost a lot of money.

    One trick I usually use is a large network with dropout to reduce or eliminate overfitting, but what about all of the above?

    Another trick: adjust the learning rate so that learning rate * gradient is about 1e-3 * parameter (in other words, make each parameter update around 1/1000 of the parameter value, to prevent updates that are too large or too small); see the sketch after this question.

    What do you recommend?
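
    A minimal sketch of the update-to-weight-ratio heuristic mentioned above (plain NumPy, with hypothetical variable names): monitor how large each step is relative to the current parameter values and tune the learning rate until the ratio sits near 1e-3.

    import numpy as np

    def update_to_weight_ratio(weights, gradients, learning_rate):
        """Return |lr * grad| / |w|, a rough diagnostic for choosing the learning rate."""
        update_norm = np.linalg.norm(learning_rate * gradients)
        weight_norm = np.linalg.norm(weights) + 1e-8
        return update_norm / weight_norm

    # Example: aim for a ratio around 1e-3, as suggested above.
    w = np.random.randn(256, 4)
    g = np.random.randn(256, 4) * 0.1
    print(update_to_weight_ratio(w, g, learning_rate=1e-3))
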

    opened by IbrahimSobh 3
  • # For Policy Network Problem

    # For Policy Network Problem

    Thanks for your code, but I have a question: if the rewards are negative, does the code still work? If not, how can I fix it or ensure the loss stays positive?

    opened by cumttang 3
  • Why can apply_gradients be fed entirely externally?

    Why can apply_gradients be fed entirely externally?

    I want to ask something about this part in the Policy-Network example:

    loss = -tf.reduce_mean((tf.log(input_y - probability)) * advantages) 
    newGrads = tf.gradients(loss,tvars)
    
    adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
    W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
    W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
    batchGrad = [W1Grad,W2Grad]
    updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
    

    where newGrads goes out of the graph, and after post-processing the gradients are fed back into the graph through W1Grad and W2Grad.

    I am wondering how TensorFlow knows that the variables' gradients will be supplied through batchGrad?

    I mean, if we just call apply_gradients on arbitrary placeholders that are not derived from some trainable tf.Variable, it should raise the error "No gradients provided for any variable".

    Your code works, but I am still trying to figure out how it works.
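
    For what it's worth, a minimal sketch of why this works, under the assumption that apply_gradients only needs (gradient, variable) pairs and does not care where the gradient tensors come from: feeding placeholders in place of freshly computed gradients is legal, and the "No gradients provided for any variable" error appears only when a gradient entry is None.

    import numpy as np
    import tensorflow as tf  # TF 1.x, as in the notebooks

    # One trainable variable and a placeholder standing in for its gradient.
    w = tf.Variable(tf.ones([3]), name="w")
    w_grad = tf.placeholder(tf.float32, shape=[3], name="w_grad")

    adam = tf.train.AdamOptimizer(learning_rate=0.1)
    # apply_gradients pairs each gradient tensor with its variable; the gradient can be
    # any tensor, including a placeholder fed with externally accumulated gradients.
    update = adam.apply_gradients([(w_grad, w)])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(update, feed_dict={w_grad: np.array([1.0, 2.0, 3.0], dtype=np.float32)})
        print(sess.run(w))  # w has been moved by Adam using the fed "gradient"
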

    opened by Marsan-Ma-zz 3
  • What is `grad_norms` in AC_Network?

    What is `grad_norms` in AC_Network?

    Hi,

    I came across your A3C implementation and found the following two lines in AC_network.py:

    self.var_norms = tf.global_norm(local_vars)
    grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
    

    I wonder what grad_norms is for? It seems to me that it is not used.

    Thanks!
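
    A small sketch of what tf.clip_by_global_norm returns may help: the second return value is the global norm of the gradients before clipping (in the A3C notebook it appears to be used only for TensorBoard summaries, not for training).

    import tensorflow as tf  # TF 1.x

    grads = [tf.constant([3.0, 4.0]), tf.constant([0.0, 12.0])]
    clipped, grad_norm = tf.clip_by_global_norm(grads, clip_norm=5.0)

    with tf.Session() as sess:
        print(sess.run(grad_norm))  # 13.0 = sqrt(3^2 + 4^2 + 12^2), the norm before clipping
        print(sess.run(clipped))    # each tensor scaled by 5/13
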

    opened by yrlu 2
  • slim?

    slim?

    Vanilla-Policy.ipynb

    I am getting this when trying to run it. I guess I am missing a dependency?

    Traceback (most recent call last):
      File "agent2.py", line 68, in <module>
        myAgent = agent(lr=1e-2,s_size=5,a_size=3,h_size=10) #Load the agent.
      File "agent2.py", line 41, in __init__
        hidden = slim.fully_connected(self.state_in,h_size,biases_initializer=None,activation_fn=tf.nn.relu)
    NameError: global name 'slim' is not defined
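
    The NameError suggests the standalone script is missing the slim alias that the notebook sets up. A hedged fix for TF 1.x is to add the import at the top of agent2.py:

    import tensorflow as tf
    import tensorflow.contrib.slim as slim  # provides slim.fully_connected in TF 1.x
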

    opened by MarkFuini 2
  • Model-Network occasionally outputs unreasonably big mean reward

    Model-Network occasionally outputs unreasonably big mean reward

    I copied your code into a Python file and ran the simulation several times. Usually it works fine, but occasionally the mean reward becomes very large. Below is a copy of the output log:

     World Perf: Episode 247.000000. Reward 35.333333. action: 0.000000. mean reward 35.000038.
     World Perf: Episode 250.000000. Reward 29.333333. action: 1.000000. mean reward 34.979885.
     World Perf: Episode 253.000000. Reward 39.666667. action: 0.000000. mean reward 34.893707.
     World Perf: Episode 256.000000. Reward 21.000000. action: 1.000000. mean reward 34.590328.
     World Perf: Episode 259.000000. Reward 62.333333. action: 0.000000. mean reward 34.643253.
     World Perf: Episode 262.000000. Reward 40.666667. action: 1.000000. mean reward 34.418655.
     World Perf: Episode 265.000000. Reward 31.000000. action: 1.000000. mean reward 34.128536.
     World Perf: Episode 268.000000. Reward 25.000000. action: 1.000000. mean reward 3763953194369116274688.000000.
     World Perf: Episode 271.000000. Reward 50.333333. action: 0.000000. mean reward 3689050732741573738496.000000.
     World Perf: Episode 274.000000. Reward 20.333333. action: 0.000000. mean reward 3615638681115714125824.000000.
     World Perf: Episode 277.000000. Reward 26.666667. action: 1.000000. mean reward 3543687766093959528448.000000.
     World Perf: Episode 280.000000. Reward 44.000000. action: 0.000000. mean reward 3473168432803755327488.000000.
     World Perf: Episode 283.000000. Reward 19.000000. action: 1.000000. mean reward 3404052533747430457344.000000.
     World Perf: Episode 286.000000. Reward 59.666667. action: 1.000000. mean reward 3336311921427313852416.000000.
    

    It seems that the predicted reward of the model sometimes becomes too large. Do you know what the problem is? Is it just a case where the model failed to learn?
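
    One possible mitigation (a purely hypothetical sketch, not taken from the notebook): clip the model network's predicted rewards to a plausible range before they enter the discounted-return calculation, so a single divergent prediction cannot blow up the running mean.

    import numpy as np

    def clip_predicted_rewards(rewards, bound=50.0):
        """Clip model-predicted rewards to a plausible CartPole range (hypothetical guard)."""
        return np.clip(np.asarray(rewards, dtype=np.float32), -bound, bound)

    print(clip_predicted_rewards([12.0, 3.7e21, -4.0]))  # the exploded prediction is capped at 50.0
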

    opened by kkjh0723 2
  • Using xavier initialization on advantage/value weights improves model performance

    Using xavier initialization on advantage/value weights improves model performance

    I just found out in my tests that changing the weight initialization from random_normal to Xavier initialization improves the training process a lot.

    Using only the CPU, the original code takes about 3.5K episodes to reach a reward of ~22, which is around the maximum reward I was able to obtain reproducing the code.

    By using Xavier initialization, the code converges to the same result by episode 1K, taking < 30 minutes on my MacBook Pro using only the CPU.
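
    For reference, a hedged sketch of the kind of change being described, assuming the Double-Dueling-DQN advantage/value weight variables (names and sizes here are illustrative):

    import tensorflow as tf  # TF 1.x
    xavier_init = tf.contrib.layers.xavier_initializer()

    h_size, n_actions = 512, 4  # illustrative sizes
    # Before: AW = tf.Variable(tf.random_normal([h_size // 2, n_actions]))
    #         VW = tf.Variable(tf.random_normal([h_size // 2, 1]))
    AW = tf.Variable(xavier_init([h_size // 2, n_actions]))  # advantage-stream weights
    VW = tf.Variable(xavier_init([h_size // 2, 1]))          # value-stream weights
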

    opened by wmitsuda 1
  • A3C-Doom fixed one-hot def to work with a_size

    A3C-Doom fixed one-hot def to work with a_size

    Setting a_size to a new value would eventually create an error due to the hard-coded (line 230) one-hot array of actions. Replaced it with a NumPy identity matrix built from a_size, converted to a list to match the original format.
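
    A minimal sketch of the described change (variable name assumed to match the notebook):

    import numpy as np

    a_size = 3  # number of available actions (example value)
    # Build the one-hot action list from a_size instead of hard-coding it.
    actions = np.identity(a_size, dtype=bool).tolist()
    # e.g. [[True, False, False], [False, True, False], [False, False, True]]
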

    opened by DMTSource 1
  • Garbage code in Model-Network.ipynb

    Garbage code in Model-Network.ipynb

    In Model-Network.ipynb, the following code is unused and should be removed:

    input_data = tf.placeholder(tf.float32, [None, 5])
    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [mH, 50])
        softmax_b = tf.get_variable("softmax_b", [50])
    
    opened by hccho2 0
  • scipy.misc.imresize is removed in newer SciPy versions --> modified code

    scipy.misc.imresize is removed in newer SciPy versions --> modified code

    Before:

    b = scipy.misc.imresize(a[:,:,0],[84,84,1],interp='nearest')
    c = scipy.misc.imresize(a[:,:,1],[84,84,1],interp='nearest')
    d = scipy.misc.imresize(a[:,:,2],[84,84,1],interp='nearest')
    a = np.stack([b,c,d],axis=2)
    

    After:

    import skimage.transform  # new dependency replacing scipy.misc

    a = (skimage.transform.resize(a, [84, 84, 3], order=0) * 255).astype(np.uint8)  # order=0 ~ nearest-neighbour
    
    opened by hccho2 1
  • How to do two training calls on the same buffer

    How to do two training calls on the same buffer

    My problem is actually about creating my own buffer, but what stopped me boils down to not being able to run two training calls on the same buffer: the first one succeeds, the second one doesn't. For example:

    v_l,p_l,e_l,g_n,v_n = self.train(episode_buffer,sess,gamma,0.0)
    v_l,p_l,e_l,g_n,v_n = self.train(episode_buffer,sess,gamma,0.0)

    I get this error:

    Exception in thread Thread-144:
    Traceback (most recent call last):
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\threading.py", line 916, in _bootstrap_inner
        self.run()
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "", line 37, in <lambda>
        worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
      File "", line 314, in work
        self.train(fulllist[0],sess,gamma,0.0)
      File "", line 62, in train
        feed_dict=feed_dict)
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\site-packages\tensorflow\python\client\session.py", line 877, in run
        run_metadata_ptr)
      File "C:\Users\PC\Miniconda3\envs\nnseries\lib\site-packages\tensorflow\python\client\session.py", line 1076, in _run
        str(subfeed_t.get_shape())))
    ValueError: Cannot feed value of shape () for Tensor 'worker_0/Placeholder_1:0', which has shape '(1, 256)'

    Somehow the shape of the buffer changes, but it doesn't when I check .shape.

    opened by dark16sider 0
  • A3C Doom: Why should there be no more workers than there are threads on the CPU?

    A3C Doom: Why should there be no more workers than there are threads on the CPU?

    Hi there,

    The number of available CPU threads on my machine should be 16. However, I tested the number of workers in a CPU-only A3C-Doom run (CPU-only because I set the following before importing TensorFlow):

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    import tensorflow

    It turns out there can be more workers than 16. I'm confused about this. Do you have any idea why?
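
    For context, a hedged sketch of the usual sizing rule: the notebook (as far as I recall) derives the worker count from multiprocessing.cpu_count(). Launching more workers than hardware threads still runs, because the workers are ordinary Python threads that time-share the CPU; it just adds scheduling overhead without extra parallelism.

    import multiprocessing
    import threading

    num_workers = multiprocessing.cpu_count()  # one worker per hardware thread is the usual upper bound

    def worker_loop(worker_id):
        # Placeholder for Worker.work(...) from the A3C notebook.
        print("worker %d running" % worker_id)

    threads = [threading.Thread(target=worker_loop, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
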

    opened by ZhanPython 0
Owner
Arthur Juliani