Hi,
I am doing some work on RL and am very interested in these two algorithms. I tried to train your models on both CPU and GPU, but both runs failed with an "out of memory" error, and memory usage kept growing throughout training.
It seems that the data and/or the model from earlier steps are not being released. My code is very similar to the example:
```python
action = agent.select_action(state, ounoise, param_noise)
next_state, reward, done, info = env.step(action.cpu().numpy()[0])
total_numsteps += 1
episode_reward += reward

action = torch.Tensor(action.cpu())
mask = torch.Tensor([not done])
next_state = torch.Tensor(next_state.cpu())
reward = torch.Tensor([reward])
# pdb.set_trace()
memory.push(state, action, mask, next_state, reward)

state = next_state

if len(memory) > args.batch_size:
    for _ in range(args.updates_per_step):
        transitions = memory.sample(args.batch_size)
        batch = Transition(*zip(*transitions))

        value_loss, policy_loss = agent.update_parameters(batch)

        writer.add_scalar('loss/value', value_loss, updates)
        writer.add_scalar('loss/policy', policy_loss, updates)
        updates += 1
```
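For what it's worth, my current suspicion is that some tensor pushed into the replay buffer (e.g. the action returned by `agent.select_action`) still carries autograd history, so every stored transition keeps its whole computation graph alive. Here is a minimal, self-contained sketch of what I mean, using a hypothetical `policy` network rather than your actual agent; if this is the cause, detaching (or computing the action under `torch.no_grad()`) before `memory.push` might be the fix:

```python
import torch

policy = torch.nn.Linear(4, 2)  # stand-in for the actual actor network
state = torch.randn(1, 4)

# Plain forward pass: the result carries grad history, so storing it
# in a replay buffer would keep the whole graph (and its activations) alive.
action_with_graph = policy(state)

# Computing the action under no_grad (or calling .detach() on the result)
# produces a tensor with no graph attached, which is safe to store.
with torch.no_grad():
    action_detached = policy(state)

print(action_with_graph.requires_grad)  # True
print(action_detached.requires_grad)    # False
```

Does that sound plausible, or is the leak somewhere else (e.g. in `update_parameters`)?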
Could you please help me figure out what is not being released? Thanks in advance.