(NeurIPS '21 Spotlight) IQ-Learn: Inverse Q-Learning for Imitation

Divyansh Garg

Last update: Dec 20, 2022

Overview

Inverse Q-Learning (IQ-Learn)

Official code base for IQ-Learn: Inverse soft-Q Learning for Imitation, NeurIPS '21 Spotlight

IQ-Learn is an easy-to-use algorithm that is a drop-in replacement for methods like Behavior Cloning and GAIL, to boost your imitation learning pipelines!
Update: IQ-Learn was recently used to create the best AI agent for playing Minecraft, placing #1 in the NeurIPS MineRL Basalt Challenge using only human demos (Overall Leaderboard Rank #2).

[Project Page]

We introduce Inverse Q-Learning (IQ-Learn), a state-of-the-art framework for Imitation Learning (IL) that directly learns soft-Q functions from expert data. IQ-Learn enables non-adversarial imitation learning and works in both offline and online IL settings. It is performant even with very sparse expert data and scales to complex image-based environments, surpassing prior methods by more than 3x. It is very simple to implement, requiring ~15 lines of code on top of existing RL methods.
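
To make "soft-Q function" concrete for discrete actions: in maximum-entropy RL, a soft-Q function induces a soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha) and a policy proportional to exp(Q(s, a) / alpha). The snippet below is a minimal sketch in plain Python, not code from this repository; the function names `soft_value` and `soft_policy` and the temperature `alpha` are illustrative assumptions.

```python
import math


def soft_value(q_values, alpha=1.0):
    """Soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def soft_policy(q_values, alpha=1.0):
    """Policy pi(a | s) = exp((Q(s, a) - V(s)) / alpha), a softmax over Q-values."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


# Hypothetical Q-values for one state with three actions.
q = [1.0, 2.0, 0.5]
pi = soft_policy(q)
```

Because the policy is read directly off the Q-values, no separate policy network is needed in the discrete-action case.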

Inverse Q-Learning is theoretically equivalent to Inverse Reinforcement Learning, i.e. learning rewards from expert data. However, it is much more powerful in practice. It admits very simple non-adversarial training and works in completely offline IL settings (without any access to the environment), greatly exceeding Behavior Cloning.

IQ-Learn is the successor to Adversarial Imitation Learning methods like GAIL (coming from the same lab).
It extends the theoretical framework for Inverse RL to non-adversarial and scalable learning, for the first time showing guaranteed convergence.
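
The link between Q-functions and rewards can be written down compactly. The following is a sketch of the objective in notation assumed here, and the paper remains the authoritative statement: \gamma is the discount factor, \rho_E the expert occupancy measure, \rho_0 the initial state distribution, \mathcal{P} the transition dynamics, and \phi a concave function determined by the choice of reward regularizer.

```latex
\begin{aligned}
V^*(s) &= \log \sum_{a} \exp Q(s, a), \\
r(s, a) &= Q(s, a) - \gamma \, \mathbb{E}_{s' \sim \mathcal{P}(\cdot \mid s, a)}\big[ V^*(s') \big], \\
\max_{Q} \; \mathcal{J}^*(Q) &= \mathbb{E}_{(s, a) \sim \rho_E}\big[ \phi\big( r(s, a) \big) \big] - (1 - \gamma) \, \mathbb{E}_{s_0 \sim \rho_0}\big[ V^*(s_0) \big].
\end{aligned}
```

Since the objective is a single maximization over Q, no discriminator or inner RL loop is involved.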

Citation

@inproceedings{garg2021iqlearn,
title={IQ-Learn: Inverse soft-Q Learning for Imitation},
author={Divyansh Garg and Shuvam Chakraborty and Chris Cundy and Jiaming Song and Stefano Ermon},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021},
url={https://openreview.net/forum?id=Aeo-xqtb5p}
}

Key Advantages

✅ Drop-in replacement for Behavior Cloning
✅ Non-adversarial online IL (Successor to GAIL & AIRL)
✅ Simple to implement
✅ Performant with very sparse data (single expert demo)
✅ Scales to Complex Image Envs (SOTA on Atari and playing Minecraft)
✅ Recover rewards from envs
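
The last item can be illustrated with a small sketch. Assuming the inverse soft Bellman relation r(s, a) = Q(s, a) - gamma * V(s') for a deterministic transition, a reward estimate can be read off a learned Q-function. The Q-table, the discount `GAMMA`, and the function names below are hypothetical and chosen only for illustration; this is not the repository's reward-recovery code.

```python
import math

GAMMA = 0.9  # assumed discount factor, for illustration only


def soft_value(q_row):
    """Soft value of a state: log sum_a exp(Q(s, a)), computed stably."""
    m = max(q_row)
    return m + math.log(sum(math.exp(q - m) for q in q_row))


def recover_reward(q_table, state, action, next_state, done, gamma=GAMMA):
    """Reward estimate r(s, a) = Q(s, a) - gamma * V(s'); V(s') is 0 at terminal steps."""
    next_v = 0.0 if done else soft_value(q_table[next_state])
    return q_table[state][action] - gamma * next_v


# Hypothetical learned Q-table: two states, two actions each.
q_table = {"s0": [0.0, 1.0], "s1": [2.0, 2.0]}
r = recover_reward(q_table, "s0", 1, "s1", done=False)
```

With stochastic dynamics, the next-state value would be an expectation over sampled next states rather than a single lookup.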

Usage

To install and use IQ-Learn, check the instructions provided in the iq_learn folder.

Imitation

Reaching human-level performance on Atari with pure imitation:

Rewards

Recovering environment rewards on GridWorld:

Questions

Please feel free to email us if you have any questions.

Div Garg ([email protected])

Comments

Issue reproducing MuJoCo results: HalfCheetah-v2

Dear author, it's an honor to see your paper and code! I am a novice in this area and am trying to reproduce your experimental results, but I have encountered some obstacles. On HalfCheetah, I don't get the 5076.6 points reported in the paper; my reward is even below 0 in most cases. The code is not modified. Is the hyperparameter setting the reason? If so, could you share your hyperparameter settings? Thanks for sharing!

opened by shuoye1000 14

How to judge the convergence

Hello!

Thanks so much for sharing the code!

I am new to inverse reinforcement learning. I am now trying to apply the code to a custom environment without knowing anything about the reward function. Are there any metrics other than rewards that can be used to judge convergence?

Thanks ;).

opened by 18627242758 9

Issue reproducing MuJoCo results

Hello! Could you provide the hyperparameters and the number of training steps for each MuJoCo env needed to reproduce the Table 5 results (Appendix D.2 in the original paper)? I tried the iq_learn/scripts/run_mujoco.sh script to train on Ant-v2 for ~300k steps with default hyperparameters and 10 expert trajectories, but only got eval returns of around 3000~4000. The eval/episode_reward shows 3301.59521 and best_returns is 4275.31665. Thank you!

opened by Ending2015a 5

Expert datasets

When I use the trajectories in iq_learn/experts/, the results are not optimal. Are these just demo data? It seems that the expert datasets from Dropbox cannot be downloaded successfully. Thanks for your help!

opened by chenmao001 4

Issue on reproducing pointmaze experiments

Hi, thanks for sharing your work.

Currently I'm trying to reproduce the results in the pointmaze environment. I am wondering why there is a negation in the visualize_reward function in vis/maze_vis.py (line 144).

Also, I would like to know whether the only_expert_state option works in the pointmaze environment. If so, is there a suitable set of hyperparameters for the pointmaze environment? Thank you!

opened by wognl0402 1

Pseudocode and questions

Hey, thanks for sharing this work! I really appreciate the in-depth, beginner-friendly blog post! I was wondering if this pseudocode was

Correct
Helpful to anyone else trying to understand the code

If not, feel free to close. But I would appreciate it if you could help me understand a few parts of the code! Thanks!

Questions

How come the environment reward env_reward is unused and reward is entirely dependent on the output of the model? Does this algorithm only learn the expert and never take into account environment reward?
Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?

Pseudocode


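import copy

import torch

GAMMA = 0.99  # discount factor; an assumption, the original sketch omitted it

def init_network(state_size, action_size):
  q_net = torch.nn.Linear(state_size, action_size)
  target_net = copy.deepcopy(q_net)
  return q_net, target_net

def episode_step(env, state, q_net, memory, expert_memory, critic_optimizer):
  # Sample an action from the softmax policy induced by the Q-values
  with torch.no_grad():
    probs = torch.softmax(q_net(state), dim=-1)
  action = torch.multinomial(probs, num_samples=1).item()
  # env.step is assumed to return (next_state, reward, done); reward is stored but unused
  next_state, reward, done = env.step(action)
  memory.append((state, next_state, action, reward, float(done)))  # memory = collections.deque
  update_critic(q_net, critic_optimizer, memory, expert_memory)
  target_net = copy.deepcopy(q_net)  # refreshed every step; not used by the loss below
  return next_state, done, target_net

def to_batch(transitions):
  # Stack (state, next_state, action, reward, done) tuples into batched tensors
  state, next_state, action, _, done = zip(*transitions)
  return (torch.stack(state), torch.stack(next_state),
          torch.tensor(action).unsqueeze(1), torch.tensor(done).unsqueeze(1))

def update_critic(q_net, critic_optimizer, memory, expert_memory):
  # The idea here is that we backprop both the rewards for the expert's actions and the agent's actions
  # the batch dimension contains examples from the agent first, then the expert
  # (expert_memory is assumed to hold tuples in the same format as memory)
  state, next_state, action, done = to_batch(list(memory) + list(expert_memory))
  where_expert = torch.arange(len(state)) >= len(memory)
  # v = soft value of the current state (logsumexp over the Q-values of all actions)
  v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
  # next_v = soft value of state(t+1)
  next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
  # q = Q-value predicted for the current (state, action) pair
  q = q_net(state).gather(1, action)
  loss = iq_loss(q, v, next_v, done, where_expert)
  critic_optimizer.zero_grad()
  loss.backward()
  critic_optimizer.step()

def iq_loss(q, v, next_v, done, where_expert):
  # Terminal transitions (done = 1) drop the next-state value, replacing the original `if done` branch
  y = (1 - done) * GAMMA * next_v
  expert_reward = (q - y)[where_expert]
  # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?
  value_loss = (v - y).mean()
  # Why is this negative?
  expert_reward_loss = -expert_reward.mean()
  loss = expert_reward_loss + value_loss
  return loss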

opened by djsamseng 0

Issue on Ant-v2 expert data and Humanoid-v2 random seed experiments

Hi~ Thank you very much for sharing your paper and source code! I am new to inverse RL and I want to implement your method on a robot.

About Ant-v2

I found that the reward for each step in your Ant-v2 expert data is 1. Why is the reward set like this? And how do I run SQIL correctly in your code?

About random seeds

I found that the results with different random seeds in the Humanoid experiments are very different; some results are around 1500 points. Is it because the number of learning steps is only 50000, or because only 1 expert demo is used?

I ran with this command:

python train_iq.py env=humanoid agent=sac expert.demos=1 method.loss=v0 method.regularize=True agent.actor_lr=3e-05 seed=0/1/2/3/4/5 agent.init_temp=1

Your work is very valuable and I look forward to your help in solving my doubts.

opened by XizoB 1

Code for gridworld experiments

Hi, thanks for making your code accessible! I was wondering if you could also share the code to reproduce the gridworld experiments, specifically Fig. 13 from your paper.

opened by HareshKarnan 2

