(NeurIPS '21 Spotlight) IQ-Learn: Inverse Q-Learning for Imitation

Overview

Inverse Q-Learning (IQ-Learn)

Official code base for IQ-Learn: Inverse soft-Q Learning for Imitation, NeurIPS '21 Spotlight

IQ-Learn is an easy-to-use algorithm and a drop-in replacement for methods like Behavior Cloning and GAIL, to boost your imitation learning pipelines!
Update: IQ-Learn was recently used to create the best AI agent for playing Minecraft, placing #1 in the NeurIPS MineRL BASALT Challenge using only human demos (overall leaderboard rank #2).

[Project Page]

We introduce Inverse Q-Learning (IQ-Learn), a novel state-of-the-art framework for Imitation Learning (IL) that directly learns soft-Q functions from expert data. IQ-Learn enables non-adversarial imitation learning and works in both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than 3x. It is very simple to implement, requiring ~15 lines of code on top of existing RL methods.

Inverse Q-Learning is theoretically equivalent to Inverse Reinforcement Learning, i.e. learning rewards from expert data. However, it is much more powerful in practice. It admits very simple non-adversarial training and works in the completely offline IL setting (without any access to the environment), greatly exceeding Behavior Cloning.

IQ-Learn is the successor to Adversarial Imitation Learning methods like GAIL (coming from the same lab).
It extends the theoretical framework for Inverse RL to non-adversarial and scalable learning, showing guaranteed convergence for the first time.
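To make the "~15 lines" claim concrete, here is a minimal sketch of an IQ-Learn-style critic update for discrete actions. It is not the repository's exact code: q_net, the expert/policy batches, gamma, and alpha are illustrative placeholders, and the chi-squared regularization term used in the paper is omitted.

import torch

def soft_value(q_net, states, alpha=1.0):
    # Soft value: V(s) = alpha * logsumexp(Q(s, .) / alpha)
    return alpha * torch.logsumexp(q_net(states) / alpha, dim=1, keepdim=True)

def iq_critic_loss(q_net, expert_batch, policy_batch, gamma=0.99):
    # expert_batch / policy_batch: (states, actions, next_states) tensors
    e_s, e_a, e_ns = expert_batch
    p_s, _, p_ns = policy_batch

    # Implicit reward on expert transitions: r(s, a) = Q(s, a) - gamma * V(s')
    q_expert = q_net(e_s).gather(1, e_a.long().unsqueeze(1))
    expert_term = (q_expert - gamma * soft_value(q_net, e_ns)).mean()

    # Value regularization over states visited by expert and policy
    # (a "v0"-style variant would instead use initial states only)
    s_all = torch.cat([e_s, p_s])
    ns_all = torch.cat([e_ns, p_ns])
    value_term = (soft_value(q_net, s_all) - gamma * soft_value(q_net, ns_all)).mean()

    # Maximize the expert term while pushing down the value term
    return -(expert_term - value_term)

In the repository this kind of loss sits on top of an existing soft-Q or SAC-style agent, replacing the usual TD critic loss.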

Citation

@inproceedings{garg2021iqlearn,
title={IQ-Learn: Inverse soft-Q Learning for Imitation},
author={Divyansh Garg and Shuvam Chakraborty and Chris Cundy and Jiaming Song and Stefano Ermon},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021},
url={https://openreview.net/forum?id=Aeo-xqtb5p}
}

Key Advantages

Drop-in replacement for Behavior Cloning
Non-adversarial online IL (successor to GAIL & AIRL)
Simple to implement
Performant with very sparse data (a single expert demo)
Scales to complex image-based envs (SOTA on Atari and playing Minecraft)
Recovers rewards from envs

Usage

To install and use IQ-Learn check the instructions provided in the iq_learn folder.
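
For example, users in the issues below train with Hydra-style command-line overrides such as

python train_iq.py env=humanoid agent=sac expert.demos=1 method.loss=v0 method.regularize=True seed=0

and launch the MuJoCo experiments via iq_learn/scripts/run_mujoco.sh. Treat these flags as illustrative; the instructions in the iq_learn folder are authoritative.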

Imitation

Reaching human-level performance on Atari with pure imitation:

Rewards

Recovering environment rewards on GridWorld:

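Concretely, once the soft-Q function is learned, a per-transition reward can be read off as r(s, a) = Q(s, a) - gamma * V(s'), with V the soft (logsumexp) value. Below is a minimal sketch for discrete actions, reusing the illustrative q_net from the sketch above (this is not the repository's vis/maze_vis.py code):

import torch

def recover_reward(q_net, states, actions, next_states, gamma=0.99):
    # r(s, a) = Q(s, a) - gamma * V(s')
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1))
    v_next = torch.logsumexp(q_net(next_states), dim=1, keepdim=True)
    return q_sa - gamma * v_next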

Questions

Please feel free to email us if you have any questions.

Div Garg ([email protected])

Comments
  • Issue on reproduce MuJoCo results-HalfCheetah-v2

    Dear author, it's an honor to see your paper and code! I am a novice in this area and I am trying to reproduce your experiments, but I have encountered some obstacles. On HalfCheetah I don't reach the 5076.6 points reported in the paper; my reward is below 0 in most cases even though the code is unmodified. Could the hyperparameter settings be the reason? If so, could you share your hyperparameters? Thanks for sharing!

    opened by shuoye1000 14
  • How to judge the convergence

    Hello!

    Thanks so much for sharing the code!

    I am new to inverse reinforcement learning. I am trying to apply the code to a customized environment without knowing anything about the reward function. Are there any metrics other than rewards that can be used to judge convergence?

    Thanks ;).

    opened by 18627242758 9
  • Issue on reproduce MuJoCo results

    Hello! Could you provide the hyperparameters and the number of training steps for each MuJoCo env to reproduce the Table 5 results (Appendix D.2 in the original paper)? I've tried the iq_learn/scripts/run_mujoco.sh script to train on Ant-v2 for ~300k steps with default hyperparameters and 10 expert trajectories, but only got eval returns around 3000-4000. The eval/episode_reward shows 3301.59521 and the best_returns is 4275.31665. Thank you!

    opened by Ending2015a 5
  • expert datasets

    When I use the trajectories in iq_learn/experts/, the results are not optimal. Are these just demo data? It also seems that the expert datasets on Dropbox cannot be downloaded successfully. Thanks for your help!

    opened by chenmao001 4
  • Issue on reproducing pointmaze experiments

    Hi, thanks for sharing your work.

    Currently I'm trying to reproduce the results in the pointmaze environment. I am wondering why there is a negation in the visualize_reward function in vis/maze_vis.py (line 144).

    Also, I would like to know whether the only_expert_state option works in the pointmaze environment. If so, is there a suitable set of hyperparameters for it? Thank you!

    opened by wognl0402 1
  • Pseudocode and questions

    Hey, thanks for sharing this work! And I really appreciate the in-depth, beginner-friendly blog post! I was wondering whether this pseudocode is

    1. Correct
    2. Helpful to anyone else trying to understand the code

    If not, feel free to close. But I would appreciate it if you could help me understand a few parts of the code! Thanks!

    Questions

    1. How come the environment reward env_reward is unused and reward is entirely dependent on the output of the model? Does this algorithm only learn the expert and never take into account environment reward?
    2. Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?

    Pseudocode

    
import torch
from copy import deepcopy

def init_network():
    q_net = torch.nn.Linear(state_size, action_size)
    target_net = deepcopy(q_net)

def episode_step():
    # sample an action from the softmax (Boltzmann) policy over the Q-values
    action = torch.distributions.Categorical(logits=q_net(state)).sample()
    next_state, reward, done = env.step(action)
    memory.add((state, next_state, action, reward, done))  # memory = collections.deque
    update_critic(memory, expert_memory)
    target_net = deepcopy(q_net)

def update_critic(memory, expert_memory):
    # The idea here is that we backprop the rewards for both the expert's actions and the agent's actions;
    # the batch dimension contains examples from the expert and the agent
    state = torch.cat((memory[:][0], expert_memory[:][0]))
    next_state = torch.cat((memory[:][1], expert_memory[:][1]))
    action = torch.cat((memory[:][2], expert_memory[:][2]))
    # v = soft value of the current state: logsumexp of Q over all possible actions
    v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
    # next_v = soft value of the next state
    next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
    # q = predicted Q-value of the (state, action) pair that was actually taken
    q = q_net(state).gather(1, action.long().unsqueeze(1))
    # done flags and the expert mask (where_expert) also come from the sampled batch
    loss = iq_loss(q, v, next_v, done, where_expert)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()

def iq_loss(q, v, next_v, done, where_expert):
    if done:
        expert_reward = q[where_expert]
        # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?
        value_loss = v.mean()
    else:
        expert_reward = (q - next_v)[where_expert]
        value_loss = (v - next_v).mean()
    # Why is this negative?
    expert_reward_loss = -expert_reward.mean()
    loss = expert_reward_loss + value_loss
    return loss
    
    opened by djsamseng 0
  • Issue on Ant-v2 expert data and Humanoid-v2 random seed experiments

    Hi! Thank you very much for sharing your paper and source code! I am new to inverse RL and recently wanted to apply your method on a robot.

    About Ant-v2

    1. I found that the reward for each step in your Ant-v2 expert data is 1. Why set the reward like this? And how can SQIL be run correctly in your code?

    About random seeds

    1. I found that the results with different random seeds in the humanoid experiments are very different; some results are around 1500 points. Is it because the number of learning steps is only 50000, or because there is only 1 expert demo?

    I ran with: python train_iq.py env=humanoid agent=sac expert.demos=1 method.loss=v0 method.regularize=True agent.actor_lr=3e-05 seed=0/1/2/3/4/5 agent.init_temp=1

    Your work is very valuable and I look forward to your help in resolving my doubts.

    opened by XizoB 1
  • Code for gridworld experiments

    Hi, thanks for making your code accessible! I was wondering if you could also share the code to reproduce the gridworld experiments, specifically Fig. 13 from your paper.

    opened by HareshKarnan 2