Solutions of Reinforcement Learning 2nd Edition


Solutions of Reinforcement Learning 2nd Edition (Original Book by Richard S. Sutton,Andrew G. Barto)

How to contribute and current situation (9/11/2021~)

I have been working as a full-time AI engineer and barely have free time to manage this project any more. I want to make a simple guidance of how to response to contributions:

For exercises that have no answer yet, (for example, chapter 12)

  1. Prepare your latex code, make sure it works and looks somewhat nice.
  2. Send you code to [email protected]. By default, I will put contributer's name in the pdf file, besides the exercise. You can be anoymous as well just noted in the email.
  3. I will update the corresponding solution pdf.

For solution that you think is wrong, but it is trivial to change:

  1. Ask in issues. If there are multiple confirmations and reports to the same issue, I will change the excercise. (the pass rate of such issue is around 30%)

For solution that you think is wrong or incomplete, but it is hard to say that in issue.

Follow the first steps (just as if this exercise has no solution)

I know there is an automatic-ish commit and contribute to pdf procedure, but from the number of contributions, I decide to pass it on. (currently only 2% is contributed by person other than me)

Now I am more concentrated on computer vision and have less time contributing to the interest (RL). But I do hope and think RL is the future subject that will be on the top of AI pyramid one day and I will come back. Thanks for all your supports and best wishes to your own careers.

Those students who are using this to complete your homework, stop it. This is written for serving millions of self-learners who do not have official guide or proper learning environment. And, Of Course, as a personal project, it has ERRORS. (Contribute to issues if you find any).

Welcome to this project. It is a tiny project where we don't do too much coding (yet) but we cooperate together to finish some tricky exercises from famous RL book Reinforcement Learning, An Introduction by Sutton. You may know that this book, especially the second version which was published last year, has no official solution manual. If you send your answer to the email address that the author leaved, you will be returned a fake answer sheet that is incomplete and old. So, why don't we write our own? Most of problems are mathematical proof in which one can learn the therotical backbone nicely but some of them are quite challenging coding problems. Both of them will be updated gradually but math will go first.

Main author would be me and current main cooperater is Jean Wissam Dupin, and before was Zhiqi Pan (quitted now).

Main Contributers for Error Fixing:

burmecia's Work (Error Fix and code contribution)

Chapter 3: Ex 3.4, 3.5, 3.6, 3.9, 3.19

Chapter4: Ex 4.7 Code(in Julia)

Jean's Work (Error Fix):

Chapter 3: Ex 3.8, 3.11, 3.14, 3.23, 3.24, 3.26, 3.28, 3.29, 4.5

QihuaZhong's Work (Error fix, analysis)

Ex 6.11, 5.11, 10.5, 10.6

luigift's Work (Error fix, algorithm contribution)

Ex 10.4 10.6 10.7 Ex 12.1 (alternative solution)

Other people (Error Fix):

Ex 10.2 SHITIANYU-hue Ex 10.6 10.7 Mohammad Salehi


Don't even expect the solutions be perfect, there are always mistakes. Especially in Chapter 3, where my mind was in a rush there. And, sometimes the problems are just open. Show your ideas and question them in 'issues' at any time!

Let's roll'n out!


Will update and revise this repo after 2021 April

[UPDATE APRIL 2020] After implementing Ape-X and D4PG in my another project, I will go back to this project and at least finish the policy gradient chapter.

[UPDATE MAR 2020] Chapter 12 almost finished and is updated, except for the last 2 questions. One for dutch trace and one for double expected SARSA. They are tricker than other exercises and I will update them little bit later. Please share your ideas by opening issues if you already hold a valid solution.**

[UPDATE MAR 2020] Due to multiple interviews ( it is interview season in japan ( despite the virus!)), I have to postpone the plan of update to March or later, depending how far I could go. (That means I am doing leetcode-ish stuff every day)

[UPDATE JAN 2020] Future works will NOT be stopped. I will try to finish it in FEB 2020.

[UPDATE JAN 2020] Chapter 12's ideas are not so hard but questions are very difficult. (most chanllenging one in this book ). As far, I have finished up to Ex 12.5 and I think my answer of Ex 12.1 is the only valid one on the internet (or not, challenge welcomed!) But because later half is even more challenging (tedious when it is related to many infiite sums), I would release the final version little bit later.

[UPDATE JAN 2020] Chapter 11 updated. One might have to read the referenced link to Sutton's paper in order to understand some part. Espeically how and why Emphatic-TD works.

[UPDATE JAN 2020] Chapter 10 is long but interesting! Move on!

[UPDATE DEC 2019] Chapter 9 takes long time to read thoroughly but practices are surprisingly just a few. So after uploading the Chapter 9 pdf and I really do think I should go back to previous chapters to complete those programming practices.

Chapter 12

[Updated March 27] Almost finished.


Chapter 11

Major challenges about off-policy learning. Like Chapter 9, practices are short.


Chapter 10

It is a substantial complement to Chapter 9. Still many open problems which are very interesting.


Chapter 9

Long chapter, short practices.


Chapter 8

Finished without programming. Plan on creating additional exercises to this Chapter because many materials are lack of practice.


Chapter 7

Finished without programming. Thanks for help from Zhiqi Pan.


Chapter 6

Fully finished.


Chapter 5

Partially finished.


Chapter 4

Finished. Ex4.7 Partially finished. Dat DP question will burn my mind and macbook but I encourage any one who cares nothing about that trying to do yourself. Running through it forces you remember everything behind ordinary DP.:)


Chapter 3 (I was in a rush in this chapter. Be aware about strange answers if any.)


  • ex 8.5

    ex 8.5

    In exercise 8.5, I believe that stochastic environments refers to stochastic state transition as well as reward. Thus, the model table should represent p(s', r|s, a) and not only p(r| s, s', a), creating a distribution model rather than a sample model. When queried the model would generate sample transitions following p(s', r | s, a).

    I believe that in order to cope with changing environments some sort of exploratory reward should be given following the intuition of Dyna-Q+. Decaying experience would take time to shift the estimate of Q.

    opened by luigift 10
  • Ex 6.5

    Ex 6.5

    Screen Shot 2020-03-09 at 9 29 40 PM

    The first point is valid: alpha is not small enough. But the second point is not relevant. The 'down and up again' behaviour of the error is caused by the initialization value of 0.5, which is exactly the true value of the state c. This means during the first few episodes when state c is not updated yet, it has the 0 error. But once it gets updated, the estimated value for state c will fluctuate around the true value for state c but not very close (the higher alpha, the more fluctuating).

    Therefore, the initialization of exactly 0.5 causes the down and up again phenomenon.

    The following figure empirically demonstrates situations when states are initialized with a different value. It's clear only 0.5 initialization has this issue. image

    If we break down the squared error of each state a,b,c,d and e (all initialized with 0.5) over training episodes, it again shows state c is the cause.


    opened by qihuazhong 8
  • Exercise 3.23

    Exercise 3.23


    I'm not sure about the solution here.

    Firstly, it seems the Bellman optimality equation for q_{*}(s,a) is wrong. Looks like there's a typo where the sum $r + \gamma*max_{a'}q_{*}(s',a')$ has been replaced into the product r_{\gamma}*max_{a'}q_{*}(s',a').

    This error has then been propagated to the two solutions given as examples.

    If we ignore this typo, I agree with the solution for q_{*}(high, search).

    However, I think the answer for q_{*}(high, wait) is wrong. If the action 'wait' is taken, then the next state s' must be 'high' for which the next actions a' = {'high', 'wait'} are possible. I think the answer should be

    q_{*}{high, wait) = [r_{wait} + \gamma * max{q_{*}(high, wait), q_{*}(high, search)}]

    What do you think?

    opened by wissam124 4
  • Ex 6.7

    Ex 6.7

    Hi, the off-policy algorithm suggesed in the pdf is based on the alg on page 110, which is a MC algorithm (running weighted mean, full-episode returns, Q approximations). It may be clearer if the solution is based on the TD algorithm given in page 120 (bootstrapping, single-step rewards, V approximations). Here is a suggestion:

    Input: target policy π, behavior policy b with coverage of π
    Algorithm param: step size α∈(0,1], discount factor γ∈[0,1]
    Initialize V(s) for all s∈S+ arbitrarily and V(terminal) = 0.
    For each episode:
      S <- Initial state
      For each step while S not terminal:
          A <- sample b(a|S)
          R, S' <- take action A
          ρ <- π(A|S)/b(A|S)
          Vπ(S) <- Vπ(S) + α[ρR + γVπ(S') - Vπ(S)]
          S <- S'

    What do you think? Best,

    opened by tomasruizt 3
  • Ex 4.7-A

    Ex 4.7-A

    Line 151 : pi[(i, j)] = 0 Why are all pi[(i,j)] initialized to 0???

    Shouldn't pi[(i,j)] -> a ?? i.e., randomly mapped to actions in set A = {-5,-4,-3,-2,-1, 0, 1,2,3,4,5}

    opened by Avalpreet 2
  • Ex 3.5

    Ex 3.5

    Being more precise in the solution, I think s belongs to S (not S+), since the dynamics would not make much sense for the terminal state, i.e., there are no possible next states or even actions.

    opened by franzoni315 2
  • Ch 10 Ex. 10.6

    Ch 10 Ex. 10.6

    Hi I'm currently going through the exercises for this book as well:

    In your solution to 10.6 how did you get from

    to ?
    opened by KimMatt 2
  • ex4.5 solution modification

    ex4.5 solution modification

    Hi, in Policy Improvement of ex4.5, isn't is more clear and succinct to update the policy by selecting the action with maximum q value for every state? such as following:

    policy_stable <- true For each s \in S: old_action <- \pi (s) \pi(s) <- argmax_{a} q(s,a) ......(same as original solution)

    opened by xinyuan-huang 2
  • Ex 6.13

    Ex 6.13

    I think the update equations for Double Expected Sarsa with epsilon-greedy target policy can be:


    Q_{1}(S_{t},A_{t})\leftarrow Q_{1}(S_{t},A_{t}) + \alpha\left[R_{t+1}+\gamma\sum_a\pi(a|S_{t+1})Q_{2}(S_{t+1},a)-Q_{1}(S_{t},A_{t})\right]



    \pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|}, & if a=argmax_{a}(Q_{1}(s,a')+Q_{2}(s,a'))\\\frac{\epsilon}{|A(s)|}, & otherwise\end{cases}
    opened by burmecia 2
  • Question 3.22

    Question 3.22

    Numerical error: when gamma = 0.9, v_right = 9.5 and v_left = 5.3, rounded to two significant figures. The former sums over odd powers of gamma, the latter even powers.

    Might also want to check gamma = 0.5

    opened by openerror 2
  • Exercise 3.19

    Exercise 3.19

    Hi, I think that one has to multiply the value of node s' (v_\pi (s')) by the discounting factor gamma. Thank you for your work, it is helping me a lot!

    opened by Qcaria 2
  • Exercise 10.5

    Exercise 10.5

    I also found the wording of this question confusing. My best guess is to be "how would the differential TD(0) algorithm be different from tabular TD(0)?" Like you, I also came up with the update formula for the weight vector. (10.10) gives us the TD error, assuming we have the average reward estimate R_bar. From there, I think the only thing you're missing to create the differential TD(0) algorithm is the update for R_bar, which uses the TD error.

    In tabular TD(0), we have a single line that updates V(S). For differential TD(0), I think we need to expand that to the following 3 lines to update the weights vector.


    Let me know if you think that sounds reasonable.

    Also, since you have done a lot of work to produce these solutions, you might want to see if Rich Sutton would honor the offer to provide book solutions if you email him your answers :) He said he would on his site! Your answers have been invaluable as I work through the textbook, and I'd also be curious to know how close you are to the book solutions.

    opened by ShawnHymel 0
  • Improved script for ex. 4.7

    Improved script for ex. 4.7

    Hello, I have submitted my code for the exercise 4.7, in place of the current one.

    • It is optimized using vectorization and memoization. It takes under a minute to compute the optimal policy.
    • The code is configurable through global constants
    • The additional constraints from the exercise 4.7 have been implemented
    • The code produces a pretty plot of the policy and of the value function

    I also didn't get results exactly equal to that of the book, but they're very close.

    Plot of the base problem as presented in example 4.2: image

    After introducing the two nonlinearities: image

    opened by CorentinJ 0
  • [Ex 4.5] Deterministic policy

    [Ex 4.5] Deterministic policy

    In your pseudocode for calculating q*, if π is deterministic (as stated in initialization and in pseudocode given for v*), then you don't need to loop on all a∈A in step 2 and you don't need to a to ponderate on all a' for the Q(s,a) calculation.

    Again, in step 3 you shouldn't loop on a because you get old-action with the deterministic policy.

    Thanks for considering this fix ;) Have a nice day !

    opened by Jonathan2021 0
  • [Ex 4.2] Changing dynamics changes the state values

    [Ex 4.2] Changing dynamics changes the state values

    The state 15 (with state 13's dynamics changed), isn't equivalent to state 13. It is further away from the upper left terminal state but closer to lower right (left, right and down are equivalent to state 13, but up makes it closer to lower right than up in state 13). I ran your script (by the way going left and right in state 15 leads to 12 and 14 respectively and not to state 15 as written in your script), added a print in the draw function for state 15 and you can see that the decimals are not the same as for state 13. You have to recalculate the whole game. All the states changed slightly (those further away changed less) if you take the decimals into account (compared to running your script for 4.1 which by the way prints value-1 in the board for some weird reason but the accurate state value list is ok). Thanks for your efforts in providing a correction for the exercises !

    opened by Jonathan2021 1
Simple Linear 2nd ODE Solver GUI - A 2nd constant coefficient linear ODE solver with simple GUI using euler's method

Simple_Linear_2nd_ODE_Solver_GUI Description It is a 2nd constant coefficient li

:) 4 Feb 5, 2022
This is 2nd term discrete maths project done by UCU students that uses backtracking to solve various problems.

Backtracking Project Sponsors This is a project made by UCU students: Olha Liuba - crossword solver implementation Hanna Yershova - sudoku solver impl

Dasha 4 Oct 17, 2021
2nd solution of ICDAR 2021 Competition on Scientific Literature Parsing, Task B.

TableMASTER-mmocr Contents About The Project Method Description Dependency Getting Started Prerequisites Installation Usage Data preprocess Train Infe

Jianquan Ye 298 Dec 21, 2022
Kaggle G2Net Gravitational Wave Detection : 2nd place solution

Kaggle G2Net Gravitational Wave Detection : 2nd place solution

Hiroshechka Y 33 Dec 26, 2022
The 2nd Version Of Slothybot

SlothyBot Go to this website: "" The 2nd Version Of Slothybot. The Bot Has Many Features, Such As: Moderation Commands; Kic

Slothy 0 Jun 1, 2022
Xview3 solution - XView3 challenge, 2nd place solution

Xview3, 2nd place solution test split aggregate score publ

Selim Seferbekov 24 Nov 23, 2022
This is the solution for 2nd rank in Kaggle competition: Feedback Prize - Evaluating Student Writing.

Feedback Prize - Evaluating Student Writing This is the solution for 2nd rank in Kaggle competition: Feedback Prize - Evaluating Student Writing. The

Udbhav Bamba 41 Dec 14, 2022
Conservative Q Learning for Offline Reinforcement Reinforcement Learning in JAX

CQL-JAX This repository implements Conservative Q Learning for Offline Reinforcement Reinforcement Learning in JAX (FLAX). Implementation is built on

Karush Suri 8 Nov 7, 2022
Reinforcement-learning - Repository of the class assignment questions for the course on reinforcement learning

DSE 314/614: Reinforcement Learning This repository containing reinforcement lea

Manav Mishra 4 Apr 15, 2022
MATLAB codes of the book "Digital Image Processing Fourth Edition" converted to Python

Digital Image Processing Python MATLAB codes of the book "Digital Image Processing Fourth Edition" converted to Python TO-DO: Refactor scripts, curren

Merve Noyan 24 Oct 16, 2022
Toontown House CT Edition

Toontown House: Classic Toontown House Classic source that should just work. ❓ W

Open Source Toontown Servers 5 Jan 9, 2022
The 7th edition of NTIRE: New Trends in Image Restoration and Enhancement workshop will be held on June 2022 in conjunction with CVPR 2022.

NTIRE 2022 - Image Inpainting Challenge Important dates 2022.02.01: Release of train data (input and output images) and validation data (only input) 2

Andrés Romero 37 Nov 27, 2022
Experimental solutions to selected exercises from the book [Advances in Financial Machine Learning by Marcos Lopez De Prado]

Advances in Financial Machine Learning Exercises Experimental solutions to selected exercises from the book Advances in Financial Machine Learning by

Brian 1.4k Jan 4, 2023
Providing the solutions for high-frequency trading (HFT) strategies using data science approaches (Machine Learning) on Full Orderbook Tick Data.

Modeling High-Frequency Limit Order Book Dynamics Using Machine Learning Framework to capture the dynamics of high-frequency limit order books. Overvi

Chang-Shu Chung 1.3k Jan 7, 2023
A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

A tour through tensorflow with financial data I present several models ranging in complexity from simple regression to LSTM and policy networks. The s

null 195 Dec 7, 2022
Exact Pareto Optimal solutions for preference based Multi-Objective Optimization

Exact Pareto Optimal solutions for preference based Multi-Objective Optimization

Debabrata Mahapatra 40 Dec 24, 2022
BMW TechOffice MUNICH 148 Dec 21, 2022
🏅 The Most Comprehensive List of Kaggle Solutions and Ideas 🏅

?? Collection of Kaggle Solutions and Ideas ??

Farid Rashidi 2.3k Jan 8, 2023
LeetCode Solutions

leetcode LeetCode Solutions groupped by common patterns YouTube: Telegram: Problems source

Vlad Ten 158 Dec 29, 2022