Solutions of Reinforcement Learning 2nd Edition

YIFAN WANG

Last update: Dec 30, 2022

Overview

Solutions of Reinforcement Learning 2nd Edition (Original Book by Richard S. Sutton, Andrew G. Barto)

How to contribute and current situation (9/11/2021~)

I work full-time as an AI engineer and barely have free time to manage this project any more. Here is a simple guide to how I will respond to contributions:

For exercises that have no answer yet (for example, in Chapter 12):

Prepare your LaTeX code; make sure it compiles and looks reasonably nice.
Send your code to [email protected]. By default, I will put the contributor's name in the PDF file, beside the exercise. You can also stay anonymous; just note that in the email.
I will update the corresponding solution PDF.

For a solution that you think is wrong, but that is trivial to change:

Ask in issues. If there are multiple confirmations and reports of the same issue, I will change the exercise. (The pass rate of such issues is around 30%.)

For a solution that you think is wrong or incomplete, but where that is hard to explain in an issue:

Follow the steps of the first case (just as if the exercise had no solution).

I know there are more or less automatic procedures for committing and contributing to the PDFs, but given the number of contributions, I have decided to pass on them. (Currently only 2% is contributed by people other than me.)

Now I am more concentrated on computer vision and have less time to contribute to my interest (RL). But I do hope and think RL is the future subject that will be at the top of the AI pyramid one day, and I will come back. Thanks for all your support, and best wishes for your own careers.

Students who are using this to complete your homework: stop it. This is written to serve the millions of self-learners who do not have an official guide or a proper learning environment. And, of course, as a personal project, it has ERRORS. (Contribute to issues if you find any.)

Welcome to this project. It is a tiny project where we don't do too much coding (yet), but we cooperate to finish some tricky exercises from the famous RL book Reinforcement Learning: An Introduction by Sutton. You may know that this book, especially the second edition, which was published last year, has no official solution manual. If you send your answers to the email address that the author left, you will get back a fake answer sheet that is incomplete and old. So why don't we write our own? Most of the problems are mathematical proofs, from which one can learn the theoretical backbone nicely, but some of them are quite challenging coding problems. Both kinds will be updated gradually, but the math will go first.

The main author is me; the current main collaborator is Jean Wissam Dupin, and before that it was Zhiqi Pan (who has since left).

Main Contributors for Error Fixing:

burmecia's Work (Error Fix and code contribution)

Chapter 3: Ex 3.4, 3.5, 3.6, 3.9, 3.19

Chapter 4: Ex 4.7 Code (in Julia)

Jean's Work (Error Fix):

Chapter 3: Ex 3.8, 3.11, 3.14, 3.23, 3.24, 3.26, 3.28, 3.29, 4.5

QihuaZhong's Work (Error fix, analysis)

Ex 6.11, 5.11, 10.5, 10.6

luigift's Work (Error fix, algorithm contribution)

Ex 10.4 10.6 10.7 Ex 12.1 (alternative solution)

Other people (Error Fix):

Ex 10.2: SHITIANYU-hue

Ex 10.6, 10.7: Mohammad Salehi

ABOUT MISTAKES:

Don't expect the solutions to be perfect; there are always mistakes, especially in Chapter 3, where I was in a rush. And sometimes the problems are simply open-ended. Show your ideas and question the solutions in 'issues' at any time!

Let's roll'n out!

UPDATE LOG:

Will update and revise this repo after 2021 April

[UPDATE APRIL 2020] After implementing Ape-X and D4PG in another project of mine, I will go back to this project and at least finish the policy gradient chapter.

[UPDATE MAR 2020] Chapter 12 is almost finished and has been updated, except for the last 2 questions: one on the dutch trace and one on double expected SARSA. They are trickier than the other exercises and I will update them a little bit later. Please share your ideas by opening issues if you already have a valid solution.

[UPDATE MAR 2020] Due to multiple interviews (it is interview season in Japan, despite the virus!), I have to postpone the planned update to March or later, depending on how far I can go. (That means I am doing leetcode-ish stuff every day.)

[UPDATE JAN 2020] Future works will NOT be stopped. I will try to finish it in FEB 2020.

[UPDATE JAN 2020] Chapter 12's ideas are not so hard, but its questions are very difficult (the most challenging ones in this book). So far, I have finished up to Ex 12.5, and I think my answer to Ex 12.1 is the only valid one on the internet (or not, challenges welcome!). But because the latter half is even more challenging (and tedious when it involves many infinite sums), I will release the final version a little bit later.

[UPDATE JAN 2020] Chapter 11 updated. One might have to read the referenced link to Sutton's paper in order to understand some parts, especially how and why Emphatic-TD works.

[UPDATE JAN 2020] Chapter 10 is long but interesting! Move on!

[UPDATE DEC 2019] Chapter 9 takes a long time to read thoroughly, but surprisingly it has just a few exercises. So after uploading the Chapter 9 PDF, I really do think I should go back to the previous chapters to complete their programming exercises.

Chapter 12

[Updated March 27] Almost finished.

CHAPTER 12 SOLUTION PDF HERE

Chapter 11

Major challenges about off-policy learning. Like Chapter 9, practices are short.

CHAPTER 11 SOLUTION PDF HERE

Chapter 10

It is a substantial complement to Chapter 9. Still many open problems which are very interesting.

CHAPTER 10 SOLUTION PDF HERE

Chapter 9

Long chapter, short practices.

CHAPTER 9 SOLUTION PDF HERE

Chapter 8

Finished without programming. I plan on creating additional exercises for this chapter because much of its material lacks practice problems.

CHAPTER 8 SOLUTION PDF HERE

Chapter 7

Finished without programming. Thanks for help from Zhiqi Pan.

CHAPTER 7 SOLUTION PDF HERE

Chapter 6

Fully finished.

CHAPTER 6 SOLUTION PDF HERE

Chapter 5

Partially finished.

CHAPTER 5 SOLUTION PDF HERE

Chapter 4

Finished. Ex 4.7 partially finished. That DP question will burn my mind and my MacBook, but I encourage anyone who doesn't mind that to try it yourself. Running through it forces you to remember everything behind ordinary DP. :)

CHAPTER 4 SOLUTION PDF HERE

Chapter 3 (I was in a rush in this chapter. Watch out for strange answers, if any.)

CHAPTER 3 SOLUTION PDF HERE

Comments

ex 8.5

In exercise 8.5, I believe that "stochastic environments" refers to stochastic state transitions as well as rewards. Thus, the model table should represent p(s', r | s, a) and not only p(r | s, s', a), creating a distribution model rather than a sample model. When queried, the model would generate sample transitions following p(s', r | s, a).

I believe that in order to cope with changing environments some sort of exploratory reward should be given following the intuition of Dyna-Q+. Decaying experience would take time to shift the estimate of Q.
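
The distribution-model idea in this comment can be sketched with a count-based table. This is a minimal illustration under my own assumptions, not the repository's code: the class name `TabularDistributionModel` and its methods are hypothetical, states and actions are any hashable values, and p(s', r | s, a) is estimated by simple relative frequencies.

```python
import random
from collections import defaultdict


class TabularDistributionModel:
    """Count-based estimate of p(s', r | s, a) (hypothetical helper)."""

    def __init__(self, seed=0):
        # counts[(s, a)][(s_next, r)] = number of times that outcome was seen
        self.counts = defaultdict(lambda: defaultdict(int))
        self.rng = random.Random(seed)

    def update(self, s, a, r, s_next):
        """Record one real transition."""
        self.counts[(s, a)][(s_next, r)] += 1

    def probabilities(self, s, a):
        """Relative frequencies of each observed (s_next, r) outcome."""
        outcomes = self.counts[(s, a)]
        total = sum(outcomes.values())
        if total == 0:
            return {}
        return {outcome: n / total for outcome, n in outcomes.items()}

    def sample(self, s, a):
        """Draw a simulated (s_next, r) pair for planning."""
        outcomes = self.counts[(s, a)]
        keys = list(outcomes)
        weights = [outcomes[k] for k in keys]
        return self.rng.choices(keys, weights=weights, k=1)[0]


model = TabularDistributionModel()
for s_next, r in [(1, 0.0), (1, 0.0), (2, 1.0), (1, 0.0)]:
    model.update(0, "right", r, s_next)
probs = model.probabilities(0, "right")
simulated = model.sample(0, "right")
```

A planner in the style of Dyna-Q would call `sample` to generate simulated experience; an exploration bonus in the spirit of Dyna-Q+ would have to be added separately and is not shown here.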

opened by luigift 10

Ex 6.5

The first point is valid: alpha is not small enough. But the second point is not relevant. The 'down and up again' behaviour of the error is caused by the initialization value of 0.5, which is exactly the true value of state c. This means that during the first few episodes, when state c has not been updated yet, it has zero error. But once it gets updated, the estimated value for state c will fluctuate around the true value without staying very close to it (the higher alpha, the larger the fluctuation).

Therefore, the initialization of exactly 0.5 causes the down and up again phenomenon.

The following figure empirically demonstrates situations when states are initialized with a different value. It's clear only 0.5 initialization has this issue.

If we break down the squared error of each state a,b,c,d and e (all initialized with 0.5) over training episodes, it again shows state c is the cause.
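
The effect described here can be reproduced with a short simulation. A minimal sketch, assuming the five-state random walk of Example 6.2 (states A to E, every episode starts in C, reward +1 only on the right exit, gamma = 1); the function name `run_td0` and its default parameters are illustrative and not taken from the repository.

```python
import random

# True values of states A..E in the five-state random walk.
TRUE_VALUES = [i / 6 for i in range(1, 6)]


def run_td0(init_value, alpha=0.1, episodes=100, seed=0):
    """Return the squared error of state C after each episode under tabular TD(0)."""
    rng = random.Random(seed)
    v = [init_value] * 5
    errors_c = [(v[2] - TRUE_VALUES[2]) ** 2]  # error before any update
    for _ in range(episodes):
        s = 2  # index of state C
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:  # left exit, reward 0
                reward, v_next, done = 0.0, 0.0, True
            elif s_next > 4:  # right exit, reward +1
                reward, v_next, done = 1.0, 0.0, True
            else:
                reward, v_next, done = 0.0, v[s_next], False
            v[s] += alpha * (reward + v_next - v[s])
            if done:
                break
            s = s_next
        errors_c.append((v[2] - TRUE_VALUES[2]) ** 2)
    return errors_c


errors_half = run_td0(0.5)  # C starts at its true value
errors_zero = run_td0(0.0)  # C starts away from its true value
```

With the 0.5 initialization the error of C starts at exactly zero and can only grow once C's neighbours have moved, whereas other initial values start with a positive error.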

opened by qihuazhong 8

Exercise 3.23

Hello,

I'm not sure about the solution here.

Firstly, it seems the Bellman optimality equation for $q_{*}(s,a)$ is wrong. Looks like there's a typo where the sum $r + \gamma \max_{a'} q_{*}(s',a')$ has been replaced by the product $r_{\gamma} \max_{a'} q_{*}(s',a')$.

This error has then been propagated to the two solutions given as examples.

If we ignore this typo, I agree with the solution for $q_{*}(high, search)$.

However, I think the answer for $q_{*}(high, wait)$ is wrong. If the action 'wait' is taken, then the next state s' must be 'high', for which the next actions a' ∈ {'search', 'wait'} are possible. I think the answer should be

$q_{*}(high, wait) = r_{wait} + \gamma \max\{q_{*}(high, wait), q_{*}(high, search)\}$

What do you think?

opened by wissam124 4

Ex 6.7

Hi, the off-policy algorithm suggested in the PDF is based on the algorithm on page 110, which is an MC algorithm (running weighted mean, full-episode returns, Q approximations). It may be clearer if the solution is based on the TD algorithm given on page 120 (bootstrapping, single-step rewards, V approximations). Here is a suggestion:

Input: target policy π, behavior policy b with coverage of π
Algorithm param: step size α∈(0,1], discount factor γ∈[0,1]

Initialize V(s) for all s∈S+ arbitrarily, except V(terminal) = 0.
For each episode:
  S <- initial state
  For each step while S is not terminal:
      A <- sample from b(·|S)
      R, S' <- take action A
      ρ <- π(A|S)/b(A|S)
      V(S) <- V(S) + α[ρ(R + γV(S')) - V(S)]
      S <- S'
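
As a runnable illustration of one step of such an algorithm, here is a minimal sketch. It assumes the importance-sampling ratio ρ multiplies the whole one-step target R + γV(S'), which is one common form of off-policy TD(0); the function name `off_policy_td0_update` and the convention that `pi` and `b` are callables taking `(action, state)` are my own assumptions.

```python
def off_policy_td0_update(v, s, a, r, s_next, done, pi, b, alpha=0.1, gamma=1.0):
    """Apply one importance-sampled TD(0) update to the value table v, in place.

    pi(a, s) and b(a, s) return the target and behavior probabilities;
    b must give positive probability to every action pi can take.
    """
    rho = pi(a, s) / b(a, s)
    # The value of a terminal successor is zero by definition.
    target = r + (0.0 if done else gamma * v[s_next])
    v[s] += alpha * (rho * target - v[s])
    return v[s]


# Tiny example: target policy always goes "right", behavior is uniform.
v = {0: 0.0, 1: 1.0}
pi = lambda a, s: 1.0 if a == "right" else 0.0
b = lambda a, s: 0.5
new_value = off_policy_td0_update(v, 0, "right", 0.0, 1, False, pi, b, alpha=0.5)
```

In this example ρ = 2, so the sampled target is doubled to compensate for the behavior policy choosing "right" only half the time.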

What do you think? Best,

opened by tomasruizt 3

Ex 4.7-A

Line 151 : pi[(i, j)] = 0 Why are all pi[(i,j)] initialized to 0???

Shouldn't pi[(i,j)] -> a ?? i.e., randomly mapped to actions in set A = {-5,-4,-3,-2,-1, 0, 1,2,3,4,5}

opened by Avalpreet 2

Ex 3.5

Being more precise in the solution, I think s belongs to S (not S+), since the dynamics would not make much sense for the terminal state, i.e., there are no possible next states or even actions.

opened by franzoni315 2

Ch 10 Ex. 10.6

Hi I'm currently going through the exercises for this book as well: https://github.com/KimMatt/RL_Projects

In your solution to 10.6, how did you get from
$E[R_{t+1}|S_0=s] - r(\pi)$ to $\frac{(-1)^t}{2}$ ?

opened by KimMatt 2

ex4.5 solution modification

Hi, in Policy Improvement of ex4.5, isn't it clearer and more succinct to update the policy by selecting the action with the maximum q value for every state? Such as the following:

policy_stable <- true
For each s \in S:
  old_action <- \pi(s)
  \pi(s) <- argmax_{a} q(s,a)
  ......(same as original solution)

opened by xinyuan-huang 2

Ex 6.13

I think the update equations for Double Expected Sarsa with epsilon-greedy target policy can be:

Q_{1}(S_{t},A_{t})\leftarrow Q_{1}(S_{t},A_{t}) + \alpha\left[R_{t+1}+\gamma\sum_a\pi(a|S_{t+1})Q_{2}(S_{t+1},a)-Q_{1}(S_{t},A_{t})\right]

where

\pi(a|s)=\begin{cases}1-\epsilon+\frac{\epsilon}{|A(s)|}, & \text{if } a=argmax_{a'}(Q_{1}(s,a')+Q_{2}(s,a'))\\\frac{\epsilon}{|A(s)|}, & \text{otherwise}\end{cases}
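
The update above can be sketched in code as follows. This is a minimal sketch under stated assumptions: Q1 and Q2 are lists of per-state action-value lists, ties in the greedy choice go to the lowest action index, and, as in standard double learning, a fair coin decides which table is updated (the formula above shows only the Q1 case). The function names are hypothetical.

```python
import random


def epsilon_greedy_probs(q1_s, q2_s, epsilon):
    """Epsilon-greedy action probabilities with respect to Q1 + Q2 in one state."""
    n = len(q1_s)
    sums = [x + y for x, y in zip(q1_s, q2_s)]
    greedy = max(range(n), key=sums.__getitem__)  # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def double_expected_sarsa_update(q1, q2, s, a, r, s_next, done,
                                 alpha, gamma, epsilon, rng):
    """Update one table, chosen by a fair coin, using the other for the expectation."""
    if rng.random() < 0.5:
        q_upd, q_eval = q1, q2
    else:
        q_upd, q_eval = q2, q1
    expected = 0.0
    if not done:
        probs = epsilon_greedy_probs(q1[s_next], q2[s_next], epsilon)
        expected = sum(p * q for p, q in zip(probs, q_eval[s_next]))
    q_upd[s][a] += alpha * (r + gamma * expected - q_upd[s][a])


q1 = [[0.0, 0.0], [1.0, 3.0]]
q2 = [[0.0, 0.0], [2.0, 4.0]]
probs = epsilon_greedy_probs(q1[1], q2[1], epsilon=0.2)
double_expected_sarsa_update(q1, q2, s=0, a=0, r=1.0, s_next=1, done=False,
                             alpha=0.5, gamma=1.0, epsilon=0.2,
                             rng=random.Random(0))
```

Only one of `q1[0][0]` and `q2[0][0]` changes per call; the target policy is computed from the sum of both tables, while the expectation uses the table that is not being updated.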

opened by burmecia 2

Question 3.22

Numerical error: when gamma = 0.9, v_right = 9.5 and v_left = 5.3, rounded to two significant figures. The former sums over odd powers of gamma, the latter even powers.

Might also want to check gamma = 0.5
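
These numbers can be checked with the closed forms of the two geometric series. A small sketch, assuming the usual setup of Exercise 3.22: the left loop yields rewards +1, 0, +1, 0, ... and the right loop yields 0, +2, 0, +2, ...; the function name `loop_values` is my own.

```python
def loop_values(gamma):
    """Return (v_left, v_right) for the top state under the two deterministic policies.

    Left:  1 + gamma^2 + gamma^4 + ...          = 1 / (1 - gamma^2)
    Right: 2 * (gamma + gamma^3 + gamma^5 + ...) = 2 * gamma / (1 - gamma^2)
    """
    denom = 1.0 - gamma ** 2
    return 1.0 / denom, 2.0 * gamma / denom


for gamma in (0.0, 0.5, 0.9):
    v_left, v_right = loop_values(gamma)
    print(f"gamma={gamma}: v_left={v_left:.2f}, v_right={v_right:.2f}")
```

Under these assumptions the two policies tie at gamma = 0.5, left is better at gamma = 0, and right is better at gamma = 0.9.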

opened by openerror 2

Exercise 3.19

Hi, I think that one has to multiply the value of node s' (v_\pi (s')) by the discounting factor gamma. Thank you for your work, it is helping me a lot!
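
Written out, and assuming the exercise asks for $q_\pi(s,a)$ in terms of the successor state values with the book's four-argument dynamics $p(s', r \mid s, a)$, the backup with the discount factor included would read:

```latex
q_\pi(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]
           = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]
```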

opened by Qcaria 2

Exercise 10.5

I also found the wording of this question confusing. My best guess is to be "how would the differential TD(0) algorithm be different from tabular TD(0)?" Like you, I also came up with the update formula for the weight vector. (10.10) gives us the TD error, assuming we have the average reward estimate R_bar. From there, I think the only thing you're missing to create the differential TD(0) algorithm is the update for R_bar, which uses the TD error.

In tabular TD(0), we have a single line that updates V(S). For differential TD(0), I think we need to expand that to the following 3 lines to update the weights vector.

Let me know if you think that sounds reasonable.
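
For readers who want something concrete, here is a minimal sketch of one differential semi-gradient TD(0) step. It assumes linear value estimates v(s) = w · x(s), a step size `beta` for the average-reward estimate, and a continuing task with no terminal states; the function name is hypothetical, and this is not necessarily the exact set of lines the comment refers to.

```python
def differential_td0_update(w, avg_reward, x, r, x_next, alpha, beta):
    """One differential semi-gradient TD(0) step with linear values v(s) = w . x(s).

    Returns the updated (w, avg_reward); the inputs are not modified.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r - avg_reward + v_next - v  # differential TD error
    new_avg_reward = avg_reward + beta * delta  # update of the R_bar estimate
    # For linear values the gradient of v(s) with respect to w is x(s).
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return new_w, new_avg_reward


w, r_bar = differential_td0_update(
    w=[0.0, 0.0], avg_reward=0.0, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0],
    alpha=0.5, beta=0.1,
)
```

The same TD error drives both the weight update and the average-reward update, which is the main difference from the discounted tabular case.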

Also, since you have done a lot of work to produce these solutions, you might want to see if Rich Sutton would honor the offer to provide book solutions if you email him your answers :) He said he would on his site! http://incompleteideas.net/book/solutions.html. Your answers have been invaluable as I work through the textbook, and I'd also be curious to know how close you are to the book solutions.

opened by ShawnHymel 0

Improved script for ex. 4.7

Hello, I have submitted my code for the exercise 4.7, in place of the current one.

It is optimized using vectorization and memoization. It takes under a minute to compute the optimal policy.

The code is configurable through global constants

The additional constraints from the exercise 4.7 have been implemented

The code produces a pretty plot of the policy and of the value function

I also didn't get results exactly equal to that of the book, but they're very close.

Plot of the base problem as presented in example 4.2:

After introducing the two nonlinearities:

opened by CorentinJ 0

[Ex 4.5] Deterministic policy

In your pseudocode for calculating q*, if π is deterministic (as stated in the initialization and in the pseudocode given for v*), then you don't need to loop over all a∈A in step 2, and you don't need to weight over all a' in the Q(s,a) calculation.

Again, in step 3 you shouldn't loop over a, because you get old-action from the deterministic policy.

Thanks for considering this fix ;) Have a nice day !

opened by Jonathan2021 0

[Ex 4.2] Changing dynamics changes the state values

State 15 (with state 13's dynamics changed) isn't equivalent to state 13. It is further away from the upper-left terminal state but closer to the lower-right one (left, right and down are equivalent to state 13, but up brings it closer to the lower right than up in state 13 does). I ran your script 4.2.py (by the way, going left and right in state 15 leads to states 12 and 14 respectively, and not to state 15 as written in your script), added a print in the draw function for state 15, and you can see that the decimals are not the same as for state 13. You have to recalculate the whole game. All the states change slightly (those further away change less) if you take the decimals into account (compared to running your script for 4.1, which by the way prints value-1 in the board for some weird reason, although the accurate state value list is OK). Thanks for your efforts in providing a correction for the exercises!

opened by Jonathan2021 1
