The Oh No! - Lab¶
…Our dilemma, and three ways to learn to maximize reward.
In this assignment we study exploration, with a cliff. We will take advantage of the simulator in The Paths Perspective on Value Learning to ask some basic but important questions about how learning rules and exploration interact.
Link: https://distill.pub/2019/paths-perspective-on-value-learning/
The learning rate \(\alpha\) is fixed. All we can control is the degree of exploration (\(\epsilon\)), with a nice little slider.
Our agents of interest are Monte Carlo, SARSA, and Q-learning. These are our three ways to learn to maximize reward.
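For orientation before the questions, here is a minimal sketch (not the simulator's actual code) of how the three agents differ, using a tabular Q, an \(\epsilon\)-greedy choice, and a fixed \(\alpha\). The names and constants below are illustrative assumptions only.

```python
import random
from collections import defaultdict

# Illustrative sketch only -- not the code behind the Distill simulator.
# Tabular action values, epsilon-greedy selection, and the three update targets.
Q = defaultdict(float)                      # Q[(state, action)] -> estimated return
alpha, gamma, epsilon = 0.1, 1.0, 0.5       # assumed constants for illustration
ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(state):
    """With probability epsilon explore at random; otherwise exploit current Q."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def monte_carlo_update(episode):
    """episode: list of (state, action, reward). Update toward the full observed return."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        Q[(state, action)] += alpha * (G - Q[(state, action)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    """Off-policy TD: bootstrap from the best available next action."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The slider in the article moves only \(\epsilon\); \(\alpha\) stays fixed, as noted above.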
The lab has two sections.
First, I quiz you with questions about both S & B and the text of The Paths Perspective on Value Learning, and how they relate.
Second, we simulate the exploration-exploitation trade-off and see how the same set of choices leads to quite different learning and interpretations of the same experiences (aka transition sets).
Section 1 - Oh no, a quiz!¶
Yikes!¶
Question 1.1¶
As in S & B Section 1, draw a tree diagram and planning route for the Cliff World, starting from the bottom-left position. Do this on a separate piece of paper, which you will photograph and upload later. Start by drawing the game board, then imagine taking a random move and write out the tree of available choices. Keep going until you have mapped out all the ways to reach the winning “+2” grid box.
(Note: write small?)
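If you would like to sanity-check your hand count, here is a small sketch that enumerates loop-free paths on a toy grid. The layout below (grid size, start, goal, and cliff cells) is an assumption made for illustration; adjust it to match the Cliff World in the article before trusting the number.

```python
# Toy sketch: enumerate loop-free paths from start to goal on a small grid.
# ROWS, COLS, START, GOAL, and CLIFF are assumptions -- edit them to match
# the Cliff World layout in the article.
ROWS, COLS = 3, 4
START, GOAL = (2, 0), (0, 3)        # bottom-left cell to the "+2" box
CLIFF = {(2, 1), (2, 2)}            # stepping here ends the episode

def paths(cell, visited):
    """Depth-first enumeration of simple (non-repeating) paths that reach GOAL."""
    if cell == GOAL:
        return [[cell]]
    r, c = cell
    found = []
    for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS
                and nxt not in visited and nxt not in CLIFF):
            for tail in paths(nxt, visited | {nxt}):
                found.append([cell] + tail)
    return found

print(len(paths(START, {START})), "loop-free paths from", START, "to", GOAL)
```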
# (Upload your sketch on Canvas)
Question 1.2¶
Looking at the game “from above”, where you can see all the paths at once as shown in the article, makes the game look simple. After doing the diagramming above, does the cliff game seem different to you? How challenging a game is it for our agents?
# Write your answer here as a comment. Explain yourself.
Question 1.3¶
Put another way: how would you change the state representation to learn more of a “bird’s-eye view”, like the one you have when reading the article? Think about alternatives to state-as-a-single-location, perhaps.
# Write your answer here as a comment. Explain yourself.
Question 1.4¶
Discuss the difficulties your proposal from the last question might face. Computational explosions? Time limits? Bias? Tell me a downside or cost of your approach and, if you can, how you might change things to overcome it.
# Write your answer here as a comment. Explain yourself.
Question 1.5¶
Do you think the exploration problem is easier or harder for your scheme?
# Write your answer here as a comment. Explain yourself.
Question 1.6¶
Provide an example of exploration and reward learning from your own experience, or from the natural world at large, that fits well into the Markov decision process abstraction.
# Write your answer here as a comment. Explain yourself.
Question 1.7¶
Don’t take my word for it. Do you, should we, buy the Markov decision process as a valid abstraction for biological problems at large? Should we tear down reinforcement learning and start again? If so, what would we do instead? Speculate. Imagine. Half-baked ideas welcome here.**
**If you love this idea, please come do research with me.
# Write your answer here as a comment. Explain yourself.
Question 1.8¶
For most of the class, aside from the beginning, I’ve had us working in grid worlds or other discrete settings. Now imagine the cliff game played out on an open, continuous field. If I asked you to redraw the planning tree from Q1.1 for that version, you would quickly find it impossible. Explain why.
# Write your answer here as a comment. Explain yourself.
Question 1.9¶
But the world is really a continuous place, right? Describe one way a biological exploring agent might cope, and cite any evidence you know of in support of this idea.
# Write your answer here as a comment. Explain yourself.
Question 1.10¶
All our agents rely on an idea called policy iteration, which S & B describe at length. In your own words, explain how it operates in the cliff game. Assume in your answer that the initial values V(s) for SARSA are zero.
Note: the tree you made in Q1.1 may help you think this through. And remember we’re averaging over several episodes/experiments.
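To see what the zero initialization does, here is a toy numerical illustration on a three-state corridor rather than the full Cliff World; the states, reward, and constants are assumptions chosen only to show how value propagates backward as evaluation and greedy action choice repeat.

```python
# Toy illustration (not the full Cliff World): TD-style evaluation on a
# corridor s0 -> s1 -> terminal, with reward +2 on the final step.
# With values initialized to zero, value leaks backward one step per episode.
alpha, gamma = 0.5, 1.0
V = {"s0": 0.0, "s1": 0.0, "terminal": 0.0}
episode = [("s0", 0.0, "s1"), ("s1", 2.0, "terminal")]   # (state, reward, next_state)

for ep in range(3):
    for s, r, s_next in episode:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    print(f"after episode {ep + 1}: {V}")
# Episode 1 only moves V['s1']; later episodes pull V['s0'] up as well --
# repeated evaluation plus (epsilon-)greedy action choice is the iteration at work.
```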
# Write your answer here as a comment. Explain yourself.
Question 1.11¶
In the Distill article they say:
One of the key sub-problems of RL is value estimation – learning the long-term consequences of being in a state. This can be tricky because future returns are generally noisy, affected by many things other than the present state.
Does this statement violate the Markov definition?
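As a reminder while you answer: the Markov property constrains the transition dynamics, while the return being estimated is a sum over many future random steps. Roughly:

```latex
% Markov property: the next state and reward depend only on the current
% state and action, not on the rest of the history.
P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_0, a_0, \ldots, s_t, a_t)

% The quantity being estimated, the return, sums over many future (random)
% steps, which is one reason it is noisy even when the dynamics are Markov.
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
```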
# Write your answer here as a comment. Explain yourself.
Section 2 - Oh no, a cliff!¶
Our overall aim in this section is to study and learn how to best balance explore-exploit tendencies, as well as examining how the same level of exploration leads to different learning outcomes in three different agents.
Our algorithms of interest are Monte Carlo, SARSA, and Q-learning.
Our overall metric of success in this section amounts to “How well do we learn to play in the Cliff World?” We will assess this by studying the following (a small bookkeeping sketch follows the note below):
Are the V(s) right for each grid box?
Are each of the Q(s,a) right for each grid box?
Do the policy arrows point up “towards” the +2 winning grid box? How many point in the wrong direction?
Note: Examples of “right” or otherwise good answers are shown in the Appendix below.
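Since the simulator is read off the screen by eye, the counting is done by hand against the Appendix. Purely as a bookkeeping aid, here is a hypothetical sketch of the tally; the coordinates and arrow labels below are made-up placeholders, not values from the simulator.

```python
# Hypothetical bookkeeping sketch -- the grids would be filled in by hand
# from the simulator screen (estimated) and the Appendix (reference).
def count_wrong(estimated, reference):
    """Count grid cells where the agent's answer disagrees with the reference."""
    return sum(1 for cell in reference if estimated.get(cell) != reference[cell])

# Placeholder example: policy arrows keyed by (row, col).
reference_arrows = {(1, 0): "up", (1, 1): "right"}
agent_arrows     = {(1, 0): "up", (1, 1): "down"}
print(count_wrong(agent_arrows, reference_arrows))   # -> 1 arrow wrong
```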
Yikes!¶
Question 2.1¶
Refresh the page. Set the explore-slider to the middle. Run 20 agents (you can do this quickly by pressing the button in rapid succession). Count how many V, Q, and policy arrows are wrong. Report these numbers below, separately for Monte Carlo, SARSA, and Q-learning.
# Put Monte Carlo results as a comment here
# Put SARSA results as a comment here
# Put Q-learning results as a comment here
Question 2.2¶
Repeat the steps from Q2.1, but set the explore-slider far to the left. Run 20 agents (again, press the button in quick succession). Count how many V, Q, and policy arrows are wrong. Report these numbers below.
# Put Monte Carlo results as a comment here
# Put SARSA results as a comment here
# Put Q-learning results as a comment here
Question 2.3¶
Repeat the steps from Q2.1, but set the explore-slider far to the right. Run 20 agents (again, press the button in quick succession). Count how many V, Q, and policy arrows are wrong. Report these numbers below.
# Put Monte Carlo results as a comment here
# Put SARSA results as a comment here
# Put Q-learning results as a comment here
Question 2.4¶
Compare and contrast the results from Q2.1-2.3 above. Which level of explore-exploit seems to be doing the best?
# Write your answer here as a comment
Question 2.5¶
Based on the results from Q2.1-2.3, choose a new position on the explore-exploit slider that you think can do better.
To find out, refresh the page. Set the explore-slider where you want it. Run 20 agents (again, press the button in quick succession). Count how many V, Q, and policy arrows are wrong. Report these numbers below.
# Write the position you choose (as best you can) here
# Put Monte Carlo results as a comment here
# Put SARSA results as a comment here
# Put Q-learning results as a comment here
# Were you right? Explain why or why not here as a comment
Question 2.6¶
Consider this lab as a whole. If you had to choose one agent—Monte Carlo, SARSA, or Q-learning—as your only personal learning algorithm, which would you choose?
# Write your answer here as a comment. Explain yourself.
Appendix¶
Here is an example of good or “correct” V values for each grid box.
Here is an example of good or “correct” Q values for each grid box.
Here is an example of good or “correct” policy arrows for each grid box.