Follow the Directed - Lab

In this assignment we see what follows when exploration follows directions. That is, we see what happens when exploration is not totally random, but is directed by an information bonus of some kind. The two kinds of bonus are, no surprise, the UCB, and a new idea for us, novelty. They will be compared using a softmax Actor. Later on we will re-introduce the bounded sequential (pure) explorer from the Be Best lab.

Recall one of the readings this week:

Ng, A., Harada, D. & Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning 278–287 (1999).

Ng’s point, or really the impact of his work on us (even if he never made the point directly), is that the bonuses commonly used in directed exploration schemes, by humans and agents alike, can introduce a statistical bias into our ability to accurately measure the value of actions.

In other words: does inductive bias imply statistical bias? Here, anyway.

(Note: This is not true in general, or for all systems. It is sometimes true, and it is very important to know when.)

The value update rule for \(Q\), used by all agents, is:

\[ Q \leftarrow Q + \alpha (\hat R - Q) \]

where \(\hat R = R + \lambda I\) and \(I\) is a stand-in for the information bonus. The equation for the UCB bonus follows. In it, \(t\) is the number of steps taken so far in an episode, and \(N_a\) is the number of times that action \(a\) has been taken.

\[ I = \sqrt{\log(t) / N_a} \]

If an action hasn’t been tried very often, or not at all, then \(N_a\) will be small and the uncertainty that UCB reflects will be large, or relatively large anyway. The larger the UCB bonus is, when it is treated as an intrinsic reward, the more likely that action is to be selected.

There is no real reason to write an equation for the novelty bonus. It works like this: if this is the first time that action has been taken, add \(\lambda\) to the reward \(R\) in the update rule above. Otherwise, do nothing. This one-time bonus applies to every action, making it likely that all the actions will be explored, at least a little.
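To make the update concrete, here is a minimal sketch of the rule with each kind of bonus plugged in. This is illustrative only, not explorationlib’s implementation; the function names and defaults (q_update, ucb_bonus, novelty_bonus, alpha, lam) are mine.

# A conceptual sketch of the Q update with each kind of bonus.
# Not the explorationlib code; names and defaults are illustrative.
import numpy as np

def q_update(q, r, bonus, alpha=0.1, lam=1.0):
    # Q <- Q + alpha * ((R + lam * I) - Q)
    r_hat = r + lam * bonus
    return q + alpha * (r_hat - q)

def ucb_bonus(t, n_a):
    # sqrt(log(t) / N_a); large when an arm has rarely been tried
    return np.sqrt(np.log(t) / n_a) if n_a > 0 else np.inf

def novelty_bonus(n_a, size=1.0):
    # one-off bonus, paid only the first time an action is taken
    return size if n_a == 0 else 0.0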

(Pssssst - there is good evidence that such novelty signals exist in human, monkey, and mouse brains.)

The action policy, aka the Actor, will use the softmax sampling policy, whose free parameter \(\beta\) controls how soft the softmax is. Larger values make it harder, aka more like a pure max.
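If the softmax is new to you, here is a minimal sketch of how \(\beta\) reshapes the choice probabilities. The action values are made up; SoftmaxActor handles all of this for us in the lab.

# A toy softmax, for intuition only (SoftmaxActor does this in the lab).
import numpy as np

def softmax(q, beta):
    z = beta * np.asarray(q, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q_example = [0.2, 0.8, 0.4, 0.1]  # made-up action values
for b in [2, 4, 8]:
    print(b, softmax(q_example, b).round(3))
# Larger beta -> probability piles up on the highest-valued action.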

The questions to answer this week are:

  1. Does “good” (inductive) bias cause bad (statistical) bias in two directed schemes? If so, how much?

  2. Does this bad bias even matter? Or really, can it matter in simple examples that look common enough in the real world?

The setting is a four-armed bandit. Our last time with these robbers.[**] The lab has two sections.

First, we get to know our new agents, comparing total reward to value error, and doing some tuning.

Second, we test how much this matters by testing how well our agents recover when the best choice “runs out” of rewards and becomes the worst choice.

In this lab we use two metrics: the familiar total reward, and a new error metric (RMSE) that measures the difference between the true expected value of each action and the values learned by each agent. Entropy will show up too.
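If you want to see what these metrics boil down to, here are conceptual sketches. The lab uses the library functions total_reward, bandit_rmse, and action_entropy; the versions below are simplified stand-ins, and their names and signatures are mine.

# Simplified stand-ins for the lab's metrics (illustration only).
import numpy as np
from scipy.stats import entropy

def total_reward_sketch(rewards):
    # sum of all rewards collected in one experiment
    return np.sum(rewards)

def rmse_sketch(true_values, learned_values):
    # root-mean-square error between true and learned action values
    diff = np.asarray(true_values) - np.asarray(learned_values)
    return np.sqrt(np.mean(diff ** 2))

def action_entropy_sketch(actions, num_actions=4):
    # entropy of the empirical action distribution; higher = more spread out
    counts = np.bincount(np.asarray(actions), minlength=num_actions)
    return entropy(counts / counts.sum())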

# Install explorationlib?
!pip install --upgrade git+https://github.com/parenthetical-e/explorationlib
!pip install --upgrade git+https://github.com/MattChanTK/gym-maze.git
# Import 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Env
import explorationlib
from explorationlib.local_gym import BanditUniform4
from explorationlib.local_gym import BanditChange4

# Actors and critics
from explorationlib.agent import BanditActorCritic
from explorationlib.agent import Critic
from explorationlib.agent import CriticUCB
from explorationlib.agent import CriticNovelty
from explorationlib.agent import SoftmaxActor
from explorationlib.agent import DeterministicActor
from explorationlib.agent import BoundedSequentialActor

# Exp
from explorationlib.run import experiment
from explorationlib.util import select_exp
from explorationlib.util import load
from explorationlib.util import save

# Metrics of interest
from explorationlib.score import total_reward
from explorationlib.score import bandit_rmse
from explorationlib.score import action_entropy

# Vis
from explorationlib.plot import plot_bandit
from explorationlib.plot import plot_bandit_actions
from explorationlib.plot import plot_bandit_critic
from explorationlib.plot import plot_bandit_hist

Section 1 - The Directed

Soft explorations

The bigger \(\beta\) is in softmax exploration, the more greedy, or exploitative, the agent will be. To build some intuition, let’s look at some examples. Note: there is no bonus here.

Our bandit for this demo is:

# Env
seed_value = 60
env = BanditUniform4()
env.seed(seed_value)

# -
ax = plot_bandit(env, alpha=0.6)

Let’s plot example behavior for three experiments, at several different levels of \(\beta\). (Don’t stray from these values, at least in the questions; play all you want otherwise, of course.)

num_experiments = 3
num_steps = 4 * 60  # 60 steps / arm; a lot

betas = [2, 4, 6, 8]  
results = []
for beta in betas:
    # (SoftmaxActor is our general reference)
    ref = BanditActorCritic(
        SoftmaxActor(num_actions=env.num_arms, beta=beta),
        Critic(num_inputs=env.num_arms)
    )
    # !
    log = experiment(
        f"demo_{beta}",
        ref,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)

Visualize the effect of beta, one experiment per cell (three cells). Look over them all, please.

Experiment 0

num_experiment = 0
for name, res in zip(betas, results):
    plot_bandit_actions(
        select_exp(res, num_experiment), 
        max_steps=num_steps,
        s=4,
        title=f"Beta: {name} (N={num_experiment})", 
        color="black",
        figsize=(6,2)
    )

Experiment 1

num_experiment = 1
for name, res in zip(betas, results):
    plot_bandit_actions(
        select_exp(res, num_experiment), 
        max_steps=num_steps,
        s=4,
        title=f"Beta: {name} (N={num_experiment})", 
        color="black",
        figsize=(6,2)
    )

Experiment 2

num_experiment = 2
for name, res in zip(betas, results):
    plot_bandit_actions(
        select_exp(res, num_experiment), 
        max_steps=num_steps,
        s=4,
        title=f"Beta: {name} (N={num_experiment})", 
        color="black",
        figsize=(6,2)
    )

Question 1.1

In the paper below, the authors offered evidence that people tune their use of information bonuses, and their level of noise, depending on the task, the information available, and the horizon for exploration they have to work with. So…

To prove the point, let’s do the opposite. Running no more simulations, and knowing nothing about how much reward the runs above returned, make a best guess for the \(\beta\) value you believe would be a fair and robust choice for the rest of the lab. Keep in mind the brief description I gave you of Section 2, and that exploration bonuses will be in play soon.

Hint: Arm 2 is the most valuable choice. It is the “best” arm.

Warning: Don’t cheat. Make a guess. Explain why. And you’ll do fine, grading-wise.

# Write your answer below as a code cell. Explain your choice here as a comment.
beta = 2.0 # change me? (My choice of 2.0 is not a hint. I picked it at random.)

Let’s compare performance between the reference SoftmaxActor using a plain old Critic, a critic using UCB bonuses (CriticUCB), and a novelty-bonus critic (CriticNovelty).

Question 1.2

Make a guess: will adding a bonus increase or decrease the total rewards collected, compared to the plain critic?

# Write your answer here as comment. Explain yourself.

Question 1.3

Make a guess: will CriticUCB or CriticNovelty do better?

Hint: CriticUCB provides a running bonus reflecting how uncertain we are about each arm. CriticNovelty is a one-off bonus at the start. Before answering, consider how much of a hint an agent needs on this task, and how much noise you are using (that is, how small your beta is).

Consider the factors in the hint when explaining your answer.

# Write your answer here as comment. Explain yourself.

Well, let’s see what happens….

num_experiments = 100
bonus_weight = 0.5

# Agents
ref = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    Critic(num_inputs=env.num_arms)
)
# UCB
ucb = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticUCB(num_inputs=env.num_arms, bonus_weight=bonus_weight)
)
# Novelty
nov = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticNovelty(
        num_inputs=env.num_arms, 
        novelty_bonus=1.0,
        bonus_weight=bonus_weight
    )
)

# -
agents = [ref, ucb, nov]
names = ["softmax", "softmax-ucb", "softmax-nov"]
colors = ["blue", "green", "purple"]

# !
results = []
for name, agent in zip(names, agents):
    log = experiment(
        f"{name}",
        agent,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)

Example behavior

num_experiment = 10
for name, res, color in zip(names, results, colors):
    plot_bandit_actions(
        select_exp(res, num_experiment), 
        max_steps=120,
        s=4,
        title=name, 
        color=color,
        figsize=(6,2)
    )

Total reward

# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = total_reward(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=list(range(0, num_steps, 2)))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Total reward")
    plt.tight_layout()
    sns.despine()

Question 1.4

Which of the agents from Q1.3 will have the most entropy, and how will this relate to the distribution of total reward we just plotted?

# Write your answer here as comment. Explain yourself.

Let’s see!

# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = action_entropy(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Entropy")
plt.xlabel("Epsilon")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=np.linspace(0, 1.5, 50))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Entopy")
    plt.tight_layout()
    sns.despine()

Question 1.5

Given we are holding noise (\(\beta\)) constant, is it fair or unfair to consider entropy here to be a direct measure of the “directedness” of the exploration?

# Write your answer here as comment. Explain yourself.

Let’s increase the purity of our experiments by adding in the BoundedSequentialActor with a normal Critic. I’ll pick an ambitious bound for us. No need to tune it; assume I did a good and fair job with my choice.

Question 1.6

Will the BoundedSequentialActor do better or worse than all the other directed agents we have been playing with? Make a guess, based on the results from the Be Best lab.

# Write your answer here as comment. Explain yourself.

Let’s find out….

bonus_weight = 0.5

# Agents
ref = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    Critic(num_inputs=env.num_arms)
)
# UCB
ucb = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticUCB(num_inputs=env.num_arms, bonus_weight=bonus_weight)
)
# Novelty
nov = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticNovelty(
        num_inputs=env.num_arms, 
        novelty_bonus=1.0,
        bonus_weight=bonus_weight
    )
)
seq = BanditActorCritic(
    BoundedSequentialActor(num_actions=env.num_arms, bound=20),
    Critic(num_inputs=env.num_arms)
)

# -
agents = [ref, ucb, nov, seq]
names = ["softmax", "softmax-ucb", "softmax-nov", "b-sequential"]
colors = ["blue", "green", "purple", "grey"]

# !
results = []
for name, agent in zip(names, agents):
    log = experiment(
        f"{name}",
        agent,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)
# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = total_reward(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=list(range(0, num_steps, 2)))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Total reward")
    plt.tight_layout()
    sns.despine()

Question 1.7

Given the results in Q1.6, when we plot entropy below, do you think you should revise your prediction for how entropy and total reward relate? If you do revise it, explain.

# Write your answer here as comment. Explain yourself.

Let’s find out….

# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = action_entropy(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Entropy")
plt.xlabel("Epsilon")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=np.linspace(0, 1.5, 50))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Entopy")
    plt.tight_layout()
    sns.despine()

Question 1.8

Were you right? If not, please try and explain why.

Also, was the entropy of the BoundedSequentialActor more or less than you expected? Do you understand why?

# Write your two answers here as comment. Explain yourself.

Let’s measure the error for the values we learned in Q1.7.

Question 1.9

Rank the models from the one you expect to have the most error to the one with the least error. If you think two or more models will be about the same, that is ok.

One answer to this question could be BoundedSequentialActor > Critic = CriticUCB = CriticNovelty, but this is not the right answer; it is just to show you the form I want the answer to take.

# Write your answer here as comment. Explain yourself.

Question 1.10

Beyond a simple ranking, how do you think the differences in error will scale with the differences in total reward, which are often not that large in these kinds of simple bandit experiments?

Will the change in error be roughly linear with total reward, or a lot more, or a lot less?

# Write your answer here as comment. Explain yourself.

Let’s find out, by measuring the RMSE between the bandit’s true values and what each agent believes about them.

# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = bandit_rmse(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Error")
plt.tight_layout()
sns.despine()

# Dists
bins = np.linspace(0, np.max(scores), 40)
fig = plt.figure(figsize=(8, 4))
for (name, s, c) in zip(names, scores, colors):    
    plt.hist(s, label=name, color=c, alpha=0.4, bins=bins)
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Error")
    plt.tight_layout()
    sns.despine()

Question 1.11

Does your answer to Q1.10 look about right? If your guess was off, please try and explain why.

# Write your answer here as comment. Explain yourself.

Let’s make one big change! Let’s change the horizon from 240 steps to 60, and see how that changes total reward and error.

Question 1.12

Do you think a change to horizon will affect the ranking of the models? Why or why not?

# Write your answer here as comment. Explain yourself.
num_steps = 60
bonus_weight = 0.5

# Agents
ref = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    Critic(num_inputs=env.num_arms)
)
# UCB
ucb = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticUCB(num_inputs=env.num_arms, bonus_weight=bonus_weight)
)
# Novelty
nov = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticNovelty(
        num_inputs=env.num_arms, 
        novelty_bonus=1.0,
        bonus_weight=bonus_weight
    )
)
seq = BanditActorCritic(
    BoundedSequentialActor(num_actions=env.num_arms, bound=20),
    Critic(num_inputs=env.num_arms)
)

# -
agents = [ref, ucb, nov, seq]
names = ["softmax", "softmax-ucb", "softmax-nov", "b-sequential"]
colors = ["blue", "green", "purple", "grey"]

# !
results = []
for name, agent in zip(names, agents):
    log = experiment(
        f"{name}",
        agent,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)
# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = total_reward(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=list(range(0, num_steps, 2)))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Total reward")
    plt.tight_layout()
    sns.despine()
# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = bandit_rmse(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Error")
plt.tight_layout()
sns.despine()

# Dists
bins = np.linspace(0, np.max(scores), 40)
fig = plt.figure(figsize=(8, 4))
for (name, s, c) in zip(names, scores, colors):    
    plt.hist(s, label=name, color=c, alpha=0.4, bins=bins)
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Error")
    plt.tight_layout()
    sns.despine()

Question 1.13

Was your answer to Q1.12 correct? Why or why not?

# Write your answer here as comment. Explain yourself.

Section 2 - The World Changes

Can we recover when the best rewards run out?

In this section, our task starts off the same as in Section 1. After 60 steps, however, the best arm (arm 2) becomes the worst. The question is: who can recover? And does that recovery matter if an agent did a really good job collecting reward before the change?
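To make the change point concrete, here is a toy version of the idea. BanditChange4 does this for us; the payoff probabilities below, and the helper toy_step, are made up for illustration.

# A toy change-point bandit, for intuition only (BanditChange4 is the real one).
import numpy as np

p_before = [0.2, 0.2, 0.8, 0.2]  # arm 2 starts as the best arm (made-up values)
p_after = [0.2, 0.2, 0.1, 0.2]   # after the change, arm 2 becomes the worst

def toy_step(arm, t, num_change=60):
    # Bernoulli reward whose payoffs swap at the change point
    p = p_before if t < num_change else p_after
    return float(np.random.random() < p[arm])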

Question 2.1

Keeping in mind the change to come, which agent will do the best overall?

To answer, first imagine we run 80 steps in total, 20 steps past the change. Then imagine we run 120 steps in total. For the third part of your answer, imagine we run 240 steps (the amount we have been using so far, which is 180 steps past the change point, when the best reward runs out).

# Write your answer here as comment. Explain yourself.

Let’s find out. Note the values, plotted below, of the Env before and after the change….

# Experiment settings
# For all, keep the bandit the same
num_experiments = 100
num_change = 60
seed_value = 60

# Env
env = BanditChange4(num_change=num_change)
env.seed(seed_value)
plot_bandit(env.orginal, alpha=0.6, title="Original")
plot_bandit(env.change, alpha=0.6, title="Change")

80 steps

num_steps = 80
bonus_weight = 0.5

# Agents
ref = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    Critic(num_inputs=env.num_arms)
)
# UCB
ucb = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticUCB(num_inputs=env.num_arms, bonus_weight=bonus_weight)
)
# Novelty
nov = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticNovelty(
        num_inputs=env.num_arms, 
        novelty_bonus=1.0,
        bonus_weight=bonus_weight
    )
)
seq = BanditActorCritic(
    BoundedSequentialActor(num_actions=env.num_arms, bound=20),
    Critic(num_inputs=env.num_arms)
)

# -
agents = [ref, ucb, nov, seq]
names = ["softmax", "softmax-ucb", "softmax-nov", "b-sequential"]
colors = ["blue", "green", "purple", "grey"]

# !
results = []
for name, agent in zip(names, agents):
    log = experiment(
        f"{name}",
        agent,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)
# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = total_reward(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=list(range(0, num_steps, 2)))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Total reward")
    plt.tight_layout()
    sns.despine()

120 steps

num_steps = 120
bonus_weight = 0.5

# Agents
ref = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    Critic(num_inputs=env.num_arms)
)
# UCB
ucb = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticUCB(num_inputs=env.num_arms, bonus_weight=bonus_weight)
)
# Novelty
nov = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticNovelty(
        num_inputs=env.num_arms, 
        novelty_bonus=1.0,
        bonus_weight=bonus_weight
    )
)
seq = BanditActorCritic(
    BoundedSequentialActor(num_actions=env.num_arms, bound=20),
    Critic(num_inputs=env.num_arms)
)

# -
agents = [ref, ucb, nov, seq]
names = ["softmax", "softmax-ucb", "softmax-nov", "b-sequential"]
colors = ["blue", "green", "purple", "grey"]

# !
results = []
for name, agent in zip(names, agents):
    log = experiment(
        f"{name}",
        agent,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)
# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = total_reward(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=list(range(0, num_steps, 2)))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Total reward")
    plt.tight_layout()
    sns.despine()

240 steps

num_steps = 240
bonus_weight = 0.5

# Agents
ref = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    Critic(num_inputs=env.num_arms)
)
# UCB
ucb = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticUCB(num_inputs=env.num_arms, bonus_weight=bonus_weight)
)
# Novelty
nov = BanditActorCritic(
    SoftmaxActor(num_actions=env.num_arms, beta=beta),
    CriticNovelty(
        num_inputs=env.num_arms, 
        novelty_bonus=1.0,
        bonus_weight=bonus_weight
    )
)
seq = BanditActorCritic(
    BoundedSequentialActor(num_actions=env.num_arms, bound=20),
    Critic(num_inputs=env.num_arms)
)

# -
agents = [ref, ucb, nov, seq]
names = ["softmax", "softmax-ucb", "softmax-nov", "b-sequential"]
colors = ["blue", "green", "purple", "grey"]

# !
results = []
for name, agent in zip(names, agents):
    log = experiment(
        f"{name}",
        agent,
        env,
        num_steps=num_steps,
        num_experiments=num_experiments,
        dump=False,
        split_state=False,
    )
    results.append(log)
# Score
scores = []
for name, res, color in zip(names, results, colors):
    r = total_reward(res)
    scores.append(r)   

# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
    m.append(np.mean(s))
    sd.append(np.std(s))

# Plot means
fig = plt.figure(figsize=(5, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()

# Dists
fig = plt.figure(figsize=(8, 3))
for (name, s, c) in zip(names, scores, colors):
    plt.hist(s, label=name, color=c, alpha=0.4, bins=list(range(0, num_steps, 2)))
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xlabel("Total reward")
    plt.tight_layout()
    sns.despine()

This lab has shown you that directed exploration can improve the total rewards collected, but at the cost of sometimes-large value errors, and that this error can sometimes limit performance over long horizons, when the world changes.

Question 2.2

Imagine in our final question that you, intelligent agent that you are, could pick and choose among these four agents as your strategies, adaptively and as you see fit. Please write down a situation in which each of the agents might be the best choice.

These examples could be in bandit tasks, in cliff worlds, or in open fields. Integrate across the environments, in other words.

# Write your answer here as comment. Explain yourself.