The What Dilemma - Lab¶
Introduction¶
In this assignment we take on all the things, err, agents, that we have ever studied. And two new ones, to boot!
This is the final lab.
Space crab is a sad crab….
The decisions to be made this week are the exact opposite of every other lab.
I am giving you six tuned agents, and three “levers” which control the environment (the now familiar scent grid). Your job this week is to tweak the environment until each of the agents becomes the winning agent, in two senses.
Our target metrics:
It must gather the most total reward, by a clear margin (no error bar overlap).
It must not die the most. That is, as long as one other agent dies more often, or all agents die zero times, we’ll call that good enough. (Any experimental trial in which the exploring agent does not find at least a single target (aka reward) means that agent dies. It’s a harsh, noisy world we live in, after all.) A minimal sketch of how you might score both criteria follows this list.
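To make “winning” concrete, here is a minimal sketch of how you might score both criteria from the experiment results. The helper name check_dominance and the exact no-overlap rule (mean ± standard deviation intervals must not overlap) are my own choices for illustration, not part of explorationlib; the scores and deaths inputs are assumed to come from the total_reward and num_death functions used throughout this lab.
import numpy as np

def check_dominance(scores, deaths, focal):
    """Hypothetical helper: does the focal agent 'win' by the two lab criteria?

    scores : dict, agent name -> per-experiment total rewards (from total_reward)
    deaths : dict, agent name -> per-experiment death indicators (from num_death)
    focal  : name of the agent we hope dominates
    """
    others = [k for k in scores if k != focal]
    m = {k: np.mean(v) for k, v in scores.items()}
    sd = {k: np.std(v) for k, v in scores.items()}
    # Criterion 1: most total reward, by a clear margin (error bars do not overlap)
    clear_margin = all(m[focal] - sd[focal] > m[k] + sd[k] for k in others)
    # Criterion 2: not the most deaths (all agents dying zero times also passes)
    total_deaths = {k: int(np.sum(v)) for k, v in deaths.items()}
    not_most_deaths = (max(total_deaths.values()) == 0) or any(
        total_deaths[k] > total_deaths[focal] for k in others
    )
    return clear_margin and not_most_deaths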
Once again, one final time, it’s time for taxic explorations. We revisit the sniff world (aka ScentGrid) with a familiar twist. We look again at what happens when sense information is not just noisy, but suddenly missing altogether. A concrete, cheap-to-simulate case of this is turbulent flow.
Sections¶
There are two sections to this Lab. In the first we get to know WSLS, as well as a pure RL agent. In the second, we explore and change the environment itself.
The env levers¶
There are three “levers” you may put to use:
num_targets = (1, 1000) # these are the allowed bounds
noise_sigma = (0.0, 10)
cog_mult = (1, 10)
Our agents, this time¶
We will study six agents. They are,
A diffusion walker (aka rando-taxis) (aka DiffusionGrid)
Sniff! (aka chemo-taxis) (aka GradientDiffusionGrid)
Air cognition! (aka “smart” chemo-taxis) (aka AccumulatorGradientGrid)
Info cognition! (aka “smart” info-taxis) (aka AccumulatorInfoGrid)
RL w/ random softmax search (aka ActorCriticGrid)
Curiosity and RL union (aka rewardo- and info-taxis) (aka WSLSGrid)
The goal is, as I said, to change the world, until each agent “wins” (defined above).
Our agents, in review¶
Random search (rando-taxis): Run lengths (the travel before each random turn) are sampled from an exponential distribution. For the rando-taxis agent, num_steps simply means the number of steps (aka actions) the agent takes.
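For intuition, a minimal sketch of that rule, assuming unit-scale exponential run lengths and a uniform-random choice of direction (this is illustrative, not the DiffusionGrid internals):
import numpy as np

prng = np.random.RandomState(42)
directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]

pos = np.array([0, 0])
steps_left = 20  # the step budget; every grid move spends one step
while steps_left > 0:
    # Pick a uniform-random direction, then travel an exponential run length
    heading = np.array(directions[prng.randint(len(directions))])
    run_length = min(max(1, int(prng.exponential(scale=1))), steps_left)
    pos = pos + heading * run_length
    steps_left -= run_length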
Sniff! (chemo-taxis): Recall that our basic model of E. coli exploration is as simple as can be.
When the gradient is positive, meaning you are going “up” the gradient, the probability of turning is set to p_pos.
When the gradient is negative, the turning probability is set to p_neg. (See the code sketch below for an example.)
If the agent “decides” to turn, the direction it takes is uniform random.
The length of travel before the next turn decision is sampled from an exponential distribution, just like the DiffusionGrid.
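Here is that example: a minimal sketch of the turning rule. The names p_pos and p_neg mirror the GradientDiffusionGrid arguments used later in this lab; everything else (the function itself, the unit-scale exponential) is illustrative, not the library’s internals.
import numpy as np

prng = np.random.RandomState(42)
directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def chemotaxis_step(grad, heading, p_pos=0.0, p_neg=1.0, prng=prng):
    """One turn decision of the E. coli-style rule described above.

    grad    : change in scent concentration since the last sample
    heading : current direction of travel, a (dx, dy) tuple
    """
    # Turn probability depends only on the sign of the gradient
    p_turn = p_pos if grad > 0 else p_neg
    if prng.rand() < p_turn:
        # Turns are uniform random over the four grid directions
        heading = directions[prng.randint(len(directions))]
    # Run length before the next decision: exponential, like the DiffusionGrid
    run_length = max(1, int(prng.exponential(scale=1)))
    return heading, run_length

heading, run = chemotaxis_step(grad=0.3, heading=(1, 0))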
Costly cognition (“smart” chemo- and info-taxis): Both the chemo- and info-taxis agents use a DDM-style accumulator to try to make better decisions about the direction of the gradient. These decisions are, of course, statistical in nature. (We won’t be tuning the accumulator parameters in this lab. Assume the parameters I give you for the DDM are “good enough”.)
As in the Air Quotes Lab, we will assume that steps are, in a sense, conserved. For the two accumulator agents a step can mean one of two things: it can be spent sampling/weighing noisy scent evidence in the same location, or it can be spent moving to a new location. Note: even though the info-accumulator is more complex, it can take advantage of missing scent information to drive its behavior. It can also use positive scent hits, of course. (A minimal accumulator sketch follows.)
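A minimal sketch of the accumulator idea, under my own assumptions: evidence about the gradient’s sign is summed until it crosses a positive or negative threshold, and every sample counts as one “cognition step” spent in place. The parameter names echo the AccumulatorGradientGrid arguments used later (drift_rate, threshold, accumulate_sigma), but the code is illustrative, not the library’s implementation.
import numpy as np

prng = np.random.RandomState(42)

def ddm_decide(true_grad_sign, drift_rate=1.0, threshold=3.0,
               accumulate_sigma=1.0, max_samples=50):
    """Accumulate noisy scent-gradient evidence until a decision bound is hit.

    Returns (decision, samples_used): decision is +1 (gradient up), -1 (gradient
    down), or 0 if no bound was reached. Every sample is one 'cognition step'
    spent weighing evidence in place, rather than moving.
    """
    evidence = 0.0
    for t in range(1, max_samples + 1):
        evidence += drift_rate * true_grad_sign + prng.normal(0, accumulate_sigma)
        if evidence >= threshold:
            return +1, t
        if evidence <= -threshold:
            return -1, t
    return 0, max_samples

decision, cost = ddm_decide(true_grad_sign=+1)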
RL (rewardo-taxis): A Q-learning agent with softmax exploration. Recall: this is the same kind of agent we studied on the Cliff task, in The Oh No! - Lab.
The RL agent has no shaping function, or intrinsic reward. It does not use the scent, in other words.
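For reference, a minimal sketch of softmax (Boltzmann) action selection over learned values, the kind of exploration the RL agent uses. The beta name matches the SoftmaxActor argument used below; the helper itself is illustrative, not the library’s code.
import numpy as np

prng = np.random.RandomState(42)

def softmax_choice(action_values, beta=4.0, prng=prng):
    """Sample an action index with probability proportional to exp(beta * value)."""
    q = np.asarray(action_values, dtype=float)
    q = q - q.max()                  # subtract the max for numerical stability
    probs = np.exp(beta * q)
    probs = probs / probs.sum()
    return prng.choice(len(q), p=probs), probs

# Higher beta -> greedier; lower beta -> noisier, more random exploration.
action, probs = softmax_choice([0.5, 0.52, 0.5, 0.5], beta=4.0)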
WSLS (rewardo- and info-taxis): An agent that alternates between info-taxis and Q-learning. Both are deterministic. Exploration and exploitation without any random search, in other words.
Details: For this model a memory \(M\) is a discrete probability distribution. I define the information value \(E\) on the norm of the derivative \(\nabla M\), approximated by \(\hat E = || f(x, M) - M ||\), where \(||\cdot||\) denotes a norm. (Norms are distances, like hypotenuses.)
The goal of any info-taxis (aka, curiosity agent) is to maximize \(E\), I claim, based on a Bellman-optimal policy \(\pi^*_E\).
So, armed with \(\hat E\), I write down another (meta) policy \(\pi^{\pi}\), in terms of a mixed series of values, \(\hat E\) and environmental rewards \(R\). This WSLS rule is sketched below. The reward (exploit) policy \(\pi_R\) is Q-learning, same as for RL.
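Since the exact library rule is not reproduced here, a minimal sketch of the idea under my own assumptions: the memory \(M\) is a normalized histogram over scent observations, \(f(x, M)\) is \(M\) updated with the new observation \(x\), \(\hat E\) is the L1 norm of that change, and the win-stay/lose-shift switch deterministically follows whichever of \(\hat E\) or \(R\) is larger (with boredom acting as a threshold on \(\hat E\), echoing the WSLSGrid argument used below).
import numpy as np

def info_value(x, M, bins):
    """E-hat = || f(x, M) - M ||: how much one observation changes the memory."""
    counts, total = M
    new_counts = counts.copy()
    new_counts[np.digitize(x, bins) - 1] += 1
    old_p = counts / max(total, 1)
    new_p = new_counts / (total + 1)
    E_hat = float(np.sum(np.abs(new_p - old_p)))   # L1 norm of the change
    return E_hat, (new_counts, total + 1)

def wsls_policy(E_hat, R, boredom=0.0):
    """Deterministic switch: stay curious while E-hat (minus boredom) beats reward."""
    return "pi_E" if E_hat - boredom > R else "pi_R"

bins = np.linspace(0, 1, 10)        # same form as initial_bins in the WSLS cells
M = (np.zeros(len(bins)), 0)        # empty memory: bin counts and total count
E_hat, M = info_value(0.7, M, bins)
policy = wsls_policy(E_hat, R=0.0)  # early on, curiosity (pi_E) wins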
Our TED talk moment¶
AKA, Your moment in the sun!
AAKA, Nine new Powers are born!
AAAKA, Change the world, my students!
Let each species (agent) know the sweet comfort of utter ecological dominance. …or try to… I do not promise victory is always possible.
Install and import needed modules¶
# Install explorationlib?
!pip install --upgrade git+https://github.com/parenthetical-e/explorationlib
!pip install --upgrade git+https://github.com/MattChanTK/gym-maze.git
import shutil
import glob
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from copy import deepcopy
import explorationlib
from explorationlib.local_gym import ScentGrid
from explorationlib.agent import WSLSGrid
from explorationlib.agent import CriticGrid
from explorationlib.agent import SoftmaxActor
from explorationlib.agent import DiffusionGrid
from explorationlib.agent import GradientDiffusionGrid
from explorationlib.agent import AccumulatorGradientGrid
from explorationlib.agent import AccumulatorInfoGrid
from explorationlib.agent import ActorCriticGrid
from explorationlib.run import experiment
from explorationlib.util import select_exp
from explorationlib.util import load
from explorationlib.util import save
from explorationlib.local_gym import uniform_targets
from explorationlib.local_gym import constant_values
from explorationlib.local_gym import create_grid_scent
from explorationlib.local_gym import add_noise
from explorationlib.local_gym import create_grid_scent_patches
from explorationlib.plot import plot_position2d
from explorationlib.plot import plot_length_hist
from explorationlib.plot import plot_length
from explorationlib.plot import plot_targets2d
from explorationlib.plot import plot_scent_grid
from explorationlib.score import total_reward
from explorationlib.score import num_death
# Pretty plots
%matplotlib inline
%config InlineBackend.figure_format='retina'
%config IPCompleter.greedy=True
plt.rcParams["axes.facecolor"] = "white"
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams["font.size"] = "16"
# Dev
%load_ext autoreload
%autoreload 2
Section 1 - RL and WSLS¶
RL¶
To build some intuition, let’s plot the behavior of our RL agent as it learns where the rewards are in a (fixed) ScentGrid env. The scent noise level is 2 standard deviations, and all but 10 percent of the scent is deleted.
Question 1.1¶
Does the fact that
The noise level of the scents is 2 standard deviations, and all but 10 percent of it is deleted.
matter for the RL agent?
# Write your answer here as a comment. Explain yourself.
Getting to know you, RL¶
…and a random walker reference
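Note: the cells in this section assume the shared parameters and the ScentGrid env are already defined. If you are running Section 1 on its own, a minimal setup, assuming the same values as the Section 2 reference env further below, is:
# Shared parameters (same values as the Section 2 reference env)
num_experiments = 100
num_steps = 200
seed_value = 5838
num_targets = 20
p_scent = 0.1
noise_sigma = 2.0
detection_radius = 1
min_length = 1
target_boundary = (10, 10)
# Targets
prng = np.random.RandomState(seed_value)
targets = uniform_targets(num_targets, target_boundary, prng=prng)
values = constant_values(targets, 1)
# Scents
scents = []
for _ in range(len(targets)):
    coord, scent = create_grid_scent_patches(
        target_boundary, p=1.0, amplitude=1, sigma=2)
    scents.append(scent)
# Env
env = ScentGrid(mode=None)
env.seed(seed_value)
env.add_scents(targets, values, coord, scents, noise_sigma=noise_sigma)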
# RL
possible_actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
critic = CriticGrid(default_value=0.5)
actor = SoftmaxActor(num_actions=4, actions=possible_actions, beta=4)
rl = ActorCriticGrid(actor, critic, lr=0.1, gamma=0.1)
# Rando
diff = DiffusionGrid(min_length=min_length, scale=1)
diff.seed(seed_value)
# !
rl_exp = experiment(
f"RL",
rl,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
rand_exp = experiment(
f"rand",
diff,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
Rando search¶
Just one example, for comparison with the cells below
plot_boundary = (20, 20)
# -
num_experiment = 99
ax = None
ax = plot_position2d(
select_exp(rand_exp, num_experiment),
boundary=plot_boundary,
label=f"Rando",
color="grey",
alpha=0.6,
ax=ax,
)
ax = plot_targets2d(
env,
boundary=plot_boundary,
color="black",
alpha=1,
label="Targets",
ax=ax,
)
Search behavior, and learning¶
At three experimental time points, \(N\).
plot_boundary = (20, 20)
# -
num_experiment = 0
ax = None
ax = plot_position2d(
select_exp(rl_exp, num_experiment),
boundary=plot_boundary,
label=f"N={num_experiment}",
color="orange",
alpha=0.3,
ax=ax,
)
num_experiment = 50
ax = plot_position2d(
select_exp(rl_exp, num_experiment),
boundary=plot_boundary,
label=f"N={num_experiment}",
color="orange",
alpha=0.5,
ax=ax,
)
num_experiment = 99
ax = plot_position2d(
select_exp(rl_exp, num_experiment),
boundary=plot_boundary,
label=f"N={num_experiment}",
color="orange",
alpha=1,
ax=ax,
)
ax = plot_targets2d(
env,
boundary=plot_boundary,
color="black",
alpha=1,
label="Targets",
ax=ax,
)
Reward value, in time¶
At three experimental time points, \(N\).
fig = plt.figure(figsize=(6, 3))
plt.plot(rl_exp[0]["agent_reward_value"], label="N=0", color="orange", alpha=0.2)
plt.plot(rl_exp[50]["agent_reward_value"], label="N=50", color="orange", alpha=0.5)
plt.plot(rl_exp[99]["agent_reward_value"], label="N=99", color="orange", alpha=1)
plt.ylabel("Value $V(x)$")
plt.xlabel("Step")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Death¶
# Results
results = [rand_exp, rl_exp]
names = ["Rando", "RL"]
colors = ["grey", "orange"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
scores.append(num_death(res))
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(4, 3))
plt.bar(names, m, yerr=sd, color="black", alpha=0.6)
plt.ylabel("Deaths")
plt.tight_layout()
sns.despine()
Total reward¶
# Results
results = [rand_exp, rl_exp]
names = ["Rando", "RL"]
colors = ["grey", "orange"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
r = total_reward(res)
scores.append(r)
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(3, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()
# Dists
fig = plt.figure(figsize=(6, 3))
for (name, s, c) in zip(names, scores, colors):
plt.hist(s, label=name, color=c, alpha=0.5, bins=np.linspace(0, np.max(scores), 50))
plt.legend()
plt.xlabel("Score")
plt.tight_layout()
sns.despine()
Is it really better to WSLS?¶
# WSLS
possible_actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
num_action = len(possible_actions)
initial_bins = np.linspace(0, 1, 10)
critic_R = CriticGrid(default_value=0.0)
critic_E = CriticGrid(default_value=np.log(num_action))
actor_R = SoftmaxActor(num_actions=4, actions=possible_actions, beta=20)
actor_E = SoftmaxActor(num_actions=4, actions=possible_actions, beta=20)
wsls = WSLSGrid(
actor_E,
critic_E,
actor_R,
critic_R,
initial_bins,
lr=0.1,
gamma=0.1,
boredom=0.0
)
# Rando
diff = DiffusionGrid(min_length=min_length, scale=1)
diff.seed(seed_value)
# !
wsls_exp = experiment(
f"wsls",
wsls,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
rand_exp = experiment(
f"rand",
diff,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
plot_boundary = (20, 20)
# -
num_experiment = 99
ax = None
ax = plot_position2d(
select_exp(rand_exp, num_experiment),
boundary=plot_boundary,
label=f"Rando",
color="grey",
alpha=0.6,
ax=ax,
)
ax = plot_targets2d(
env,
boundary=plot_boundary,
color="black",
alpha=1,
label="Targets",
ax=ax,
)
plot_boundary = (20, 20)
# -
num_experiment = 0
ax = None
ax = plot_position2d(
select_exp(wsls_exp, num_experiment),
boundary=plot_boundary,
label=f"N={num_experiment}",
color="orangered",
alpha=0.3,
ax=ax,
)
num_experiment = 50
ax = plot_position2d(
select_exp(wsls_exp, num_experiment),
boundary=plot_boundary,
label=f"N={num_experiment}",
color="orangered",
alpha=0.5,
ax=ax,
)
num_experiment = 99
ax = plot_position2d(
select_exp(wsls_exp, num_experiment),
boundary=plot_boundary,
label=f"N={num_experiment}",
color="orangered",
alpha=1,
ax=ax,
)
ax = plot_targets2d(
env,
boundary=plot_boundary,
color="black",
alpha=1,
label="Targets",
ax=ax,
)
Reward value, in time¶
At three experimental time points, \(N\).
fig = plt.figure(figsize=(6, 3))
plt.plot(wsls_exp[0]["agent_reward_value"], label="N=0", color="orangered", alpha=0.2)
plt.plot(wsls_exp[50]["agent_reward_value"], label="N=50", color="orangered", alpha=0.5)
plt.plot(wsls_exp[99]["agent_reward_value"], label="N=99", color="orangered", alpha=1)
plt.ylabel("Value $V(x)$")
plt.xlabel("Step")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Death¶
# Results
results = [rand_exp, wsls_exp]
names = ["Rando", "WSLS"]
colors = ["grey", "orangered"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
scores.append(num_death(res))
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(3, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.9)
plt.ylabel("Deaths")
plt.tight_layout()
sns.despine()
Total reward¶
# Results
results = [rand_exp, wsls_exp]
names = ["Rando", "WSLS"]
colors = ["grey", "orangered"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
r = total_reward(res)
scores.append(r)
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(3, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()
# Dists
fig = plt.figure(figsize=(6, 4))
for (name, s, c) in zip(names, scores, colors):
plt.hist(s, label=name, color=c, alpha=0.5, bins=np.linspace(0, np.max(scores), 50))
plt.legend()
plt.xlabel("Score")
plt.tight_layout()
sns.despine()
So, is it better to be curious and greedy, or greedy and noisy?¶
A comparison between RL and WSLS (and rando)
Death¶
# Results
results = [rand_exp, rl_exp, wsls_exp]
names = ["Rando", "RL", "WSLS"]
colors = ["grey", "orange", "orangered"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
scores.append(num_death(res))
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(4, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.9)
plt.ylabel("Deaths")
plt.tight_layout()
sns.despine()
Total reward¶
# Results
results = [rand_exp, rl_exp, wsls_exp]
names = ["Rando", "RL", "WSLS"]
colors = ["grey", "orange", "orangered"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
r = total_reward(res)
scores.append(r)
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(3, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()
# Dists
fig = plt.figure(figsize=(6, 4))
for (name, s, c) in zip(names, scores, colors):
plt.hist(s, label=name, color=c, alpha=0.5, bins=np.linspace(0, np.max(scores), 50))
plt.legend()
plt.xlabel("Score")
plt.tight_layout()
sns.despine()
Question 1.2¶
The WSLS approach should have generated more total reward. It may also have had a few (< 10) deaths. (If it did not, try running the WSLS cells again).
Likewise, if you study WSLS search behavior and value learning time courses, you’ll see it “settles down” to one rewarding spot and can stay there.
In other words, WSLS is a method with very high inductive bias.
A theme of this class has been, “bias is great… until it is not”.
Based on the results in this lab so far, and the lecture on WSLS, how could you change the env so that the exploration bias behind WSLS (deterministic learning maximization) fails, but the random search of RL does not?
Note: It is helpful to consider the total reward distribution plots carefully. The middle and the bottom range, especially. (Try rerunning?)
Note: Everything is on the table. Your counter-example can be whatever you want, well as long as it is physically possible. Be imaginative!
# Write your answer here as a comment. Explain yourself.
Section 2¶
Let’s remake the world….
All our agents¶
Run on the same world from Section 1, as an example of where things stand, and to give you a place to start in your world building.
Initial (reference) env¶
# Noise and delete
p_scent = 0.1
noise_sigma = 2.0
# Shared
num_experiments = 100
num_steps = 200
seed_value = 5838
num_targets = 20 # with 80, agents are more competitive!
# ! (leave alone)
detection_radius = 1
cog_mult = 1
max_steps = 1
min_length = 1
target_boundary = (10, 10)
# Targets
prng = np.random.RandomState(seed_value)
targets = uniform_targets(num_targets, target_boundary, prng=prng)
values = constant_values(targets, 1)
# Scents
scents = []
for _ in range(len(targets)):
coord, scent = create_grid_scent_patches(
target_boundary, p=1.0, amplitude=1, sigma=2)
scents.append(scent)
# Env
env = ScentGrid(mode=None)
env.seed(seed_value)
env.add_scents(targets, values, coord, scents, noise_sigma=noise_sigma)
Run ‘em all!¶
# Agents
# rando
diff = DiffusionGrid(min_length=min_length, scale=1)
diff.seed(seed_value)
# sniff
sniff = GradientDiffusionGrid(
min_length=min_length,
scale=1.0,
p_neg=1,
p_pos=0.0
)
sniff.seed(seed_value)
# smart chemo
chemo = AccumulatorGradientGrid(
min_length=min_length,
max_steps=max_steps,
drift_rate=1,
threshold=3,
accumulate_sigma=1
)
chemo.seed(seed_value)
# smart info
info = AccumulatorInfoGrid(
min_length=min_length,
max_steps=max_steps,
drift_rate=1,
threshold=3,
accumulate_sigma=1
)
info.seed(seed_value)
# RL
possible_actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
critic = CriticGrid(default_value=0.5)
actor = SoftmaxActor(num_actions=4, actions=possible_actions, beta=4)
rl = ActorCriticGrid(actor, critic, lr=0.1, gamma=0.1)
# WSLS
num_action = len(possible_actions)
initial_bins = np.linspace(0, 1, 10)
critic_R = CriticGrid(default_value=0.5)
critic_E = CriticGrid(default_value=np.log(num_action))
actor_R = SoftmaxActor(num_actions=4, actions=possible_actions, beta=20)
actor_E = SoftmaxActor(num_actions=4, actions=possible_actions, beta=20)
wsls = WSLSGrid(
actor_E,
critic_E,
actor_R,
critic_R,
initial_bins,
lr=0.1,
gamma=0.1,
boredom=0.0
)
# !
rand_exp = experiment(
f"rand",
diff,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
sniff_exp = experiment(
f"sniff",
sniff,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
chemo_exp = experiment(
f"chemo",
chemo,
env,
num_steps=num_steps * cog_mult,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
info_exp = experiment(
f"info",
info,
env,
num_steps=num_steps * cog_mult,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
rl_exp = experiment(
f"rl",
rl,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
wsls_exp = experiment(
f"wsls",
wsls,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
Search behavior¶
Experiment 99
plot_boundary = (20, 20)
num_experiment = 99
# Results
results = [sniff_exp, chemo_exp, info_exp, rand_exp, rl_exp, wsls_exp]
names = ["Sniff", "Chemo", "Info", "Rando", "RL", "WSLS"]
colors = ["purple", "blue", "green", "grey", "orange", "orangered"]
for name, res, color in zip(names, results, colors):
ax = None
ax = plot_position2d(
select_exp(res, num_experiment),
boundary=plot_boundary,
label=f"{name}",
color=color,
alpha=0.6,
ax=ax,
)
ax = plot_targets2d(
env,
boundary=plot_boundary,
color="black",
alpha=1,
label="Targets",
ax=ax,
)
Death¶
# Results
results = [sniff_exp, chemo_exp, info_exp, rand_exp, rl_exp, wsls_exp]
names = ["Sniff", "Chemo", "Info", "Rando", "RL", "WSLS"]
colors = ["purple", "blue", "green", "grey", "orange", "orangered"]
# Score by eff
scores = []
for name, res, color in zip(names, results, colors):
scores.append(num_death(res))
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(6, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Deaths")
plt.tight_layout()
sns.despine()
Total reward¶
# Results
results = [sniff_exp, chemo_exp, info_exp, rand_exp, rl_exp, wsls_exp]
names = ["Sniff", "Chemo", "Info", "Rando", "RL", "WSLS"]
colors = ["purple", "blue", "green", "grey", "orange", "orangered"]
# Score
scores = []
for name, res, color in zip(names, results, colors):
r = total_reward(res)
scores.append(r)
# Tabulate
m, sd = [], []
for (name, s, c) in zip(names, scores, colors):
m.append(np.mean(s))
sd.append(np.std(s))
# Plot means
fig = plt.figure(figsize=(6, 3))
plt.bar(names, m, yerr=sd, color=colors, alpha=0.6)
plt.ylabel("Total reward")
plt.tight_layout()
sns.despine()
# Dists
# fig = plt.figure(figsize=(7, 5))
for (name, s, c) in zip(names, scores, colors):
fig = plt.figure(figsize=(7, 3))
plt.hist(s, label=name, color=c, alpha=0.4, bins=np.linspace(0, np.max(scores), 50))
plt.legend()
plt.xlabel("Score")
plt.tight_layout()
sns.despine()
Change the world!¶
I am giving you three parameters (aka levers) which can change which agent dominates the others. In the above reference, for example, RL and WSLS dominate.
By dominate I mean the agent has:
The most total reward
Not the most deaths (a weaker criterion)
The parameters, and the acceptable ranges, are:
num_targets = (1, 1000) # these are the allowed bounds
noise_sigma = (0.0, 10)
cog_mult = (1, 10)
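One way to organize your search is a small sweep over the three levers, rebuilding the env and re-running the agents each time. A minimal scaffold is below; the lever_grid values and the make_env helper are my own, but every call inside make_env matches the reference code that follows.
# Hypothetical sweep scaffold: rebuild the env for each lever setting, rerun the
# agents, and keep whichever setting best favors your focal agent.
lever_grid = [
    dict(num_targets=5, noise_sigma=0.0, cog_mult=1),
    dict(num_targets=100, noise_sigma=5.0, cog_mult=4),
    dict(num_targets=500, noise_sigma=10.0, cog_mult=10),
]

def make_env(num_targets, noise_sigma, seed_value=5838, target_boundary=(10, 10)):
    prng = np.random.RandomState(seed_value)
    targets = uniform_targets(num_targets, target_boundary, prng=prng)
    values = constant_values(targets, 1)
    scents = []
    for _ in range(len(targets)):
        coord, scent = create_grid_scent_patches(
            target_boundary, p=1.0, amplitude=1, sigma=2)
        scents.append(scent)
    env = ScentGrid(mode=None)
    env.seed(seed_value)
    env.add_scents(targets, values, coord, scents, noise_sigma=noise_sigma)
    return env

for levers in lever_grid:
    env = make_env(levers["num_targets"], levers["noise_sigma"])
    # ...now rebuild the agents and repeat the experiment() calls from the
    # reference code, passing num_steps * levers["cog_mult"] to the two
    # accumulator agents, then compare total_reward and num_death as above.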
Assignment¶
Use the reference code below, along with the plotting functions used throughout this lab, to answer the questions which follow it.
If you cannot find an env that lets the agent in question dominate, report the best results you can.
Reference code¶
# ---
# Change me
num_targets = 20
cog_mult = 1
noise_sigma = 2.0
# ---
# Shared (leave alone)
num_experiments = 100
num_steps = 200
seed_value = 5838
detection_radius = 1
p_scent = 0.1
max_steps = 1
min_length = 1
target_boundary = (10, 10)
# Targets
prng = np.random.RandomState(seed_value)
targets = uniform_targets(num_targets, target_boundary, prng=prng)
values = constant_values(targets, 1)
# Scents
scents = []
for _ in range(len(targets)):
coord, scent = create_grid_scent_patches(
target_boundary, p=1.0, amplitude=1, sigma=2)
scents.append(scent)
# Env
env = ScentGrid(mode=None)
env.seed(seed_value)
env.add_scents(targets, values, coord, scents, noise_sigma=noise_sigma)
# Agents
# rando
diff = DiffusionGrid(min_length=min_length, scale=1)
diff.seed(seed_value)
# sniff
sniff = GradientDiffusionGrid(
min_length=min_length,
scale=1.0,
p_neg=1,
p_pos=0.0
)
sniff.seed(seed_value)
# smart chemo
chemo = AccumulatorGradientGrid(
min_length=min_length,
max_steps=max_steps,
drift_rate=1,
threshold=3,
accumulate_sigma=1
)
chemo.seed(seed_value)
# smart info
info = AccumulatorInfoGrid(
min_length=min_length,
max_steps=max_steps,
drift_rate=1,
threshold=3,
accumulate_sigma=1
)
info.seed(seed_value)
# RL
possible_actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
critic = CriticGrid(default_value=0.5)
actor = SoftmaxActor(num_actions=4, actions=possible_actions, beta=4)
rl = ActorCriticGrid(actor, critic, lr=0.1, gamma=0.1)
# WSLS
num_action = len(possible_actions)
initial_bins = np.linspace(0, 1, 10)
critic_R = CriticGrid(default_value=0.5)
critic_E = CriticGrid(default_value=np.log(num_action))
actor_R = SoftmaxActor(num_actions=4, actions=possible_actions, beta=20)
actor_E = SoftmaxActor(num_actions=4, actions=possible_actions, beta=20)
wsls = WSLSGrid(
actor_E,
critic_E,
actor_R,
critic_R,
initial_bins,
lr=0.1,
gamma=0.1,
boredom=0.0
)
# !
rand_exp = experiment(
f"rand",
diff,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
sniff_exp = experiment(
f"sniff",
sniff,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
chemo_exp = experiment(
f"chemo",
chemo,
env,
num_steps=num_steps * cog_mult,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
info_exp = experiment(
f"info",
info,
env,
num_steps=num_steps * cog_mult,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
rl_exp = experiment(
f"rl",
rl,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
wsls_exp = experiment(
f"wsls",
wsls,
env,
num_steps=num_steps,
num_experiments=num_experiments,
dump=False,
split_state=True,
seed=seed_value
)
Question 2.1¶
What environmental parameters, from the ranges given just above, lead the rando agent (aka DiffusionGrid) to dominate the others?
# Put your best code/result here.
# To prove domination - show me bar plots, and distribution plots to make your case!
Question 2.2¶
Explain why you think these parameters were best. Or, if you could not make the agent dominate, explain why you could not as best you can.
# Write your answer here as a comment. Explain yourself.
Question 2.3¶
What environmental parameters, from the ranges given just above, lead the sniff! agent (aka GradientDiffusionGrid) to dominate the others?
# Put your best code/result here.
# To prove domination - show me bar plots, and distribution plots to make your case!
Question 2.4¶
Explain why you think these parameters were best. Or, if you could not make the agent dominate, explain why you could not as best you can.
# Write your answer here as a comment. Explain yourself.
Question 2.5¶
What environmental parameters, from the ranges given just above, lead the smart-chemo agent (aka AccumulatorGradientGrid) to dominate the others?
# Put your best code/result here.
# To prove domination - show me bar plots, and distribution plots to make your case!
Question 2.6¶
Explain why you think these parameters were best. Or, if you could not make the agent dominate, explain why you could not as best you can.
# Write your answer here as a comment. Explain yourself.
Question 2.7¶
What environmental parameters, from the ranges given just above, lead the smart-info agent (aka AccumulatorInfoGrid) to dominate the others?
# Put your best code/result here.
# To prove domination - show me bar plots, and distribution plots to make your case!
Question 2.8¶
Explain why you think these parameters were best. Or, if you could not make the agent dominate, explain why you could not as best you can.
# Write your answer here as a comment. Explain yourself.