Stanford CS 224R · Homework 2 · Tabular Q-Learning

Q-Learning on a Gridworld from Absolute Zero

A 5×4 grid. Two goals at different distances. One agent that knows nothing. By the end: a working Q-table, a clear understanding of why a small change in step reward flips the optimal policy, and the foundation for every neural-net RL algorithm in this homework.

No prior RL assumed · NumPy primer included · Every line you wrote, decoded · The reward-shaping lesson

Chapter 01

The Gridworld Setup

Before we touch a hammer or a robot arm, we solve a simpler version of the same problem in a setting where you can hold every state in your head. That's the point of Problem 1.

The world: a 5×4 grid. The agent always starts at (0, 0) — bottom-left. There are two goal cells:

Goal 1 (G1) sits at (4, 3), the top-right corner; Goal 2 (G2) sits at (4, 0), the bottom-right corner; the start S is at (0, 0).

Four actions: left, right, up, down. Moving off the edge keeps you in place but still costs you the per-step reward. Reaching either goal ends the episode.

Three reward scenarios

The reward function is parameterized by three numbers:

Reward: r(s, a, s') = r_step + R1 · 1[s' = Goal 1] + R2 · 1[s' = Goal 2]

1[·] is the indicator function: 1 if true, 0 if false. So every step the agent gets r_step, plus a one-time bonus when it hits a goal.
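To make the formula concrete, here is a direct transcription in Python. This is illustrative only — the starter code computes the same quantity inside GridWorld.step rather than as a standalone function, and the name reward below is mine:

def reward(next_state, r_step, goal_1, goal_2, R1, R2):
    r = r_step                      # every transition pays the per-step reward
    if next_state == goal_1:
        r += R1                     # one-time bonus for landing on Goal 1
    if next_state == goal_2:
        r += R2                     # one-time bonus for landing on Goal 2
    return r

# Scenario 1: stepping onto Goal 1 at (4, 3) yields -1 + 10 = 9
print(reward((4, 3), r_step=-1, goal_1=(4, 3), goal_2=(4, 0), R1=10, R2=5))   # 9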

Scenario | r_step | R1 (Goal 1) | R2 (Goal 2) | Optimal goal
1        | −1     | 10          | 5           | Goal 1 (the far one)
2        | −2     | 10          | 5           | Goal 2 (the near one)
3        | +1     | 1           | 1           | Neither — stall forever

Same algorithm, three different reward functions, three different optimal policies. The lesson of P1 is exactly this: the optimal policy is not determined by goal magnitude alone — the per-step reward reshapes the value landscape. We'll see why each scenario produces what it does.

Why this is the right warm-up

Tabular Q-learning has every property of "real" RL except one: function approximation. The state space is enumerable, the actions are discrete, and you can store every Q-value in a table, with nothing hidden inside a neural network.

Once you internalize this version, the leap to neural Q-learning (Problem 3) and policy gradients (Problem 2) is just "replace the table with a network and add stabilization tricks."

Chapter 02

NumPy Primer

If you know Python but NumPy is fuzzy, read this. You'll need every concept here when reading the code.

1. ndarrays are like multi-dimensional Python lists, but fast

A NumPy array (ndarray) is a contiguous block of numbers with a shape. It supports vectorized math: operations apply elementwise without writing a for loop.

import numpy as np

a = np.array([1, 2, 3])         # shape (3,)
b = np.zeros((4, 5, 4))           # shape (4, 5, 4), all zeros
c = a * 10                          # [10, 20, 30] — vectorized

2. Shapes and indexing

The Q-table in this homework has shape (height, width, 4) = (4, 5, 4). So q_table is a 4×5×4 cube of numbers. Indexing pulls out a slice:

q_table = np.zeros((4, 5, 4))           # shape (4, 5, 4)

q_table[2]                              # shape (5, 4) — one whole y-row
q_table[2, 3]                           # shape (4,)  — the 4 Q-values at one cell
q_table[2, 3, 1]                        # scalar — one specific Q-value

The (x, y) vs (y, x) trap

The state is stored as (x, y) tuples (column, row) but the Q-table is indexed as q_table[y, x, action] (row, column, action). This is because NumPy follows row-major convention: the first axis is the slowest-varying. When the code does x, y = state followed by q_table[y, x, action], that's the unpack-then-flip-the-order pattern. Get this wrong and you train Q-values for the wrong cells without crashing — silent bug.
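A tiny illustration of the trap, with hypothetical values — swapping the indices silently writes to a different, perfectly valid cell:

import numpy as np

q_table = np.zeros((4, 5, 4))   # (height, width, actions)
state = (3, 2)                  # (x, y): column 3, row 2
x, y = state

q_table[y, x, 1] = 7.0          # correct: row index first, then column
print(q_table[2, 3, 1])         # 7.0 — the cell we meant
print(q_table[3, 2, 1])         # 0.0 — swapped indices hit a different valid cell; no error raised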

3. Aggregations: max, argmax, mean, sum

Two close-but-different operations:

q = np.array([0.1, 0.5, 0.3, 0.2])

np.max(q)            # 0.5  — the VALUE
np.argmax(q)         # 1    — the INDEX where the max lives

You almost always want argmax when picking actions: it returns "which action is best" (an integer), not "how good is the best action" (a float).

4. Random number generators

NumPy has two APIs for randomness. The old one (np.random.randint(...)) uses a global state — convenient but bad for reproducibility. The new one creates an explicit Generator:

rng = np.random.default_rng(seed=42)

rng.random()                # float in [0, 1)
rng.integers(4)             # int in {0, 1, 2, 3}, INCLUSIVE 0, EXCLUSIVE 4
rng.integers(low=2, high=5) # int in {2, 3, 4}

The training code creates a seeded Generator (rng = np.random.default_rng(scenario.seed)). When your code uses this rng object, runs are reproducible. If you accidentally use np.random.* directly, you bypass the seed and get different results every time.
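A quick demonstration of what the seeded Generator buys you (a sketch — the particular integers depend on the seed, but the two printed lines always match each other):

import numpy as np

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
print(rng_a.integers(4, size=5))   # five integers in {0, 1, 2, 3}
print(rng_b.integers(4, size=5))   # identical to the line above: same seed, same stream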

rng.integers vs rng.integer

The method is integers with an s. (Plural — consistent with NumPy's "vectorized by default" mindset, even when generating one number.) rng.integer doesn't exist and will throw AttributeError. Easy typo to make.

5. The dataclass and frozen=True

The starter code uses Python dataclasses heavily:

from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    goal_1_reward: float
    goal_2_reward: float
    step_reward: float
    episodes: int = 5000
    alpha: float = 0.2
    gamma: float = 0.98
    # ...

This is a structured record with fields. frozen=True makes instances immutable — you can't reassign scenario.alpha = 0.3 after construction. Useful for hyperparameters: ensures the algorithm code can't accidentally mutate them mid-training.
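A small demonstration of that immutability guarantee, assuming the trimmed Scenario definition just above (only its four required fields):

import dataclasses

s = Scenario(name="demo", goal_1_reward=10.0, goal_2_reward=5.0, step_reward=-1.0)
try:
    s.alpha = 0.3
except dataclasses.FrozenInstanceError as err:
    print("blocked:", err)          # cannot assign to field 'alpha'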

That's enough NumPy. Onward to the math.

Chapter 03

RL Foundations

Before deriving Q-learning, let's nail down the language. Five concepts.

Definition
State — s

A complete description of the world right now. In this gridworld, the state is just (x, y), the agent's grid position. The state has the Markov property: it contains everything you need to know to predict the next state and reward, regardless of history.

Definition
Action — a

What the agent does. Here: one of {left, right, down, up}, integer-coded as {0, 1, 2, 3}. The action is the agent's only way to influence the world.

Definition
Reward — r(s, a, s')

A scalar feedback signal: how good is this transition? The agent's only goal is to maximize total reward over time. We design the reward function to encode whatever we want the agent to do.

Definition
Policy — π(a | s)

A rule for choosing actions. Could be deterministic (a = f(s)) or stochastic (a ~ π(·|s)). In Q-learning, we derive the policy from the Q-table by argmax: pick the action with the highest Q-value.

Definition
Return — Gt

The total discounted reward from time t to the end:

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ...

The discount factor γ ∈ [0, 1) is 0.98 in this homework. Rewards far in the future are worth less than rewards now.
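A quick numeric check of the definition, using a hypothetical 4-step episode (three step penalties, then a transition that pays −1 + 10 on reaching a goal):

gamma = 0.98
rewards = [-1, -1, -1, 9]
G0 = sum(gamma ** k * r for k, r in enumerate(rewards))
print(round(G0, 2))               # 5.53 — the final reward is discounted by gamma^3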

Why discount?

Three reasons:

  1. Mathematical: without γ < 1, infinite-horizon returns can diverge. With γ < 1, the sum is bounded by r_max / (1 − γ).
  2. Practical: agents should prefer rewards now over later, all else equal — matches intuition about money, food, etc.
  3. Algorithmic: it makes the Bellman backup a contraction, which is what makes Q-learning provably converge.

The objective

RL boils down to one thing:

Objective: find π that maximizes 𝔼[G_t] = 𝔼[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]

The expectation accounts for randomness in policy and environment. We want a policy that performs well on average, not one that got lucky once.

The Q-function is a tool to find that optimal policy. Let's build it.

Chapter 04

The Bellman Equation

Q-learning rests on one beautiful equation. Here's how it falls out of the definitions.

From Gt to Q(s, a)

Define the Q-value as the expected return given an action:

Q-value: Q^π(s, a) = 𝔼[ G_t | s_t = s, a_t = a, then follow π ] = 𝔼[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | ... ]

Now the trick. Pull the first term out of the sum:

Step 1: Split the return. Q^π(s, a) = 𝔼[ r_t + γ (r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ...) ] = 𝔼[ r_t + γ G_{t+1} ]

The bracketed thing in the second term is just the return from time t+1 onward — which is itself a Q-value, evaluated at the next state and the next action. So:

Step 2: Recognize the recursion. Q^π(s, a) = 𝔼[ r(s, a, s') + γ Q^π(s', a') ]   where s' ~ env, a' ~ π(·|s')

This is the Bellman expectation equation. The Q-value of (s, a) decomposes into the immediate reward plus the discounted Q-value at the next step.

The Bellman optimality equation

If we want the optimal Q-function (corresponding to the optimal policy), we replace "follow π" with "always pick the best action":

Bellman optimality: Q*(s, a) = 𝔼_{s'}[ r(s, a, s') + γ max_{a'} Q*(s', a') ]

The max_{a'} says: from the next state, take whichever action gives the highest Q-value. This is what makes the equation an optimality condition — it specifies the value of always acting greedily from here forward.

Why this matters

The Bellman optimality equation is a fixed-point equation. The optimal Q* is the unique solution. If you can solve this equation — even approximately — you've solved the RL problem, because the optimal policy is just π*(s) = argmax_a Q*(s, a).
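In table form, extracting that greedy policy is a single argmax over the action axis — a sketch assuming the (height, width, 4) Q-table used in this homework:

import numpy as np

q_table = np.zeros((4, 5, 4))               # stand-in for a trained table
greedy_policy = q_table.argmax(axis=-1)     # shape (4, 5): best action index at every cell
state_values = q_table.max(axis=-1)         # V*(s) = max_a Q*(s, a) at every cell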

Two ways to solve it

Method          | Idea                                                                          | Caveat
Value iteration | Repeatedly update Q[s, a] ← expected RHS, sweeping all states                | Requires knowing the env transition function
Q-learning      | Sample (s, a, r, s') tuples by acting; nudge Q[s, a] toward the observed RHS | Doesn't need transition probabilities; learns from experience
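For contrast, here is a minimal value-iteration sketch for this gridworld. It is not part of the starter file (the name value_iteration is mine), and it cheats: it calls env.step as a known model, sweeping every state — exactly the access Q-learning does without. It assumes the GridWorld class shown in Chapter 9.

import numpy as np

def value_iteration(env, gamma=0.98, sweeps=200):
    q = np.zeros((env.height, env.width, 4))
    for _ in range(sweeps):
        new_q = np.zeros_like(q)
        for y in range(env.height):
            for x in range(env.width):
                if (x, y) in (env.goal_1, env.goal_2):
                    continue                              # terminal cells keep Q = 0
                for a in range(4):
                    (nx, ny), r, done = env.step((x, y), a)
                    new_q[y, x, a] = r if done else r + gamma * np.max(q[ny, nx])
        q = new_q                                         # exact Bellman backup, no learning rate
    return q

# e.g. q_star = value_iteration(GridWorld(goal_1_reward=10, goal_2_reward=5, step_reward=-1))

Because the backup is exact, this should converge to the same Q* that Q-learning only approaches by sampling.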

We don't know the env's transition function in general — we just observe samples by acting in it. So we use Q-learning. Next chapter.

Chapter 05

Q-Learning, Derived

Q-learning is the Bellman optimality equation, written as an update rule.

The naive idea

If we knew the right-hand side r + γ max_{a'} Q*(s', a') exactly, we'd just set Q[s, a] = RHS. Done.

We don't. We have a noisy sample of the RHS, computed from one observed transition. The trick: nudge Q[s, a] toward the observed RHS, with a learning rate α controlling how much.

Q-learning update: Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

This is the most important equation in this homework. Memorize it.

Anatomy of the update

Read it as three pieces:

  1. Q(s, a) — the current estimate.
  2. r + γ max_{a'} Q(s', a') — the target: a one-sample estimate of the Bellman right-hand side.
  3. target − Q(s, a) — the TD error, scaled by the learning rate α before being added back.

So the update reads: "new Q = old Q + α · (how wrong we were)." Or equivalently:

Equivalent form Q(s, a) ← (1 − α) Q(s, a) + α · target

"New Q = blend of old Q (weight 1−α) and the new evidence (weight α)." Smoothly averaging old belief with new data. With α=0.2 (this homework's default), each update moves us 20% of the way toward the latest evidence.

The terminal-state edge case

If s' is terminal (the episode just ended), there's no future:

target = r                                  if done
target = r + γ max_{a'} Q(s', a')           otherwise

In code, this is an if/else. If you forget and use the second formula even on terminal steps, you'd add γ max_{a'} Q(goal, a') — but the Q-values at the goal cell were never observed during a "from-goal" transition because the episode ended there. They're just zeros (or worse, learned junk from spurious updates). Forgetting the if/else is a real bug, not a stylistic concern.

Why does this work?

Theorem (Watkins, 1989): if every (s, a) pair is visited infinitely often and α is decayed appropriately, tabular Q-learning provably converges to Q* — the unique fixed point of the Bellman optimality equation.

Why? At convergence, the expected TD error is zero. That happens exactly when Q satisfies the Bellman optimality equation. So the algorithm is doing stochastic gradient descent on the Bellman residual, and it converges because the Bellman operator is a contraction.

The intuition that survives all the math

You're never told the right answer. You only see one-step samples of reality. Each sample lets you update one entry in your table by a little, in the direction of less error. Repeat 100,000 times. The errors average out, and your table converges to the truth.

Chapter 06

Exploration vs. Exploitation

The Q-learning update is one half of the algorithm. The other half: how do we choose which (s, a) to update? If we always pick the action with the highest current Q-value (greedy), we'll never try alternatives. We might miss a better path.

The dilemma

You need both. Pure greedy and you'll get stuck in suboptimal habits. Pure random and you'll never act on what you've learned.

ε-greedy: the simplest balance

ε-greedy policy:
With probability ε:       a ~ uniform({left, right, up, down})   (explore)
With probability 1 − ε:   a = argmax_{a'} Q(s, a')               (exploit)

Set ε high early in training (lots of exploration) and decay it low later (mostly exploit what you've learned). This homework decays linearly from 0.4 down to 0.02 across 5000 episodes:

def epsilon_for_episode(scenario, episode_idx):
    fraction = episode_idx / max(1, scenario.episodes - 1)
    return scenario.epsilon_start + fraction * (scenario.epsilon_end - scenario.epsilon_start)
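A quick sanity check of the schedule, assuming the epsilon_for_episode function just shown and using a SimpleNamespace as a stand-in for a Scenario:

from types import SimpleNamespace

s = SimpleNamespace(episodes=5000, epsilon_start=0.4, epsilon_end=0.02)
for ep in (0, 2500, 4999):
    print(ep, round(epsilon_for_episode(s, ep), 3))   # roughly 0.4, 0.21, 0.02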

Why early exploration matters

At episode 0, the Q-table is all zeros. Argmax over zeros returns 0 (LEFT). Without exploration, the agent would pick LEFT every step, hit the left wall, never see a goal, and never learn anything. The Q-table would stay zero forever.

With ε = 0.4, the agent still acts greedily on about 60% of steps but randomizes on the other 40%. Over 20 timesteps per episode, that's plenty of stumbling around to eventually trip into a goal. The first time it does, the goal reward propagates one cell back into the Q-table. Next episode, the cell before the goal has a non-zero Q-value, so greedy starts working from there. It cascades.

Connection to the rest of HW2

Neural-net RL has the same problem on continuous actions. PPO (Problem 2) handles it via entropy bonus on a stochastic policy. Off-policy SAC (Problem 3) handles it via fixed-std Gaussian noise + replay buffer diversity. All three are forms of "make the policy stochastic enough to discover the reward."

Chapter 07

How Learning Propagates

This is the part of Q-learning that feels almost magical when you see it for the first time. The reward only happens at terminal cells — but somehow, after enough episodes, every cell in the path knows it's on a path to reward. How?

Episode 1: first reward seen

Initially, q_table is all zeros. The agent ε-greedy-walks for 20 steps, mostly randomly. By chance, on some episode it stumbles into Goal 1 from cell (4, 2) by going UP.

The transition: (4, 2) → UP → (4, 3), reward = −1 + 10 = 9, done = True.

The TD update with done:

target = 9
Q[(4,2), UP] ← 0 + 0.2 · (9 − 0) = 1.8

One cell in the table now has a non-zero Q-value. Everything else is still zero.

Episode 2: the value leaks one step backward

Suppose the agent reaches (3, 2) and takes RIGHT. Transition: (3, 2) → (4, 2), reward = −1, not done.

The TD update without done:

target = −1 + 0.98 · max_{a'} Q[(4,2), a']
       = −1 + 0.98 · 1.8          (UP is now the best action at (4,2))
       = 0.764
Q[(3,2), RIGHT] ← 0 + 0.2 · 0.764 ≈ 0.153

Now (3, 2), RIGHT has a non-zero Q. The agent now "knows" that going right from (3, 2) is on a path to reward.

Episode N: the wave reaches the start

Repeat the propagation. Each episode, the boundary of "cells with non-zero Q" expands by roughly one cell toward the start. After enough episodes, every cell on the optimal path has a positive Q-value pointing toward the goal, geometrically decayed by γ per step.

At convergence, for cells along the optimal path:

Q*(start, optimal_action) ≈ γ^steps · goal_reward + (per-step rewards summed)

This is what makes Q-learning work. The reward at the goal is one bit of information per episode, but it propagates back through the table like a ripple, eventually labeling every relevant state with its discounted future return.

Visualization

If you printed the Q-table after every 100 episodes and color-coded by max-action-Q, you'd see a heatmap that starts dark everywhere, then a bright spot near the goal that expands outward like dye spreading through water. By episode 5000 the whole grid is illuminated and the gradient points cleanly toward the optimal goal.
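If you want to see that for yourself, here is a sketch of a helper (hypothetical — not in the starter file) that prints the max-action Q-value at every cell, top row first:

import numpy as np

def print_value_grid(q_table):
    values = q_table.max(axis=-1)                       # collapse the action axis: one number per cell
    for y in reversed(range(values.shape[0])):          # print the top row first, like the grid
        print("  ".join(f"{values[y, x]:6.2f}" for x in range(values.shape[1])))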

Why the speed of propagation matters

One reward per episode = one TD error per episode = one Q-value updated by a meaningful amount per episode. The propagation is "1 cell per episode" in the worst case. With 5000 episodes and a path length of 7, that's plenty.

In a 100×100 grid with sparse rewards, you'd need orders of magnitude more episodes for the wave to reach all relevant cells. This is why neural-network RL on continuous environments often needs millions of steps — the "wavefront" of learned values has a longer journey to make.

Chapter 08

The Full Algorithm

Tabular Q-Learning with ε-Greedy Exploration
  1. Initialize Q-table to zeros: q_table = np.zeros((H, W, 4)).
  2. Initialize a seeded random number generator: rng = np.random.default_rng(seed).
  3. Outer loop: for episode = 1 to N_episodes:
    a) Reset state to start: state = env.start
    b) Compute current epsilon: ε = epsilon_for_episode(scenario, episode)
    c) Inner loop: for step = 1 to horizon:
    i. Pick action via ε-greedy: a = choose_action(q_table, state, ε, rng)
    ii. Step the env: s', r, done = env.step(state, a)
    iii. Compute TD target:
    if done: target = r
    else: target = r + γ · maxa' Q[s', a']
    iv. Update: Q[state, a] += α (target − Q[state, a])
    v. Move forward: state = s'
    vi. If done, break the inner loop.
  4. Evaluate: roll out the greedy policy from start, collect trajectory, log outcome.

That's it. Three nested layers (scenario × episode × step), one update equation, one exploration rule. Tabular Q-learning fits in 30 lines.

Chapter 09

Code Tour: gridworld_q_learning.py

Five components in the starter file: action constants, Scenario dataclass, GridWorld env, helper functions, and the training loop.

Action constants

gridworld_q_learning.py:8-12

LEFT = 0
RIGHT = 1
DOWN = 2
UP = 3
ACTION_NAMES = {LEFT: "left", RIGHT: "right", DOWN: "down", UP: "up"}

Just integer codes for actions. You'll see q_table[..., RIGHT] in the code, which is the same as q_table[..., 1] but reads better.

The Scenario dataclass

gridworld_q_learning.py:15-28

@dataclass(frozen=True)
class Scenario:
    name: str
    goal_1_reward: float
    goal_2_reward: float
    step_reward: float
    expected_outcome: str
    episodes: int = 5000
    horizon: int = 20
    alpha: float = 0.2           # learning rate
    gamma: float = 0.98          # discount factor
    epsilon_start: float = 0.4
    epsilon_end: float = 0.02
    seed: int = 0

Bundle of hyperparameters. Three pre-built scenarios at lines 39-61 differ only in goal_1_reward, goal_2_reward, step_reward, and expected_outcome.

The GridWorld env

gridworld_q_learning.py:64-98

class GridWorld:
    def __init__(self, goal_1_reward, goal_2_reward, step_reward):
        self.start = (0, 0)
        self.goal_2 = (4, 0)
        self.goal_1 = (4, 3)
        self.width = 5
        self.height = 4
        self.goal_1_reward = goal_1_reward
        self.goal_2_reward = goal_2_reward
        self.step_reward = step_reward

    def step(self, state, action):
        x, y = state
        if action == LEFT:  x = max(0, x - 1)
        if action == RIGHT: x = min(self.width - 1, x + 1)
        if action == DOWN:  y = max(0, y - 1)
        if action == UP:    y = min(self.height - 1, y + 1)
        next_state = (x, y)

        reward = self.step_reward
        done = False
        if next_state == self.goal_2:
            reward += self.goal_2_reward
            done = True
        if next_state == self.goal_1:
            reward += self.goal_1_reward
            done = True
        return next_state, reward, done

Deterministic transitions. max(0, ...) and min(width-1, ...) implement the "stay in place when moving off-edge" rule. Reaching either goal sets done=True and adds the goal bonus.

A subtle thing about env.step

The env doesn't know that the goal is terminal. It just returns done=True and trusts the agent to stop. If the training loop forgets to break, the agent can keep stepping from the goal cell as if the episode hadn't ended — getting more step penalties, polluting Q-values for terminal cells. This is exactly the bug we'll discuss in Chapter 10.

Helper functions

gridworld_q_learning.py:107-110

def action_values(q_table, state):
    x, y = state
    return q_table[y, x]    # shape (4,) — the Q-values for all 4 actions at this cell

Tiny but useful. Reads "give me the row of Q-values at this state" without making the (x, y) vs (y, x) flip explicit at every call site.

The blanks you fill in

Three blanks:

  1. greedy_action(q_table, state) — pick the argmax action.
  2. choose_action(q_table, state, epsilon, rng) — ε-greedy version.
  3. The body of the inner loop in train_q_learning — the actual Q-learning update.

Next chapter: every line you wrote, decoded.

Chapter 10

Your Three Changes, Decoded

The centerpiece. Every line you wrote, with a linenote block per line, explaining what each NumPy/Python operation does and why.

Change 1 of 3
greedy_action — deterministic best-action picker

What you wrote:

def greedy_action(q_table, state):
    q_values = action_values(q_table, state)
    action = int(np.argmax(q_values))
    return action

Line 1: q_values = action_values(q_table, state)

q_values = action_values(q_table, state)

Calls the helper from line 107. Returns the 4-element row of Q-values at this state.

Internally, action_values unpacks state = (x, y) and returns q_table[y, x] — shape (4,). So q_values[0] is Q for LEFT, q_values[1] is Q for RIGHT, etc.

Why use the helper instead of inlining q_table[y, x]? Three reasons: keeps the (x, y) vs (y, x) flip in one place, makes the intent clearer at the call site, and lets us swap the Q-storage convention later without rewriting everything.

Line 2: action = int(np.argmax(q_values))

action = int(np.argmax(q_values))

np.argmax(q_values) returns the index of the largest element — an integer in {0, 1, 2, 3}. This is the index of the best action, which is exactly what the env's step function expects.

The wrapper int(...) converts NumPy's np.int64 result to a plain Python int. Why does this matter? Because the env asserts action in (LEFT, RIGHT, DOWN, UP), which are plain Python ints. np.int64(0) in (0, 1, 2, 3) happens to return True due to numeric equality, but explicit casting is cleaner and removes any ambiguity if you later change the assert.

The function is typed -> int — the cast also matches the type signature.
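A two-line check of the type difference, if you're curious (the exact NumPy integer type is platform-dependent):

import numpy as np

a = np.argmax(np.zeros(4))
print(type(a))          # a NumPy integer type (np.int64 on most platforms), not a plain int
print(type(int(a)))     # <class 'int'>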

Line 3: return action

return action

Trivial. Hand the integer back to the caller. rollout_policy uses this to pick actions during evaluation, and choose_action uses it inside its else-branch.

The bug we found earlier

Original first attempt: return np.max(action_values(q_table, state)). np.max returns the value (a float like 1.8), not the index. The env then asserts 1.8 in (0, 1, 2, 3), which is False, and raises AssertionError. The fix: argmax. Lesson: when picking actions, the question is "which one?" — that's argmax.

Edge case: what if multiple actions are tied?

At episode 0, all Q-values are 0. np.argmax([0, 0, 0, 0]) returns 0 (LEFT) deterministically — NumPy returns the first occurrence on ties. This is fine; choose_action will use ε-greedy to break out of the LEFT-only initial behavior.

Change 2 of 3
choose_action — ε-greedy

What you wrote:

def choose_action(q_table, state, epsilon, rng):
    if rng.random() < epsilon:
        action = int(rng.integers(4))
    else:
        action = greedy_action(q_table, state)
    return action

Line 1: if rng.random() < epsilon:

if rng.random() < epsilon:

Two things in one line.

1. rng.random() — samples a uniformly distributed float in [0, 1) from the seeded generator. (Range is half-open: 0 is possible, 1 is not.)

2. ... < epsilon — compares to the current epsilon. The probability of this being True is exactly epsilon, because that's the fraction of [0, 1) that lies below epsilon.

Why use rng.random() instead of np.random.random()? Reproducibility. The training script creates a seeded rng = np.random.default_rng(scenario.seed). As long as everywhere in the algorithm uses this same rng object, the entire run is deterministic for a given seed. If you slip and use the global np.random.* functions, you bypass the seed and runs become non-reproducible.

Line 2: action = int(rng.integers(4))

action = int(rng.integers(4))

rng.integers(4) returns a uniform random integer in {0, 1, 2, 3}. The argument 4 is the exclusive upper bound. (Behind-the-scenes call signature is rng.integers(low=0, high=4).)

The int(...) wraps it to a plain Python int, same reason as in greedy_action: matches the type signature and avoids any chance of trouble at the env's assert.

This branch fires with probability ε: the action is uniformly random across all 4 actions. Maximum exploration.

Lines 3-4: the else branch

else: action = greedy_action(q_table, state)

If we didn't take the random branch, fall through to greedy. We could have inlined the argmax here too, but calling greedy_action avoids duplication — if you later change how greedy works (e.g., add tie-breaking), you only edit one place.

This branch fires with probability 1 − ε. Maximum exploitation.

Line 5: return action

return action

Hand back the chosen action.

A subtle bug to avoid — sample order

An earlier version of this function pre-sampled the random action before the if-check, even when going greedy:

action = rng.integers(4)              # always sample, even if unused
if rng.random() < epsilon: return int(action)
...

This works! But it advances the rng state on every call, even when going greedy. Two implementations that are logically equivalent can produce different training results from the same seed. The cleaner version only samples when needed.
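A hypothetical minimal repro of the drift — two logically equivalent choosers that consume different numbers of draws per call, so runs with the same seed fall out of sync:

import numpy as np

def lazy_choice(rng, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(4))     # only draws the action when exploring
    return 0                            # stand-in for the greedy action

def eager_choice(rng, epsilon):
    action = int(rng.integers(4))       # always draws, even if discarded
    if rng.random() < epsilon:
        return action
    return 0                            # stand-in for the greedy action

rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
for _ in range(10):
    lazy_choice(rng_a, 0.4)
    eager_choice(rng_b, 0.4)
print(rng_a.random(), rng_b.random())   # almost certainly differ: the generators are out of sync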

Change 3 of 3
The Q-learning training loop

What you wrote (inside the inner step loop):

action = choose_action(q_table, state, epsilon, rng)
next_state, reward, done = env.step(state, action)
if done:
    target = reward
else:
    target = reward + scenario.gamma * np.max(action_values(q_table, next_state))
x, y = state
q_table[y, x, action] += scenario.alpha * (target - q_table[y, x, action])
state = next_state
if done:
    break

Line 1: action = choose_action(q_table, state, epsilon, rng)

action = choose_action(q_table, state, epsilon, rng)

Call your ε-greedy function. Returns an integer in {0, 1, 2, 3}. The current epsilon was computed by epsilon_for_episode at the start of this episode — same value across all steps within the episode, decaying across episodes.

Line 2: next_state, reward, done = env.step(state, action)

next_state, reward, done = env.step(state, action)

Actually take the action in the environment. Three return values:

next_state — tuple (x', y'). The new agent position after the action.

reward — scalar. The per-step reward plus any goal bonus if we landed on a goal.

done — bool. True if next_state is a goal cell.

Python's tuple unpacking on the left-hand side names them in one line. The env is purely deterministic in this homework, so the same (state, action) always yields the same return.

Lines 3-6: TD target with terminal handling

if done: target = reward

Terminal case: the episode just ended. There is no future state to bootstrap from, so the target is just the observed reward. This is the most informative target of the entire episode — it's the only one with a non-bootstrapped, ground-truth value.

else: target = reward + scenario.gamma * np.max(action_values(q_table, next_state))

Non-terminal case: bootstrap. Three pieces:

reward — the immediate step reward (e.g. −1).

scenario.gamma — the discount factor, 0.98.

np.max(action_values(q_table, next_state)) — max_{a'} Q(s', a'). We look up the Q-values at next_state (a 4-vector) and take the max. This is the agent's own current estimate of "how good is the best thing I can do from next_state." It's biased (the table is imperfect), but it's the best we have.

The whole expression: "the reward I just got, plus my best guess of the future." This is the right-hand side of the Bellman optimality equation, evaluated at one sample.

Lines 7-8: the Q-update

x, y = state

Unpack the state tuple. Recall: state is (x, y), but Q-table indexing is [y, x, action]. We unpack here to make the (x, y)-to-(y, x) flip explicit on the next line.

q_table[y, x, action] += scenario.alpha * (target - q_table[y, x, action])

The Q-learning update, in NumPy.

Reading right to left:

q_table[y, x, action] — current Q estimate, a scalar.

(target - q_table[y, x, action]) — the TD error. How wrong was our current estimate?

scenario.alpha * (...) — scale the error by the learning rate (0.2). We move 20% of the way toward the new evidence.

+= — in-place addition. Equivalent to q_table[y, x, action] = q_table[y, x, action] + ....

This is one of the most important lines in the homework. Memorize the shape: old_value += alpha * (target - old_value). Every TD method — gridworld, neural Q-learning, SAC — uses this same update equation.

Line 9: state = next_state

state = next_state

Walk forward in time. The next iteration of the inner loop will use this updated state. Trivial Python variable rebinding — no NumPy involved.

Lines 10-11: the break-on-done

if done: break

Exits the inner step loop. This is essential.

Without the break, after reaching the goal, the next iteration of the inner loop would call env.step(next_state, action) with next_state being the goal cell. The env doesn't know about terminal-ness; it would happily step further and return (s'', step_reward, False). The agent would keep accumulating step rewards past the episode boundary, polluting Q-values for cells that should never have been visited from the goal.

Critical placement: the break is after the Q-update, not before. If you put the break inside the if done: target = reward block, the loop exits before the Q-update runs — meaning the terminal reward is computed but never written to the Q-table. The most informative update of the entire episode gets skipped, and learning fails completely. We caught this exact bug earlier.

The two-bug pattern from this assignment

Many students hit one of two bugs:

Forgot break entirely: agent keeps walking off the goal, racks up extra step penalties, total_reward at evaluation is wrong, and Q-values for cells visited after the goal are corrupted.

Break inside if done: terminal Q-update never happens, reward never propagates back, training never converges, evaluation produces "agent walks left into wall forever."

The fix is the same in both cases: one break, at the bottom of the loop, after the Q-update.

Putting it all together

The full inner loop, distilled:

One Q-learning step, in plain English
  1. Look at the Q-table; pick an action with ε-greedy.
  2. Take that action; observe what happened.
  3. Compute "what we now think Q(state, action) should be."
  4. Nudge Q(state, action) 20% of the way toward that new estimate.
  5. Move forward one step.
  6. If the episode ended, stop and start a new one.

Six lines of pseudocode. ~10 lines of actual code. That's the entire engine of tabular Q-learning.

Chapter 11

The Three Scenarios

Same algorithm, three different reward structures, three different optimal policies. The deliverable for Task 1.3 is to report which goal each scenario reaches and explain why in one sentence.

Scenario 1: r_step = −1, R1 = 10, R2 = 5

Compare the two paths:

Goal 1: 7 steps × (−1) + 10 = 3
Goal 2: 4 steps × (−1) + 5 = 1

Goal 1 wins by 2. The Q-table reflects this: at the start, RIGHT and UP have the highest values (both lie on a path toward Goal 1's diagonal).

Observed: trajectory through the diagonal toward Goal 1, total reward = 3. ✓

One-sentence explanation: Going to Goal 1 gives 7×(−1) + 10 = 3, while Goal 2 gives 4×(−1) + 5 = 1, so the agent learns Goal 1 is the better target.

Scenario 2: r_step = −2, R1 = 10, R2 = 5

Goal magnitudes unchanged. Step penalty doubled. Compare again:

Goal 1: 7 steps × (−2) + 10 = −4
Goal 2: 4 steps × (−2) + 5 = −3

Goal 2 wins by 1. The 3 extra steps to Goal 1 cost 6 (with the doubled penalty), wiping out the 5-point reward gap.

Observed: trajectory straight right to Goal 2, total reward = −3. ✓

One-sentence explanation: Now Goal 1 gives 7×(−2) + 10 = −4 but Goal 2 gives 4×(−2) + 5 = −3, so doubling the step cost flips which goal is better and the agent switches to Goal 2.

Scenario 3: r_step = +1, R1 = R2 = 1

Step reward is now positive. Compare again:

Reach Goal 2 (closest, 4 steps):    4 × (+1) + 1 = 5
Stall against the wall, 20 steps:  20 × (+1) + 0 = 20

Stalling wins by a lot. The agent learns to walk LEFT into the wall, soaking up +1 per step for the full horizon. Reaching a goal would terminate that profitable stream after just one bonus.

Observed: trajectory of 20 LEFT actions, agent stays at (0, 0) the whole time, total reward = 20. ✓

Q-values at start are all 50.0: that's the infinite-horizon discounted return of "+1 forever," computed exactly as 1 / (1 − γ) = 1 / 0.02 = 50. Q-learning converged to the optimal value of "stay alive forever."
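A quick check of that 1 / (1 − γ) claim:

gamma = 0.98
print(round(sum(gamma ** k for k in range(2000)), 6))   # 50.0 — the "+1 forever" series, summed numerically
print(round(1 / (1 - gamma), 6))                        # 50.0 — the closed form 1 / 0.02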

One-sentence explanation: Reaching the closest goal (Goal 2 in 4 steps) gives 4×(+1) + 1 = 5, but stalling against the wall for the full horizon gives 20×(+1) + 0 = 20, so the agent learns to never finish the task because the +1 per step pays more than any goal bonus.

The big-picture lesson

Reward shaping is hard

Three small changes to the reward function produce three completely different optimal policies, including one (scenario 3) that refuses to do the task you wanted. This is called reward hacking — the agent finds a way to maximize the reward you specified, which is not the behavior you intended.

In real-world RL, reward design is one of the hardest engineering problems. Sparse rewards (P2/P3) are intentionally chosen to avoid this trap: with reward only at completion, the only way to get reward is to actually complete the task. The cost is that learning is slower because reward is rare.

The Q-values tell a story

You also reported the Q-values at the start state. Look at scenario 1:

start_q_values: (1.219, 2.265, 1.219, 2.265)
                LEFT   RIGHT  DOWN   UP

LEFT and DOWN are tied at 1.219 — both move the agent into a wall (LEFT off the left edge, DOWN off the bottom edge), keeping it in place but still costing the step penalty. RIGHT and UP are tied at 2.265 — both move toward Goal 1's "diagonal." The Q-table has correctly figured out the symmetry.

For scenario 2:

start_q_values: (-4.996, -3.057, -4.996, -4.329)
                 LEFT   RIGHT   DOWN    UP

RIGHT is now distinctly better than UP — because the path to Goal 2 (going right four times) is now the optimal path. UP would head toward the now-suboptimal Goal 1. The Q-table has figured out that the policy should change.

This is why looking at the actual numbers after training is so satisfying with tabular Q-learning. You can directly inspect the values it learned and verify they encode the right preferences.

Chapter 12

Cheat Sheet & Self-Quiz

Equations to memorize

Bellman optimality:   Q*(s, a) = 𝔼[ r(s, a, s') + γ max_{a'} Q*(s', a') ]

Q-learning update:    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
                      (the bracketed term is the TD error)

ε-greedy policy:      with probability ε:       a ~ uniform(actions)
                      with probability 1 − ε:   a = argmax_{a'} Q(s, a')

Hyperparameter reference

Symbol    | Code                    | Value | What it does
α         | scenario.alpha          | 0.2   | Learning rate — how much the Q-table moves per update
γ         | scenario.gamma          | 0.98  | Discount factor — how much the future is worth
ε_start   | scenario.epsilon_start  | 0.4   | Initial random-action probability
ε_end     | scenario.epsilon_end    | 0.02  | Final random-action probability
episodes  | scenario.episodes       | 5000  | Number of training episodes
horizon   | scenario.horizon        | 20    | Max steps per episode
seed      | scenario.seed           | 0     | Random seed for reproducibility

Self-quiz — if you can answer these without re-reading, you're ready

  1. Why does Q-learning need to track actions in the table, not just states?
  2. What does the discount factor γ do, and what would happen with γ=1 in scenario 3?
  3. Why is the Q-table initialized to zeros and not random values?
  4. What is the difference between np.max and np.argmax? When do you want each?
  5. Why does the inner loop need if done: break? Where does the break go?
  6. If you forgot the if done in the TD target and always used the bootstrap formula, what would happen?
  7. Why is ε high at the start of training and low at the end?
  8. What does the equation Q ← Q + α (target − Q) mean intuitively?
  9. Suppose r_step = 0 in scenario 3. What policy would the agent learn? Why?
  10. If the agent never explored (ε = 0 throughout), what would happen?
  11. What does rng.integers(4) return, and why use rng instead of np.random?
  12. Why does the Q-table use shape (height, width, 4) and not (width, height, 4)?
Answer key

1. Because the policy needs to know which action is best at each state. A V(s)-only table would say "this cell is good" but not "go RIGHT from here." With Q(s, a), greedy action selection is just argmax over the action axis.

2. γ trades off immediate vs. future rewards. With γ=1 in scenario 3, the infinite-horizon stalling return would diverge to infinity, the Q-values would never converge, and the algorithm would be ill-defined. γ < 1 makes the sum bounded.

3. Zero is a neutral starting point that doesn't bias exploration. With random init, some actions would have spuriously high values that exploitation would lock onto early, before exploration could correct them. (In neural Q-learning we sometimes need careful initialization, but tabular benefits from the simpler zero start.)

4. np.max returns the largest value (a number); np.argmax returns the index of the largest value (an integer). For action selection you always want argmax. For computing the bootstrap target r + γ max_{a'} Q(s', a'), you want max.

5. Because the env doesn't know about terminal states — it'll keep stepping if you keep calling step(). The break exits the inner loop on terminal so we don't pollute Q-values for cells visited after the goal. The break goes after the Q-update, not before, so the terminal reward gets written to the table.

6. You'd add γ max Q(s_goal, a') to the target on terminal steps. Initially Q(goal, *) = 0, so the target would equal the reward and you'd be lucky. But once the agent visits states after the goal (because of the missing break), Q(goal, *) becomes non-zero and biases all future targets upward.

7. Early on, the Q-table is uninformative, so greedy action selection is essentially random and we should mix in real randomness to discover reward. Late in training, Q is accurate, so we should mostly trust it. Linearly decaying ε gets both phases.

8. "Move the current estimate 20% of the way toward the new evidence." Equivalently: Q = (1−α) old_Q + α new_target. It's a smoothed running average of one-sample TD targets.

9. With r_step = 0 and equal goal rewards of 1, the undiscounted totals are equal regardless of path length — but with γ = 0.98 the nearer goal's bonus is discounted less, so Goal 2 (4 steps away) has the higher Q-value and the agent learns to head there.

10. At episode 0 with Q all zeros, argmax always returns LEFT (action 0). Without exploration, the agent walks LEFT into the wall every step, never sees a goal, never updates the Q-table. Training fails completely.

11. rng.integers(4) returns a uniform random integer in {0, 1, 2, 3} (4 is exclusive upper bound). We use the seeded rng for reproducibility — the global np.random bypasses the seed.

12. NumPy is row-major: the first axis is the slowest-varying. The natural way to print or visualize a grid is row-by-row, so putting height first lets q_table[y] give you a whole row, which matches how grids are typically rendered.

Running it

cd /Users/ozyphus/Documents/GitHub/cs224r/homework/hw2/hw2
python gridworld_q_learning.py

Should print three blocks (one per scenario), each with expected/observed outcome, trajectory, actions, total reward, and start Q-values. If observed_outcome matches expected_outcome for all three, your implementation is correct.

Take it back to class

You can now teach this

Three big ideas, in order of importance:

  1. The Bellman equation. Q(s, a) decomposes into immediate reward plus discounted future Q. This recursion is the entire engine of TD learning — tabular, neural, on-policy, off-policy, all of them.
  2. The Q-learning update. Q ← Q + α (target − Q). Same equation, scaled up to neural nets, becomes DQN, SAC, and every value-based method in modern RL.
  3. Reward shaping is hard. The optimal policy is sensitive to small changes in the reward function. Scenario 3 is the canonical example of "the agent did exactly what you said, not what you wanted."

If a friend asks: "How does Q-learning work?" — you say: "It's stochastic gradient descent on the Bellman residual. You don't know the right Q-values, but every transition gives you a noisy sample of the right-hand side. Move toward that sample by α each time. Repeat enough and the Q-table converges to the optimal Q-function. The two tricks: discount future rewards by γ to make sums finite, and use ε-greedy to ensure every state-action gets sampled."

You can teach this. On to Problem 2.