A 5×4 grid. Two goals at different distances. One agent that knows nothing. By the end: a working Q-table, a clear understanding of why a small change in step reward flips the optimal policy, and the foundation for every neural-net RL algorithm in this homework.
Before we touch a hammer or a robot arm, we solve a simpler version of the same problem in a setting where you can hold every state in your head. That's the point of Problem 1.
The world: a 5×4 grid. The agent always starts at (0, 0) — bottom-left. There are two goal cells:
- Goal 1 at (4, 3), 7 steps away (4 right, 3 up).
- Goal 2 at (4, 0), 4 steps away (4 right).

Four actions: left, right, up, down. Moving off the edge keeps you in place but still costs you the per-step reward. Reaching either goal ends the episode.
The reward function is parameterized by three numbers, the per-step reward r_step and the two goal bonuses R_1 and R_2:

r(s, a, s') = r_step + R_1 · 1[s' = goal 1] + R_2 · 1[s' = goal 2]

1[·] is the indicator function: 1 if true, 0 if false. So every step the agent gets r_step, plus a one-time bonus when it hits a goal.
| Scenario | r_step | R_1 (Goal 1) | R_2 (Goal 2) | Optimal goal |
|---|---|---|---|---|
| 1 | −1 | 10 | 5 | Goal 1 (the far one) |
| 2 | −2 | 10 | 5 | Goal 2 (the near one) |
| 3 | +1 | 1 | 1 | Neither — stall forever |
Same algorithm, three different reward functions, three different optimal policies. The lesson of P1 is exactly this: the optimal policy is not determined by goal magnitude alone — the per-step reward reshapes the value landscape. We'll see why each scenario produces what it does.
Tabular Q-learning has every property of "real" RL except one: function approximation. State is enumerable, actions are discrete, you can store every Q-value in a table. This means you can watch the algorithm learn: every number it knows is sitting in one array you can print.
Once you internalize this version, the leap to neural Q-learning (Problem 3) and policy gradients (Problem 2) is just "replace the table with a network and add stabilization tricks."
If you know Python but NumPy is fuzzy, read this. You'll need every concept here when reading the code.
A NumPy array (ndarray) is a contiguous block of numbers with a shape. It supports vectorized math: operations apply elementwise without writing a for loop.
```python
import numpy as np

a = np.array([1, 2, 3])   # shape (3,)
b = np.zeros((4, 5, 4))   # shape (4, 5, 4), all zeros
c = a * 10                # [10, 20, 30] — vectorized
```
The Q-table in this homework has shape (height, width, 4) = (4, 5, 4). So q_table is a 4×5×4 cube of numbers. Indexing pulls out a slice:
```python
q_table = np.zeros((4, 5, 4))  # shape (4, 5, 4)
q_table[2]        # shape (5, 4) — one whole y-row
q_table[2, 3]     # shape (4,)  — the 4 Q-values at one cell
q_table[2, 3, 1]  # scalar      — one specific Q-value
```
The state is stored as (x, y) tuples (column, row) but the Q-table is indexed as q_table[y, x, action] (row, column, action). This is because NumPy follows row-major convention: the first axis is the slowest-varying. When the code does x, y = state followed by q_table[y, x, action], that's the unpack-then-flip-the-order pattern. Get this wrong and you train Q-values for the wrong cells without crashing — silent bug.
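A quick sanity check of the flip, with made-up values just to illustrate the pattern:

```python
import numpy as np

q_table = np.zeros((4, 5, 4))   # (height, width, actions)
state = (3, 2)                  # (x, y) = column 3, row 2

x, y = state                    # unpack (x, y)...
q_table[y, x, 1] = 7.0          # ...but index as [y, x, action]

print(q_table[2, 3, 1])         # 7.0, i.e. row 2, column 3, action RIGHT
```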
Two close-but-different operations:
```python
q = np.array([0.1, 0.5, 0.3, 0.2])
np.max(q)     # 0.5 — the VALUE
np.argmax(q)  # 1   — the INDEX where the max lives
```
You almost always want argmax when picking actions: it returns "which action is best" (an integer), not "how good is the best action" (a float).
NumPy has two APIs for randomness. The old one (np.random.randint(...)) uses a global state — convenient but bad for reproducibility. The new one creates an explicit Generator:
```python
rng = np.random.default_rng(seed=42)
rng.random()                 # float in [0, 1)
rng.integers(4)              # int in {0, 1, 2, 3} — INCLUSIVE 0, EXCLUSIVE 4
rng.integers(low=2, high=5)  # int in {2, 3, 4}
```
The training code creates a seeded Generator (rng = np.random.default_rng(scenario.seed)). When your code uses this rng object, runs are reproducible. If you accidentally use np.random.* directly, you bypass the seed and get different results every time.
The method is integers with an s. (Plural — consistent with NumPy's "vectorized by default" mindset, even when generating one number.) rng.integer doesn't exist and will throw AttributeError. Easy typo to make.
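A minimal demonstration of why the explicit generator matters (the seed value here is arbitrary):

```python
import numpy as np

# Two generators built from the same seed produce identical streams.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)
print(rng_a.integers(4) == rng_b.integers(4))  # True, every run
print(rng_a.random() == rng_b.random())        # True again

# The global np.random.* API has no such guarantee unless you seed it
# globally, and any other code touching np.random perturbs the stream.
```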
The starter code uses Python dataclasses heavily:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    goal_1_reward: float
    goal_2_reward: float
    step_reward: float
    episodes: int = 5000
    alpha: float = 0.2
    gamma: float = 0.98
    # ...
```
This is a structured record with fields. frozen=True makes instances immutable — you can't reassign scenario.alpha = 0.3 after construction. Useful for hyperparameters: ensures the algorithm code can't accidentally mutate them mid-training.
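What frozen=True buys you, in a tiny self-contained demo (Hyper is a stand-in class of mine, not from the starter code):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Hyper:            # stand-in for Scenario, just to show the mechanics
    alpha: float = 0.2

h = Hyper()
try:
    h.alpha = 0.3       # attempted mutation
except FrozenInstanceError:
    print("immutable: hyperparameters can't drift mid-training")
```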
That's enough NumPy. Onward to the math.
Before deriving Q-learning, let's nail down the language. Five concepts.
A complete description of the world right now. In this gridworld, the state is just (x, y), the agent's grid position. The state has the Markov property: it contains everything you need to know to predict the next state and reward, regardless of history.
What the agent does. Here: one of {left, right, down, up}, integer-coded as {0, 1, 2, 3}. The action is the agent's only way to influence the world.
A scalar feedback signal: how good is this transition? The agent's only goal is to maximize total reward over time. We design the reward function to encode whatever we want the agent to do.
A rule for choosing actions. Could be deterministic (a = f(s)) or stochastic (a ~ π(·|s)). In Q-learning, we derive the policy from the Q-table by argmax: pick the action with the highest Q-value.
The total discounted reward from time t to the end:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ...
The discount factor γ ∈ [0, 1) is 0.98 in this homework. Rewards far in the future are worth less than rewards now.
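A quick numeric example with a made-up reward sequence, to make the discounting concrete:

```python
gamma = 0.98
rewards = [-1, -1, -1, 9]          # three step penalties, then a goal hit
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(G, 3))                 # 5.53, less than the undiscounted sum of 6
```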
Three reasons:

- Uncertainty: rewards far in the future are less certain, so they should count for less.
- Boundedness: with γ < 1 and rewards bounded by r_max, the infinite sum converges; the return can never exceed r_max / (1 − γ).
- Convergence: γ < 1 is what makes the Bellman backup a contraction, which underpins Q-learning's convergence guarantee (more on this below).

RL boils down to one thing: find the policy that maximizes expected return,

π* = argmax_π E_π[G_0]
The expectation accounts for randomness in policy and environment. We want a policy that performs well on average, not one that got lucky once.
The Q-function is a tool to find that optimal policy. Let's build it.
Q-learning rests on one beautiful equation. Here's how it falls out of the definitions.
Define the Q-value as the expected return given an action:

Q^π(s, a) = E[G_t | s_t = s, a_t = a] = E[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s, a_t = a]
Now the trick. Pull the first term out of the sum:

Q^π(s, a) = E[r_t + γ (r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...) | s_t = s, a_t = a]
The term in parentheses is just the return from time t+1 onward — which is itself a Q-value, evaluated at the next state and the next action. So:

Q^π(s, a) = E[r_t + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]

This is the Bellman expectation equation. The Q-value of (s, a) decomposes into the immediate reward plus the discounted Q-value at the next step.
If we want the optimal Q-function (corresponding to the optimal policy), we replace "follow π" with "always pick the best action":

Q*(s, a) = E[r + γ max_{a'} Q*(s', a') | s, a]
The max_{a'} says: from the next state, take whichever action gives the highest Q-value. This is what makes the equation an optimality condition — it specifies the value of always acting greedily from here forward.
The Bellman optimality equation is a fixed-point equation. The optimal Q* is the unique solution. If you can solve this equation — even approximately — you've solved the RL problem, because the optimal policy is just π*(s) = argmax_a Q*(s, a).
| Method | Idea | Caveat |
|---|---|---|
| Value iteration | Repeatedly update Q[s, a] ← expected RHS, sweeping all states | Requires knowing the env transition function |
| Q-learning | Sample (s, a, r, s') tuples by acting; nudge Q[s, a] toward observed RHS | Don't need transition probabilities; learn from experience |
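To make the first row of that table concrete, here's what value iteration would look like on this gridworld. A minimal sketch, assuming the GridWorld class shown later in the starter-code chapter; the function name and sweep count are my own choices:

```python
import numpy as np

def value_iteration(env, gamma=0.98, sweeps=200):
    """Sweep all states, applying the Bellman optimality backup exactly.
    Only possible because env.step is a known, deterministic transition fn."""
    q = np.zeros((env.height, env.width, 4))
    for _ in range(sweeps):
        for y in range(env.height):
            for x in range(env.width):
                if (x, y) in (env.goal_1, env.goal_2):
                    continue  # terminal cells have no outgoing transitions
                for a in range(4):
                    (nx, ny), r, done = env.step((x, y), a)
                    q[y, x, a] = r if done else r + gamma * np.max(q[ny, nx])
    return q
```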
We don't know the env's transition function in general — we just observe samples by acting in it. So we use Q-learning. Next chapter.
Q-learning is the Bellman optimality equation, written as an update rule.
If we knew the right-hand side r + γ max_{a'} Q*(s', a') exactly, we'd just set Q[s, a] = RHS. Done.
We don't. We have a noisy sample of the RHS, computed from one observed transition. The trick: nudge Q[s, a] toward the observed RHS, with a learning rate α controlling how much.
Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]

This is the most important equation in this homework. Memorize it.
Read it as three pieces:
- r + γ max_{a'} Q(s', a') — the TD target. This is what we now think Q(s, a) should be, after seeing one more step of evidence.
- Q(s, a) — what we currently think it is.
- target − Q(s, a) — the TD error. How wrong was our current estimate? Positive: target is higher, push Q up. Negative: push Q down.

So the update reads: "new Q = old Q + α · (how wrong we were)." Or equivalently:
"New Q = blend of old Q (weight 1−α) and the new evidence (weight α)." Smoothly averaging old belief with new data. With α=0.2 (this homework's default), each update moves us 20% of the way toward the latest evidence.
If s' is terminal (the episode just ended), there's no future:

Q(s, a) ← Q(s, a) + α [r − Q(s, a)]
In code, this is an if/else. If you forget and use the second formula even on terminal steps, you'd add γ max_{a'} Q(goal, a') — but the Q-values at the goal cell were never observed during a "from-goal" transition because the episode ended there. They're just zeros (or worse, learned junk from spurious updates). Forgetting the if/else is a real bug, not a stylistic concern.
Theorem (Watkins, 1989): if every (s, a) pair is visited infinitely often and α is decayed appropriately, tabular Q-learning provably converges to Q* — the unique fixed point of the Bellman optimality equation.
Why? At convergence, the expected TD error is zero. That happens exactly when Q satisfies the Bellman optimality equation. So the algorithm is doing stochastic gradient descent on the Bellman residual, and it converges because the Bellman operator is a contraction.
You're never told the right answer. You only see one-step samples of reality. Each sample lets you update one entry in your table by a little, in the direction of less error. Repeat 100,000 times. The errors average out, and your table converges to the truth.
The Q-learning update is one half of the algorithm. The other half: how do we choose which (s, a) to update? If we always pick the action with the highest current Q-value (greedy), we'll never try alternatives. We might miss a better path.
You need both. The ε-greedy rule mixes them: with probability ε take a uniformly random action, otherwise take the greedy one. Pure greedy and you'll get stuck in suboptimal habits. Pure random and you'll never act on what you've learned.
Set ε high early in training (lots of exploration) and decay it low later (mostly exploit what you've learned). This homework decays linearly from 0.4 down to 0.02 across 5000 episodes:
```python
def epsilon_for_episode(scenario, episode_idx):
    fraction = episode_idx / max(1, scenario.episodes - 1)
    return scenario.epsilon_start + fraction * (scenario.epsilon_end - scenario.epsilon_start)
```
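Spot-checking the schedule at a few episodes, using a stand-in namespace so the demo doesn't depend on the exact Scenario signature:

```python
from types import SimpleNamespace

s = SimpleNamespace(episodes=5000, epsilon_start=0.4, epsilon_end=0.02)
print(round(epsilon_for_episode(s, 0), 3))     # 0.4  — full exploration budget
print(round(epsilon_for_episode(s, 2500), 3))  # 0.21 — halfway through the decay
print(round(epsilon_for_episode(s, 4999), 3))  # 0.02 — almost pure exploitation
```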
At episode 0, the Q-table is all zeros. Argmax over zeros returns 0 (LEFT). Without exploration, the agent would pick LEFT every step, hit the left wall, never see a goal, and never learn anything. The Q-table would stay zero forever.
With ε=0.4, the agent still acts greedily on 60% of steps but randomizes on the other 40%. Over 20 timesteps per episode, that's plenty of random stumbling to eventually trip into a goal. The first time it does, the goal reward propagates one cell into the Q-table. Next episode, the previous cell's Q-value is non-zero, so greedy starts working from there. It cascades.
Neural-net RL has the same problem on continuous actions. PPO (Problem 2) handles it via entropy bonus on a stochastic policy. Off-policy SAC (Problem 3) handles it via fixed-std Gaussian noise + replay buffer diversity. All three are forms of "make the policy stochastic enough to discover the reward."
This is the part of Q-learning that feels almost magical when you see it for the first time. The reward only happens at terminal cells — but somehow, after enough episodes, every cell in the path knows it's on a path to reward. How?
Initially, q_table is all zeros. The agent ε-greedy-walks for 20 steps, mostly randomly. By chance, on some episode it stumbles into Goal 1 from cell (4, 2) by going UP.
The transition: (4, 2) → UP → (4, 3), reward = −1 + 10 = 9, done = True.
The TD update with done:

Q[(4, 2), UP] ← 0 + 0.2 · (9 − 0) = 1.8
One cell in the table now has a non-zero Q-value. Everything else is still zero.
Suppose the agent reaches (3, 2) and takes RIGHT. Transition: (3, 2) → (4, 2), reward = −1, not done.
The TD update without done:

Q[(3, 2), RIGHT] ← 0 + 0.2 · (−1 + 0.98 · max Q[(4, 2)] − 0) = 0.2 · (−1 + 0.98 · 1.8) = 0.1528
Now (3, 2), RIGHT has a non-zero Q. The agent now "knows" that going right from (3, 2) is on a path to reward.
Repeat the propagation. Each episode, the boundary of "cells with non-zero Q" expands by roughly one cell toward the start. After enough episodes, every cell on the optimal path has a positive Q-value pointing toward the goal, geometrically decayed by γ per step.
At convergence, for a cell k steps from the goal along the optimal path:

Q*(s, a*) = r_step · (1 + γ + ... + γ^(k−1)) + γ^(k−1) · R_goal
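A quick check of this formula against the trained values reported in the results chapter (scenario 1, start state, k = 7 steps from Goal 1):

```python
gamma, r_step, r_goal, k = 0.98, -1.0, 10.0, 7

q_start = sum(gamma**i * r_step for i in range(k)) + gamma**(k - 1) * r_goal
print(round(q_start, 3))  # 2.265, matching the reported start Q for RIGHT/UP
```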
This is what makes Q-learning work. The reward at the goal is one bit of information per episode, but it propagates back through the table like a ripple, eventually labeling every relevant state with its discounted future return.
If you printed the Q-table after every 100 episodes and color-coded by max-action-Q, you'd see a heatmap that starts dark everywhere, then a bright spot near the goal that expands outward like dye spreading through water. By episode 5000 the whole grid is illuminated and the gradient points cleanly toward the optimal goal.
One reward per episode = one TD error per episode = one Q-value updated by a meaningful amount per episode. The propagation is "1 cell per episode" in the worst case. With 5000 episodes and a path length of 7, that's plenty.
In a 100×100 grid with sparse rewards, you'd need orders of magnitude more episodes for the wave to reach all relevant cells. This is why neural-network RL on continuous environments often needs millions of steps — the "wavefront" of learned values has a longer journey to make.
- Init: q_table = np.zeros((H, W, 4)), rng = np.random.default_rng(seed).
- Per episode: state = env.start, ε = epsilon_for_episode(scenario, episode).
- Per step: a = choose_action(q_table, state, ε, rng); s', r, done = env.step(state, a); Q[state, a] += α (target − Q[state, a]); state = s'.

That's it. Three nested layers (scenario × episode × step), one update equation, one exploration rule. Tabular Q-learning fits in 30 lines.
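Putting the skeleton together, here's a condensed sketch of the whole loop, assuming the helpers behave as described in this chapter. This is a paraphrase, not the starter file verbatim:

```python
import numpy as np

def train_q_learning(scenario, env):
    rng = np.random.default_rng(scenario.seed)
    q_table = np.zeros((env.height, env.width, 4))

    for episode in range(scenario.episodes):
        state = env.start
        epsilon = epsilon_for_episode(scenario, episode)

        for _ in range(scenario.horizon):
            action = choose_action(q_table, state, epsilon, rng)
            next_state, reward, done = env.step(state, action)

            # TD target: bootstrap unless the episode just ended
            if done:
                target = reward
            else:
                nx, ny = next_state
                target = reward + scenario.gamma * np.max(q_table[ny, nx])

            x, y = state
            q_table[y, x, action] += scenario.alpha * (target - q_table[y, x, action])

            state = next_state
            if done:
                break  # after the update: the terminal write must happen

    return q_table
```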
Five components in the starter file: action constants, Scenario dataclass, GridWorld env, helper functions, and the training loop.
`gridworld_q_learning.py:8-12`

```python
LEFT = 0
RIGHT = 1
DOWN = 2
UP = 3
ACTION_NAMES = {LEFT: "left", RIGHT: "right", DOWN: "down", UP: "up"}
```
Just integer codes for actions. You'll see q_table[..., RIGHT] in the code, which is the same as q_table[..., 1] but reads better.
`gridworld_q_learning.py:15-28`

```python
@dataclass(frozen=True)
class Scenario:
    name: str
    goal_1_reward: float
    goal_2_reward: float
    step_reward: float
    expected_outcome: str
    episodes: int = 5000
    horizon: int = 20
    alpha: float = 0.2      # learning rate
    gamma: float = 0.98     # discount factor
    epsilon_start: float = 0.4
    epsilon_end: float = 0.02
    seed: int = 0
```
Bundle of hyperparameters. Three pre-built scenarios at lines 39-61 differ only in goal_1_reward, goal_2_reward, step_reward, and expected_outcome.
`gridworld_q_learning.py:64-98`

```python
class GridWorld:
    def __init__(self, goal_1_reward, goal_2_reward, step_reward):
        self.start = (0, 0)
        self.goal_2 = (4, 0)
        self.goal_1 = (4, 3)
        self.width = 5
        self.height = 4
        self.goal_1_reward = goal_1_reward
        self.goal_2_reward = goal_2_reward
        self.step_reward = step_reward

    def step(self, state, action):
        x, y = state
        if action == LEFT:
            x = max(0, x - 1)
        if action == RIGHT:
            x = min(self.width - 1, x + 1)
        if action == DOWN:
            y = max(0, y - 1)
        if action == UP:
            y = min(self.height - 1, y + 1)
        next_state = (x, y)
        reward = self.step_reward
        done = False
        if next_state == self.goal_2:
            reward += self.goal_2_reward
            done = True
        if next_state == self.goal_1:
            reward += self.goal_1_reward
            done = True
        return next_state, reward, done
```
Deterministic transitions. max(0, ...) and min(width-1, ...) implement the "stay in place when moving off-edge" rule. Reaching either goal sets done=True and adds the goal bonus.
The env doesn't know that the goal is terminal. It just returns done=True and trusts the agent to stop. If the training loop forgets to break, the agent can keep stepping from the goal cell as if the episode hadn't ended — getting more step penalties, polluting Q-values for terminal cells. This is exactly the bug we'll discuss in Chapter 10.
`gridworld_q_learning.py:107-110`

```python
def action_values(q_table, state):
    x, y = state
    return q_table[y, x]  # shape (4,) — the Q-values for all 4 actions at this cell
```
Tiny but useful. Reads "give me the row of Q-values at this state" without making the (x, y) vs (y, x) flip explicit at every call site.
Three blanks:

- greedy_action(q_table, state) — pick the argmax action.
- choose_action(q_table, state, epsilon, rng) — ε-greedy version.
- train_q_learning — the actual Q-learning update.

Next chapter: every line you wrote, decoded.
The centerpiece. Every line you wrote, with a linenote block per line, explaining what each NumPy/Python operation does and why.
What you wrote:
```python
def greedy_action(q_table, state):
    q_values = action_values(q_table, state)
    action = int(np.argmax(q_values))
    return action
```
`q_values = action_values(q_table, state)`

Calls the helper from line 107. Returns the 4-element row of Q-values at this state.
Internally, action_values unpacks state = (x, y) and returns q_table[y, x] — shape (4,). So q_values[0] is Q for LEFT, q_values[1] is Q for RIGHT, etc.
Why use the helper instead of inlining q_table[y, x]? Three reasons: keeps the (x, y) vs (y, x) flip in one place, makes the intent clearer at the call site, and lets us swap the Q-storage convention later without rewriting everything.
`action = int(np.argmax(q_values))`

np.argmax(q_values) returns the index of the largest element — an integer in {0, 1, 2, 3}. This is the index of the best action, which is exactly what the env's step function expects.
The wrapper int(...) converts NumPy's np.int64 result to a plain Python int. Why does this matter? Because the env asserts action in (LEFT, RIGHT, DOWN, UP), which are plain Python ints. np.int64(0) in (0, 1, 2, 3) happens to return True due to numeric equality, but explicit casting is cleaner and removes any ambiguity if you later change the assert.
The function is typed -> int — the cast also matches the type signature.
`return action`

Trivial. Hand the integer back to the caller. rollout_policy uses this to pick actions during evaluation, and choose_action uses it inside its else-branch.
Original first attempt: return np.max(action_values(q_table, state)). np.max returns the value (a float like 1.8), not the index. The env then asserts 1.8 in (0, 1, 2, 3), which is False, and raises AssertionError. The fix: argmax. Lesson: when picking actions, the question is "which one?" — that's argmax.
At episode 0, all Q-values are 0. np.argmax([0, 0, 0, 0]) returns 0 (LEFT) deterministically — NumPy returns the first occurrence on ties. This is fine; choose_action will use ε-greedy to break out of the LEFT-only initial behavior.
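You can verify the tie-breaking behavior directly (a tiny demo, not from the starter code):

```python
import numpy as np

print(np.argmax([0.0, 0.0, 0.0, 0.0]))  # 0: first occurrence wins on ties
print(np.argmax([0.5, 0.5, 0.1, 0.5]))  # 0: still the first of the tied maxima
```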
What you wrote:
```python
def choose_action(q_table, state, epsilon, rng):
    if rng.random() < epsilon:
        action = int(rng.integers(4))
    else:
        action = greedy_action(q_table, state)
    return action
```
`if rng.random() < epsilon:`

Two things in one line.
1. rng.random() — samples a uniformly distributed float in [0, 1) from the seeded generator. (Range is half-open: 0 is possible, 1 is not.)
2. ... < epsilon — compares to the current epsilon. The probability of this being True is exactly epsilon, because that's the fraction of [0, 1) that lies below epsilon.
Why use rng.random() instead of np.random.random()? Reproducibility. The training script creates a seeded rng = np.random.default_rng(scenario.seed). As long as everywhere in the algorithm uses this same rng object, the entire run is deterministic for a given seed. If you slip and use the global np.random.* functions, you bypass the seed and runs become non-reproducible.
`action = int(rng.integers(4))`

rng.integers(4) returns a uniform random integer in {0, 1, 2, 3}. The argument 4 is the exclusive upper bound. (Behind the scenes, the call is equivalent to rng.integers(low=0, high=4).)
The int(...) wraps it to a plain Python int, same reason as in greedy_action: matches the type signature and avoids any chance of trouble at the env's assert.
This branch fires with probability ε: the action is uniformly random across all 4 actions. Maximum exploration.
`action = greedy_action(q_table, state)`

If we didn't take the random branch, fall through to greedy. We could have inlined the argmax here too, but calling greedy_action avoids duplication — if you later change how greedy works (e.g., add tie-breaking), you only edit one place.
This branch fires with probability 1 − ε. Maximum exploitation.
`return action`

Hand back the chosen action.
An earlier version of this function pre-sampled the random action before the if-check, even when going greedy:
```python
action = rng.integers(4)  # always sample, even if unused
if rng.random() < epsilon:
    return int(action)
...
```
This works! But it advances the rng state on every call, even when going greedy. Two implementations that are logically equivalent can produce different training results from the same seed. The cleaner version only samples when needed.
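A small demonstration of the divergence (contrived, but it shows the mechanism):

```python
import numpy as np

def eager(rng, epsilon):
    unused = rng.integers(4)           # consumes a draw regardless of the branch
    return rng.random() < epsilon

def lazy(rng, epsilon):
    return rng.random() < epsilon      # consumes one draw only

rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
eager(rng_a, 0.4)
lazy(rng_b, 0.4)
print(rng_a.random() == rng_b.random())  # False: the streams have diverged
```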
What you wrote (inside the inner step loop):
```python
action = choose_action(q_table, state, epsilon, rng)
next_state, reward, done = env.step(state, action)

if done:
    target = reward
else:
    target = reward + scenario.gamma * np.max(action_values(q_table, next_state))

x, y = state
q_table[y, x, action] += scenario.alpha * (target - q_table[y, x, action])

state = next_state
if done:
    break
```
`action = choose_action(q_table, state, epsilon, rng)`

Call your ε-greedy function. Returns an integer in {0, 1, 2, 3}. The current epsilon was computed by epsilon_for_episode at the start of this episode — same value across all steps within the episode, decaying across episodes.
`next_state, reward, done = env.step(state, action)`

Actually take the action in the environment. Three return values:
• next_state — tuple (x', y'). The new agent position after the action.
• reward — scalar. The per-step reward plus any goal bonus if we landed on a goal.
• done — bool. True if next_state is a goal cell.
Python's tuple unpacking on the left-hand side names them in one line. The env is purely deterministic in this homework, so the same (state, action) always yields the same return.
`if done: target = reward`

Terminal case: the episode just ended. There is no future state to bootstrap from, so the target is just the observed reward. This is the most informative target of the entire episode — it's the only one with a non-bootstrapped, ground-truth value.
`target = reward + scenario.gamma * np.max(action_values(q_table, next_state))`

Non-terminal case: bootstrap. Three pieces:
• reward — the immediate step reward (e.g. −1).
• scenario.gamma — the discount factor, 0.98.
• np.max(action_values(q_table, next_state)) — max_{a'} Q(s', a'). We look up the Q-values at next_state (a 4-vector) and take the max. This is the agent's own current estimate of "how good is the best thing I can do from next_state." It's biased (the table is imperfect), but it's the best we have.
The whole expression: "the reward I just got, plus my best guess of the future." This is the right-hand side of the Bellman optimality equation, evaluated at one sample.
`x, y = state`

Unpack the state tuple. Recall: state is (x, y), but Q-table indexing is [y, x, action]. We unpack here to make the (x, y)-to-(y, x) flip explicit on the next line.
`q_table[y, x, action] += scenario.alpha * (target - q_table[y, x, action])`

The Q-learning update, in NumPy.
Reading right to left:
• q_table[y, x, action] — current Q estimate, a scalar.
• (target - q_table[y, x, action]) — the TD error. How wrong was our current estimate?
• scenario.alpha * (...) — scale the error by the learning rate (0.2). We move 20% of the way toward the new evidence.
• += — in-place addition. Equivalent to q_table[y, x, action] = q_table[y, x, action] + ....
This is one of the most important lines in the homework. Memorize the shape: old_value += alpha * (target - old_value). Every TD method — gridworld, neural Q-learning, SAC — uses this same update equation.
`state = next_state`

Walk forward in time. The next iteration of the inner loop will use this updated state. Trivial Python variable rebinding — no NumPy involved.
`if done: break`

Exits the inner step loop. This is essential.
Without the break, after reaching the goal, the next iteration of the inner loop would call env.step(next_state, action) with next_state being the goal cell. The env doesn't know about terminal-ness; it would happily step further and return (s'', step_reward, False). The agent would keep accumulating step rewards past the episode boundary, polluting Q-values for cells that should never have been visited from the goal.
Critical placement: the break is after the Q-update, not before. If you put the break inside the if done: target = reward block, the loop exits before the Q-update runs — meaning the terminal reward is computed but never written to the Q-table. The most informative update of the entire episode gets skipped, and learning fails completely. We caught this exact bug earlier.
Many students hit one of two bugs:
• Forgot break entirely: agent keeps walking off the goal, racks up extra step penalties, total_reward at evaluation is wrong, and Q-values for cells visited after the goal are corrupted.
• Break inside if done: terminal Q-update never happens, reward never propagates back, training never converges, evaluation produces "agent walks left into wall forever."
The fix is the same in both cases: one break, at the bottom of the loop, after the Q-update.
The full inner loop, distilled:

1. Choose an action (ε-greedy).
2. Step the env: observe (s', r, done).
3. Compute the TD target: r if done, else r + γ max_{a'} Q(s', a').
4. Nudge Q[s, a] toward the target by α.
5. Advance: s ← s'.
6. If done, break.

Six lines of pseudocode. ~10 lines of actual code. That's the entire engine of tabular Q-learning.
Same algorithm, three different reward structures, three different optimal policies. The deliverable for Task 1.3 is to report which goal each scenario reaches and explain why in one sentence.
Compare the two paths:

- Goal 1: 7 steps × (−1) + 10 = +3
- Goal 2: 4 steps × (−1) + 5 = +1
Goal 1 wins by 2. The Q-table reflects this: at the start, RIGHT and UP have the highest values (both lie on a path toward Goal 1's diagonal).
Observed: trajectory through the diagonal toward Goal 1, total reward = 3. ✓
One-sentence explanation: Going to Goal 1 gives 7×(−1) + 10 = 3, while Goal 2 gives 4×(−1) + 5 = 1, so the agent learns Goal 1 is the better target.
Goal magnitudes unchanged. Step penalty doubled. Compare again:

- Goal 1: 7 steps × (−2) + 10 = −4
- Goal 2: 4 steps × (−2) + 5 = −3
Goal 2 wins by 1. The 3 extra steps to Goal 1 cost 6 (with the doubled penalty), wiping out the 5-point reward gap.
Observed: trajectory straight right to Goal 2, total reward = −3. ✓
One-sentence explanation: Now Goal 1 gives 7×(−2) + 10 = −4 but Goal 2 gives 4×(−2) + 5 = −3, so doubling the step cost flips which goal is better and the agent switches to Goal 2.
Step reward is now positive. Compare again:

- Reach the near goal: 4 steps × (+1) + 1 = +5
- Stall for the full 20-step horizon: 20 × (+1) = +20
Stalling wins by a lot. The agent learns to walk LEFT into the wall, soaking up +1 per step for the full horizon. Reaching a goal would terminate that profitable stream after just one bonus.
Observed: trajectory of 20 LEFT actions, agent stays at (0, 0) the whole time, total reward = 20. ✓
Q-values at start are all 50.0: that's the infinite-horizon discounted return of "+1 forever," computed exactly as 1 / (1 − γ) = 1 / 0.02 = 50. Q-learning converged to the optimal value of "stay alive forever."
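The geometric series, checked numerically (γ = 0.98):

```python
gamma = 0.98
print(round(sum(gamma**t for t in range(10_000)), 6))  # 50.0, the 1 / (1 − γ) limit
```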
One-sentence explanation: Reaching the closest goal (Goal 2 in 4 steps) gives 4×(+1) + 1 = 5, but stalling against the wall for the full horizon gives 20×(+1) + 0 = 20, so the agent learns to never finish the task because the +1 per step pays more than any goal bonus.
Three small changes to the reward function produce three completely different optimal policies, including one (scenario 3) that refuses to do the task you wanted. This is called reward hacking — the agent finds a way to maximize the reward you specified, which is not the behavior you intended.
In real-world RL, reward design is one of the hardest engineering problems. Sparse rewards (P2/P3) are intentionally chosen to avoid this trap: with reward only at completion, the only way to get reward is to actually complete the task. The cost is that learning is slower because reward is rare.
You also reported the Q-values at the start state. Look at scenario 1:
```
start_q_values: (1.219, 2.265, 1.219, 2.265)
                 LEFT   RIGHT  DOWN   UP
```
LEFT and DOWN are tied at 1.219 — both move the agent into a wall (LEFT off the left edge, DOWN off the bottom edge), keeping it in place but still costing the step penalty. RIGHT and UP are tied at 2.265 — both move toward Goal 1's "diagonal." The Q-table has correctly figured out the symmetry.
For scenario 2:
```
start_q_values: (-4.996, -3.057, -4.996, -4.329)
                 LEFT    RIGHT   DOWN    UP
```
RIGHT is now distinctly better than UP — because the path to Goal 2 (going right four times) is now the optimal path. UP would head toward the now-suboptimal Goal 1. The Q-table has figured out that the policy should change.
This is why looking at the actual numbers after training is so satisfying with tabular Q-learning. You can directly inspect the values it learned and verify they encode the right preferences.
| Symbol | Code | Value | What it does |
|---|---|---|---|
| α | scenario.alpha | 0.2 | Learning rate — how much the Q-table moves per update |
| γ | scenario.gamma | 0.98 | Discount factor — how much the future is worth |
| εstart | scenario.epsilon_start | 0.4 | Initial random-action probability |
| εend | scenario.epsilon_end | 0.02 | Final random-action probability |
| episodes | scenario.episodes | 5000 | Number of training episodes |
| horizon | scenario.horizon | 20 | Max steps per episode |
| seed | scenario.seed | 0 | Random seed for reproducibility |
Twelve review questions. Try them before reading the answers.

1. Why learn Q(s, a) instead of just V(s)?
2. What does γ do, and why must it be strictly less than 1 here?
3. Why initialize the Q-table to zeros?
4. What's the difference between np.max and np.argmax? When do you want each?
5. Why do we need if done: break? Where does the break go?
6. If you forgot the if done in the TD target and always used the bootstrap formula, what would happen?
7. Why decay ε from high to low instead of keeping it fixed?
8. What does Q ← Q + α (target − Q) mean intuitively?
9. What would happen with r_step = 0 and both goal rewards equal to 1?
10. At episode 0, what does argmax over the all-zero Q-table return, and what would happen without exploration?
11. What does rng.integers(4) return, and why use rng instead of np.random?
12. Why is the Q-table shaped (height, width, 4) and not (width, height, 4)?

Answers:

1. Because the policy needs to know which action is best at each state. A V(s)-only table would say "this cell is good" but not "go RIGHT from here." With Q(s, a), greedy action selection is just argmax over the action axis.
2. γ trades off immediate vs. future rewards. With γ=1 in scenario 3, the infinite-horizon stalling return would diverge to infinity, the Q-values would never converge, and the algorithm would be ill-defined. γ < 1 makes the sum bounded.
3. Zero is a neutral starting point that doesn't bias exploration. With random init, some actions would have spuriously high values that exploitation would lock onto early, before exploration could correct them. (In neural Q-learning we sometimes need careful initialization, but tabular benefits from the simpler zero start.)
4. np.max returns the largest value (a number); np.argmax returns the index of the largest value (an integer). For action selection you always want argmax. For computing the bootstrap target r + γ max_{a'} Q(s', a'), you want max.
5. Because the env doesn't know about terminal states — it'll keep stepping if you keep calling step(). The break exits the inner loop on terminal so we don't pollute Q-values for cells visited after the goal. The break goes after the Q-update, not before, so the terminal reward gets written to the table.
6. You'd add γ max Q(s_goal, a') to the target on terminal steps. Initially Q(goal, *) = 0, so the target would equal the reward and you'd be lucky. But once the agent visits states after the goal (because of the missing break), Q(goal, *) becomes non-zero and biases all future targets upward.
7. Early on, the Q-table is uninformative, so greedy action selection is essentially random and we should mix in real randomness to discover reward. Late in training, Q is accurate, so we should mostly trust it. Linearly decaying ε gets both phases.
8. "Move the current estimate 20% of the way toward the new evidence." Equivalently: Q = (1−α) old_Q + α new_target. It's a smoothed running average of one-sample TD targets.
9. With r_step = 0 and equal goal rewards of 1, both goals give equal total return regardless of path length. The agent would learn to reach some goal, but which one depends on epsilon-greedy randomness and the seed. Almost certainly Goal 2 (closer) due to early random exploration finding it first.
10. At episode 0 with Q all zeros, argmax always returns LEFT (action 0). Without exploration, the agent walks LEFT into the wall every step, never sees a goal, never updates the Q-table. Training fails completely.
11. rng.integers(4) returns a uniform random integer in {0, 1, 2, 3} (4 is exclusive upper bound). We use the seeded rng for reproducibility — the global np.random bypasses the seed.
12. NumPy is row-major: the first axis is the slowest-varying. The natural way to print or visualize a grid is row-by-row, so putting height first lets q_table[y] give you a whole row, which matches how grids are typically rendered.
```bash
cd /Users/ozyphus/Documents/GitHub/cs224r/homework/hw2/hw2
python gridworld_q_learning.py
```
Should print three blocks (one per scenario), each with expected/observed outcome, trajectory, actions, total reward, and start Q-values. If observed_outcome matches expected_outcome for all three, your implementation is correct.
Three big ideas, in order of importance:
- The TD update, Q ← Q + α (target − Q). Same equation, scaled up to neural nets, becomes DQN, SAC, and every value-based method in modern RL.
- Reward design determines behavior. Three step-reward settings produced three different optimal policies, including one that refuses to do the task at all (reward hacking).
- Exploration is not optional. Without ε-greedy, the all-zero Q-table locks the agent into walking LEFT forever; with it, the goal reward gets discovered and propagates back through the table.

If a friend asks: "How does Q-learning work?" — you say: "It's stochastic gradient descent on the Bellman residual. You don't know the right Q-values, but every transition gives you a noisy sample of the right-hand side. Move toward that sample by α each time. Repeat enough and the Q-table converges to the optimal Q-function. The two tricks: discount future rewards by γ to make sums finite, and use ε-greedy to ensure every state-action gets sampled."
You can teach this. On to Problem 2.