
Reinforcement Learning: From Zero to Hero with Python


Build RL agents from first principles to deep Q-networks in Python.

Beginner · reinforcement-learning · rl · markov-decision-process · q-learning

About this course

Reinforcement learning (RL) is the toolkit for building agents that learn by trial and error—optimizing decisions over time rather than predicting labels. This book-style course takes you from first principles to a working Deep Q-Network (DQN) pipeline in Python, with a strong emphasis on clarity, clean experimentation, and practical debugging. You’ll start with the basic agent–environment loop and end with modern deep RL building blocks like replay buffers and target networks.

The progression is intentional: you’ll first learn to think in RL (returns, policies, value), then formalize problems as Markov decision processes (MDPs), then solve small MDPs with planning (dynamic programming), then move to model-free learning (Monte Carlo and temporal-difference), then master Q-learning, and finally scale up to deep function approximation with DQN.

Who this is for

  • Beginners who know basic Python and want a structured, no-handwaving path into reinforcement learning.
  • Practitioners who have “heard of Q-learning/DQN” but want to implement and evaluate them confidently.
  • Students preparing for interviews or research projects who need a coherent mental model and solid baselines.

What you’ll build and practice

Across the six chapters, you’ll repeatedly practice the same professional workflow: define the task, choose an algorithm, implement a minimal baseline, run controlled experiments, and diagnose failure modes. You’ll learn why RL can be unstable, how to interpret learning curves, and how to make improvements that are actually attributable to a change (not randomness).

  • Model tasks as MDPs and use Bellman equations to reason about value and optimality.
  • Implement planning methods (policy iteration, value iteration) to get “gold standard” solutions when the model is known.
  • Learn from experience with Monte Carlo and TD methods, then move to control with SARSA and Q-learning.
  • Train a DQN with the essential stabilizers: replay buffer, target network, sensible loss functions, and practical training heuristics.

How the book is structured

Each chapter reads like a compact technical chapter with milestones and sub-sections that build directly on what came before. You won’t jump into deep networks until you can explain what a value function is and why the Bellman target looks the way it does. By the time you reach DQN, concepts like off-policy learning, bootstrapping, and maximization bias will already be familiar—so the deep RL “magic” becomes a set of understandable design choices.

Get started

If you’re ready to learn reinforcement learning step by step—without skipping fundamentals—start here and follow the path chapter by chapter. Register free to track your progress, or browse all courses to compare related learning paths.

Outcomes

By the end, you’ll be able to implement and evaluate core RL algorithms, choose appropriate baselines, and reason about why an agent is (or isn’t) learning. Most importantly, you’ll have a durable framework for learning more advanced methods like PPO and actor-critic approaches with confidence.

What You Will Learn

  • Explain core RL concepts: agent, environment, reward, policy, value, and returns
  • Model problems as Markov decision processes and reason about optimality
  • Implement dynamic programming methods for tabular MDPs
  • Build and evaluate Monte Carlo and temporal-difference learning agents
  • Implement SARSA and Q-learning with exploration strategies and tuning
  • Train a Deep Q-Network (DQN) with replay buffers and target networks
  • Run clean RL experiments with metrics, baselines, and ablations
  • Diagnose common RL failure modes (instability, overestimation, sparse rewards)

Requirements

  • Basic Python (functions, lists/dicts, NumPy basics)
  • High-school algebra; comfort with simple probability notation is helpful
  • A computer that can run Python locally (GPU optional)

Chapter 1: Reinforcement Learning Basics and Problem Framing

  • Milestone: Translate real tasks into agent–environment loops
  • Milestone: Compute returns and understand discounting
  • Milestone: Define policies, value functions, and the Bellman intuition
  • Milestone: Set up your RL Python toolkit and first toy environment
  • Milestone: Establish evaluation habits (seeds, logging, plots)

Chapter 2: Markov Decision Processes and Bellman Equations

  • Milestone: Formalize an RL task as an MDP
  • Milestone: Derive Bellman expectation equations for V and Q
  • Milestone: Use Bellman optimality to reason about greedy improvement
  • Milestone: Implement a small finite MDP and compute values numerically
  • Milestone: Validate assumptions and spot when Markov property breaks

Chapter 3: Planning with Dynamic Programming (DP)

  • Milestone: Solve known MDPs with iterative policy evaluation
  • Milestone: Implement policy iteration end-to-end
  • Milestone: Implement value iteration and compare with policy iteration
  • Milestone: Analyze convergence, complexity, and stopping criteria
  • Milestone: Use DP solutions as gold baselines for later learning

Chapter 4: Model-Free Prediction and Control (MC and TD)

  • Milestone: Estimate Vπ with Monte Carlo methods from episodes
  • Milestone: Implement TD(0) and compare bias/variance vs MC
  • Milestone: Use n-step returns to bridge MC and TD
  • Milestone: Implement SARSA for on-policy control
  • Milestone: Build an experiment harness to compare learning curves

Chapter 5: Off-Policy Learning and Q-Learning Mastery

  • Milestone: Implement tabular Q-learning for control
  • Milestone: Understand and mitigate maximization bias
  • Milestone: Apply off-policy evaluation basics and importance sampling
  • Milestone: Tune exploration schedules and learning rates systematically
  • Milestone: Diagnose failures (divergence, reward hacking, sparse rewards)

Chapter 6: Deep Reinforcement Learning with DQN and Beyond

  • Milestone: Build a neural Q-function approximator in PyTorch
  • Milestone: Train a DQN with replay buffer and target network
  • Milestone: Stabilize learning with normalization, clipping, and schedules
  • Milestone: Evaluate generalization and robustness across seeds
  • Milestone: Map next steps: policy gradients and actor-critic overview

Dr. Maya Chen

Senior Machine Learning Engineer (Reinforcement Learning)

Dr. Maya Chen is a senior machine learning engineer specializing in reinforcement learning for decision-making systems and simulation-based optimization. She has shipped RL prototypes in robotics and operations research workflows and mentors engineers on reliable experimentation and evaluation.

Chapter 1: Reinforcement Learning Basics and Problem Framing

Reinforcement Learning (RL) is the branch of machine learning focused on learning by interaction: an agent takes actions in an environment, receives feedback (rewards), and gradually improves its behavior. This chapter builds the mental model you will use for the rest of the course: translating messy real tasks into a clean agent–environment loop, computing returns with discounting, defining policies and value functions, and developing the “Bellman intuition” that underpins dynamic programming, Monte Carlo, temporal-difference learning, and Deep Q-Networks.

Along the way, you’ll also establish practical engineering habits—reproducible seeds, lightweight logging, and simple plots—so you can tell whether an improvement is real or accidental. You will set up a small Python toolkit and implement a first toy environment with a Gymnasium-style API, because the fastest way to understand RL is to run the loop yourself and observe learning curves.

  • Milestone: Translate real tasks into agent–environment loops
  • Milestone: Compute returns and understand discounting
  • Milestone: Define policies, value functions, and the Bellman intuition
  • Milestone: Set up your RL Python toolkit and first toy environment
  • Milestone: Establish evaluation habits (seeds, logging, plots)

By the end of this chapter you should be able to look at a task (playing a game, controlling a robot, recommending content, scheduling jobs) and express it as an MDP-style loop with states, actions, rewards, and termination conditions. That framing is the foundation for everything that follows: dynamic programming on tabular MDPs, Monte Carlo and TD learning, SARSA and Q-learning, and eventually DQN.

Practice note for each milestone above: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Section 1.1: What makes RL different from supervised learning

Supervised learning gives you a fixed dataset of input–output pairs and asks you to learn a mapping. Reinforcement learning rarely has that luxury. Instead, the data you learn from is generated by your current policy. When you change the policy, you change the distribution of future experiences. This feedback loop makes RL feel less like “fit a function” and more like “run an experiment, observe outcomes, revise the experiment.”

Three differences matter most in practice. First, RL deals with delayed consequences: an action can hurt you much later, so you need a way to assign credit over time. Second, RL involves exploration: you must try actions that may be worse short-term to discover better long-term strategies. Third, RL learning is often non-stationary because your agent’s behavior changes, and with it the data you see.

  • Common mistake: expecting i.i.d. samples. Trajectories are correlated in time, so naive training can be unstable.
  • Common mistake: optimizing the wrong metric. In RL you optimize expected return, not immediate reward or accuracy.
  • Engineering judgment: start with a tiny environment you can reason about, then scale up. If you can’t explain a learning curve, you can’t trust it.

Throughout this course, you will keep returning to one question: “What does the agent control, what does the environment control, and what data will my current policy generate?” Framing the loop correctly is the first milestone and the biggest multiplier on your debugging speed.

Section 1.2: Agent, environment, states, actions, rewards

An RL problem is an agent–environment loop. At time step t, the environment provides a state (or observation) st. The agent chooses an action at. The environment transitions to a new state st+1 and emits a reward rt+1. This repeats until the episode ends (or forever in continuing tasks). Your job is to make this loop explicit for any real task you encounter.

In many textbooks, the environment is formalized as a Markov Decision Process (MDP): transitions depend only on the current state and action, not the full history. That “Markov property” is not a philosophical statement—it’s a modeling choice. If your state omits important context (e.g., velocity in control, inventory in scheduling), your environment will appear non-Markov and learning will be harder. When in doubt, enrich the state with the missing information or use history-based observations later.

  • State vs observation: the true state may be hidden; you may only observe a partial view. Many Gymnasium environments return observations, not guaranteed Markov states.
  • Action space: discrete actions (left/right) vs continuous actions (torques). This course starts with discrete/tabular settings because they teach the core ideas cleanly.
  • Reward: a scalar feedback signal designed to align behavior with goals. Good reward design is powerful; bad reward design is a frequent source of “smart” agents that do the wrong thing.

Milestone: translate a task into the loop by writing down (1) what the agent observes, (2) what it can do, (3) when it receives reward, and (4) what ends an episode. If you cannot define these unambiguously, you don’t yet have an RL problem—you have a product requirement. The act of specifying the loop is often where hidden assumptions and edge cases surface.
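The loop itself can be sketched in a few lines. The environment below is a hypothetical one-decision stub (not a real Gymnasium environment), but the reset/step signatures mirror the Gymnasium-style interface this course uses:

```python
class OneShotEnv:
    """Minimal stand-in environment (hypothetical): one decision, then the episode ends."""
    def reset(self, seed=None):
        return 0, {}                          # (observation, info)

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        return 0, reward, True, False, {}     # (obs, reward, terminated, truncated, info)

def run_episode(env, policy, max_steps=100):
    """The agent-environment loop: observe, act, receive reward, repeat until done."""
    obs, info = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(obs)                  # the agent controls this
        obs, reward, terminated, truncated, info = env.step(action)  # the environment controls this
        rewards.append(reward)
        if terminated or truncated:
            break
    return rewards

episode_rewards = run_episode(OneShotEnv(), policy=lambda obs: 1)
```

Writing the loop this explicitly is the point: every real task you frame later must slot its observations, actions, rewards, and termination into exactly these four places.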

Section 1.3: Episodic vs continuing tasks and discount factor

RL tasks come in two flavors. Episodic tasks have a natural terminal state: a game ends, a robot falls, a trading day closes. Continuing tasks run indefinitely: thermostat control, server autoscaling, ongoing recommendation. This distinction matters because it changes how we measure success and how we compute the objective the agent should maximize.

The central quantity is the return, the total reward accumulated from a time step onward. In episodic settings, the undiscounted return is often written as Gt = rt+1 + rt+2 + … + rT. In continuing or long-horizon settings, we typically use a discount factor γ (gamma), with 0 ≤ γ < 1, to define Gt = rt+1 + γ rt+2 + γ² rt+3 + …. Discounting makes the infinite sum finite and emphasizes near-term outcomes.

  • Interpretation: smaller γ makes the agent more short-sighted; larger γ values long-term planning. γ is not “right or wrong”—it encodes a preference and affects learnability.
  • Common mistake: choosing γ close to 1 with sparse rewards and expecting fast learning. High γ increases variance and can slow progress without additional shaping.
  • Practical habit: compute and print sample returns from a few episodes early. If returns don’t match your intuition, your reward or termination conditions are probably mis-specified.

Milestone: compute returns and understand discounting. In code, that means being able to take a trajectory of rewards and compute Gt for each step, and being able to explain how changing γ changes what “good behavior” means. This is the first step toward value functions and Bellman equations.
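In code, that computation is a single backward pass. A minimal sketch (the reward list and γ here are arbitrary illustration values):

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every time step of a trajectory via a backward pass:
    G_t = r_{t+1} + gamma * G_{t+1}, starting from the end of the episode."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]              # re-order so out[t] is the return from step t

# A sparse trajectory: reward only on the final step.
gs = discounted_returns([0.0, 0.0, 1.0], gamma=0.9)  # ≈ [0.81, 0.9, 1.0]
```

Notice how the single terminal reward propagates backward, shrinking by a factor of γ per step: that decay is exactly what "near-term outcomes matter more" means numerically.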

Section 1.4: Policies, value functions, and optimal behavior

A policy tells the agent what to do. In the simplest case, it’s a function mapping states to actions. More generally, a stochastic policy maps states to a distribution over actions: π(a|s). Stochasticity is not just for randomness; it can be essential for exploration and for optimality in some environments.

A value function measures how good it is to be in a state (or take an action in a state) under a policy. The state-value function is Vπ(s) = E[Gt | st=s, π]. The action-value function is Qπ(s,a) = E[Gt | st=s, at=a, π]. Values are predictions of future return; policies use those predictions to choose actions.

The “Bellman intuition” is that values satisfy a self-consistency relationship: today’s value equals today’s reward plus discounted value of tomorrow. In expectation form, V(s) is roughly E[r + γ V(s′)]. This idea powers dynamic programming (value iteration, policy iteration) when you know the model, and it also motivates learning rules like temporal-difference updates when you don’t.

  • Optimality: an optimal policy π* achieves the highest expected return from each state. Its value functions satisfy Bellman optimality equations.
  • Common mistake: confusing “reward” with “value.” Reward is immediate feedback; value is a long-term prediction.
  • Practical outcome: you can debug an agent by checking if its value estimates are plausible (e.g., higher near goal states, lower near hazards).

Milestone: define policies and value functions clearly enough that you could implement tabular dynamic programming for a tiny MDP. You will do exactly that in the next chapters: first with known transitions (planning), then with sampled experience (learning).
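To make the self-consistency concrete, here is the Bellman backup on a hypothetical deterministic three-state chain A → B → G, where G is terminal and entering G pays reward 1 (all other rewards are 0):

```python
gamma = 0.9

# Terminal state contributes no future value.
V = {"G": 0.0}
# Back up from the terminal state: value = immediate reward + gamma * next value.
V["B"] = 1.0 + gamma * V["G"]   # entering G pays 1
V["A"] = 0.0 + gamma * V["B"]   # no immediate reward, but B leads to the goal
```

The resulting table (V[B] > V[A] > V[G-as-terminal]) matches the debugging heuristic above: values should be higher near goal states, and each state's value should equal reward plus discounted successor value.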

Section 1.5: Reward design, credit assignment, and exploration preview

Reward design is where RL meets reality. A reward is a proxy for your true objective, and proxies can be exploited. If you reward “move forward,” an agent may spin its wheels in place to trigger sensors. If you reward “time alive,” it may learn to hide. The goal is not to design a perfect reward—it’s to design one that leads to the behavior you want under optimization pressure.

Credit assignment is the challenge of figuring out which earlier actions caused a later outcome. Sparse terminal rewards (win/lose) are clean but hard to learn from; dense shaping rewards provide guidance but risk changing the optimal policy if not done carefully. A practical approach is iterative: start with a reward that matches the task definition, add minimal shaping that reflects progress, then verify the agent isn’t finding loopholes.

  • Checklist for rewards: Is the reward bounded? Does it align with success? Can it be gamed? Is it comparable across episodes?
  • Exploration preview: greedy policies get stuck. You will later use ε-greedy, softmax, and other strategies to trade off trying new actions vs exploiting known good ones.
  • Common mistake: changing multiple things at once (reward + γ + algorithm). When learning fails, you won’t know why.

This course will repeatedly revisit exploration and tuning. For now, adopt an engineering mindset: write down the intended behavior, list plausible failure modes, and add diagnostics (episode length, reward components, success rate). These habits make later algorithms—SARSA, Q-learning, DQN—much easier to evaluate honestly.

Section 1.6: Practical setup: Python, NumPy, Gymnasium-style APIs

RL becomes concrete when you can run the loop. Use Python 3.10+ and create an isolated environment (venv or conda). Install core packages you will rely on throughout: NumPy for arrays, matplotlib for plots, and a Gymnasium-compatible API for environments (either Gymnasium itself or simple custom environments that mimic its interface). Even if you later move to deep learning, the same structure persists.

A Gymnasium-style environment typically exposes reset(seed=...) and step(action). reset returns an initial observation (and info). step returns (obs, reward, terminated, truncated, info). Your agent loop collects these into trajectories. For a first toy environment, implement something tiny you can reason about, such as a 1D random-walk to a goal. Keep the state space small so you can print tables and verify behavior.
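A minimal sketch of such a toy environment and the loop that drives it — a hypothetical deterministic 1D walk that mimics the Gymnasium interface, not an official Gymnasium env:

```python
import random

class RandomWalk1D:
    """Hypothetical toy environment with a Gymnasium-style API: positions 0..n,
    start in the middle, episode ends at either edge, reward 1 at the right edge."""

    def __init__(self, n=6):
        self.n = n
        self.pos = n // 2

    def reset(self, seed=None):
        # Seedable like Gymnasium's reset(seed=...); the rng is unused in this
        # deterministic sketch but is where "slippery" dynamics would plug in.
        self.rng = random.Random(seed)
        self.pos = self.n // 2
        return self.pos, {}                   # (observation, info)

    def step(self, action):
        # action 0 = left, 1 = right (deterministic moves in this sketch)
        self.pos += 1 if action == 1 else -1
        terminated = self.pos in (0, self.n)
        reward = 1.0 if self.pos == self.n else 0.0
        return self.pos, reward, terminated, False, {}  # (obs, r, terminated, truncated, info)

env = RandomWalk1D()
obs, info = env.reset(seed=0)
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(1)  # always step right
    total += reward
    done = terminated or truncated
```

With a state space this small you can print the whole position-by-position picture and verify every transition by hand before any learning code exists.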

  • Seeding: set seeds for NumPy and the environment. Without this, you can’t tell if changes improved learning or just got lucky.
  • Logging: record episode return, episode length, and a moving average. Save hyperparameters with results.
  • Plots: plot return vs episode with confidence bands across multiple runs. Single-run curves are often misleading.

Milestone: establish evaluation habits. From day one, run at least 3–10 seeds for any claim you make, and keep your plots simple and comparable. RL variance is real; disciplined evaluation is how you avoid “hero demos” that can’t be reproduced. With this toolkit and loop in place, you are ready to implement your first baseline agents in the next chapter and improve them methodically.
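The multi-seed habit fits in a few lines. The training function below is a hypothetical stand-in that returns seeded random noise — precisely the null case your evaluation should be able to distinguish from real learning:

```python
import random
import statistics

def final_score(seed):
    """Hypothetical stand-in for one full training run: seed in, average return out."""
    rng = random.Random(seed)
    return statistics.mean(rng.random() for _ in range(100))

# Any claim ("variant B beats variant A") should rest on several seeds, not one.
scores = [final_score(seed) for seed in range(5)]
mean = statistics.mean(scores)
spread = statistics.stdev(scores)
```

Report the mean with its spread (or a confidence band in plots); if two variants' bands overlap heavily, you do not yet have evidence of an improvement.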

Chapter milestones
  • Milestone: Translate real tasks into agent–environment loops
  • Milestone: Compute returns and understand discounting
  • Milestone: Define policies, value functions, and the Bellman intuition
  • Milestone: Set up your RL Python toolkit and first toy environment
  • Milestone: Establish evaluation habits (seeds, logging, plots)
Chapter quiz

1. Which framing best captures a real task as a reinforcement learning problem in this chapter?

Correct answer: An agent repeatedly observes a state, takes an action, receives a reward, and continues until termination
RL is defined here as learning by interaction via an agent–environment loop with states, actions, rewards, and termination.

2. What does it mean to compute a return with discounting?

Correct answer: Combine rewards over time while weighting future rewards less than immediate rewards
Discounting is introduced as the mechanism for valuing near-term rewards more than far-future rewards when computing returns.

3. In the chapter’s core definitions, what is a policy?

Correct answer: A rule (mapping) the agent uses to choose actions based on what it observes (state)
Policies are described as the agent’s behavior: how it selects actions in the environment.

4. What is the purpose of developing “Bellman intuition” in this chapter?

Correct answer: To understand the recursive relationship behind value functions that supports methods like dynamic programming, Monte Carlo, TD learning, and DQN
The chapter highlights Bellman intuition as foundational for value-based reasoning used across major RL algorithm families.

5. Which practice best helps you tell whether a reported improvement is real rather than accidental, according to the chapter?

Correct answer: Use reproducible seeds along with lightweight logging and simple plots
The chapter emphasizes evaluation habits—seeds, logging, and plots—to distinguish genuine progress from randomness.

Chapter 2: Markov Decision Processes and Bellman Equations

Reinforcement learning becomes much easier to reason about once you commit to a clean mathematical model of “what the world is” and “what the agent can do.” This chapter builds that model using Markov Decision Processes (MDPs) and then derives the Bellman equations, which are the backbone of dynamic programming, policy evaluation, and most value-based RL algorithms.

In practice, the biggest upgrade you will make as an RL engineer is to move from an intuitive description (“the agent navigates a grid and gets points”) to a precise specification you can compute with: a state space, an action set, transition probabilities, a reward function, and a discount factor. Once you have those pieces, you can compute the value of a policy numerically, compare policies, and reason about how greedy improvement leads toward optimal behavior.

This chapter’s milestones map directly onto a real workflow: formalize the RL task as an MDP; derive Bellman expectation equations for V and Q; use Bellman optimality to justify greedy improvement; implement a tiny MDP in Python and compute values; and finally, learn to recognize when your “state” is missing information and the Markov property breaks.

Practice note for each milestone above: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Section 2.1: MDP definition: (S, A, P, R, γ)

An MDP is the standard formal object for many RL problems. You specify a tuple (S, A, P, R, γ):

  • S: the set of states (what the agent “knows” at decision time)
  • A: the set of actions available to the agent
  • P: the transition dynamics, typically P(s'|s,a)
  • R: the reward model, often R(s,a,s') or R(s,a)
  • γ: discount factor in [0,1), controlling how future rewards matter

To hit the first milestone (“formalize an RL task as an MDP”), start from the environment description and force yourself to write each component explicitly. For a gridworld, S might be cell coordinates; A could be {up, down, left, right}; P could be deterministic or “slippery” with probability of unintended motion; R might be -1 per step and +10 at goal; and γ might be 0.99 to prioritize reaching the goal quickly while still valuing future outcomes.

Engineering judgment: keep the model as small as possible while staying faithful to the task. Huge state spaces make computation expensive; overly compressed state spaces can break the Markov property (covered in Section 2.2). Common mistakes include mixing up rewards and returns (reward is immediate; return is accumulated), or forgetting to define what happens at terminal states. A practical habit is to document terminal conditions and transitions out of terminal states (usually self-loop with zero reward) so your evaluation code doesn’t contain hidden exceptions.

Finally, interpret γ as a knob: lower values make learning and evaluation more stable and myopic; values close to 1 make the agent far-sighted but can increase variance and slow convergence. In continuing tasks without natural termination, γ is also what keeps the sum of future rewards finite.
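Writing each component explicitly is easiest to internalize in code. A hypothetical two-state MDP with every piece of the (S, A, P, R, γ) tuple spelled out, including the terminal self-loop convention:

```python
# A hypothetical two-state MDP written out component by component.
S = ["s0", "goal"]                     # states; "goal" is terminal
A = ["stay", "go"]                     # actions
gamma = 0.99                           # discount factor

# P[s][a] -> list of (probability, next_state); must sum to 1 per (s, a)
P = {
    "s0":   {"stay": [(1.0, "s0")],
             "go":   [(0.8, "goal"), (0.2, "s0")]},   # "slippery" transition
    "goal": {"stay": [(1.0, "goal")],
             "go":   [(1.0, "goal")]},                # terminal self-loops
}

# R[s][a] -> expected immediate reward; zero out of the terminal state
R = {
    "s0":   {"stay": -1.0, "go": -1.0},
    "goal": {"stay": 0.0,  "go": 0.0},
}

# Sanity check the model before any algorithm touches it.
for s in S:
    for a in A:
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9
```

The final assertion is the documentation habit from above turned into code: probabilities that don't sum to 1 and rewards leaking out of terminal states are the two most common model-specification bugs.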

Section 2.2: Markov property and state representation

The Markov property is the assumption that makes MDPs work: given the current state s and action a, the distribution of the next state and reward is independent of the past. Informally: the state must summarize everything from history that matters for the future.

This is not just theory; it is a design constraint on your state representation. If your state omits a variable that affects dynamics or reward, your value estimates will look noisy, inconsistent, or “non-learnable,” because the same state-action pair leads to different outcomes depending on hidden history. For example, imagine a robot with a battery that drains. If the state includes only position but not battery level, then the transition probability to “stuck” depends on how long you have been moving. That violates Markov, and your MDP model is wrong even if your code runs.

To validate Markov assumptions (a key milestone), run a simple diagnostic: collect many transitions from the same observed state-action pair and check if the empirical distribution of next states/rewards changes when you condition on earlier history. In tabular environments you can do this with counts; in continuous environments you can approximate with bins or a learned classifier that tries to predict next outcomes from history beyond s. If history improves prediction, your state is missing information.
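A count-based sketch of that diagnostic, run on synthetic traces built so the next observation copies the one from two steps back (an assumed stand-in for logged transitions from your own environment):

```python
import random
from collections import Counter

rng = random.Random(0)

# Synthetic non-Markov traces: the next observation equals the one from TWO
# steps back, so the current observation alone does not determine the future.
uncond = Counter()   # counts of (current, next)
cond = Counter()     # counts of (previous, current, next)
for _ in range(2000):
    x = [rng.choice("LR"), rng.choice("LR")]
    for _ in range(10):
        x.append(x[-2])                       # one step of hidden history
    for prev, cur, nxt in zip(x, x[1:], x[2:]):
        uncond[(cur, nxt)] += 1
        cond[(prev, cur, nxt)] += 1

# Given only the current observation "L", the next observation looks ~50/50...
p_next_L = uncond[("L", "L")] / (uncond[("L", "L")] + uncond[("L", "R")])
# ...but conditioning on one extra step of history makes it deterministic:
# whenever prev == "L", next is always "L". History improves prediction,
# so this observation is not a Markov state.
history_violations = cond[("L", "L", "R")] + cond[("L", "R", "R")]
```

The same recipe scales: whenever conditioning on extra history sharpens the empirical next-state distribution for a fixed observed (s, a), your state representation is missing information.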

  • Common mistake: treating an observation as a state. In many simulators, the “observation” is partial (e.g., a camera frame), not Markov by itself.
  • Practical fix: augment the state with derived variables (velocity, timers, last action), or use memory (Section 2.6).
  • Workflow tip: write down what physics or rules govern the environment, then list which variables affect those rules. Your state should include them or you must accept partial observability.
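As a concrete sketch of the counting diagnostic described above, assume a hypothetical log format of (prev_state, state, action, next_state) tuples and compare the empirical next-state distribution with and without the extra history variable:

```python
from collections import Counter, defaultdict

def history_dependence(transitions):
    """Largest gap between empirical P(s'|s,a) and P(s'|s,a,prev).

    `transitions` is a list of (prev_state, state, action, next_state)
    tuples (a hypothetical logging format). A large return value means
    that conditioning on extra history shifts the next-state
    distribution, i.e. the state is missing information.
    """
    by_sa = defaultdict(Counter)    # (s, a) -> next-state counts
    by_hsa = defaultdict(Counter)   # (prev, s, a) -> next-state counts
    for prev, s, a, s_next in transitions:
        by_sa[(s, a)][s_next] += 1
        by_hsa[(prev, s, a)][s_next] += 1

    worst = 0.0
    for (prev, s, a), hist in by_hsa.items():
        base = by_sa[(s, a)]
        n_hist, n_base = sum(hist.values()), sum(base.values())
        for s_next in base:
            worst = max(worst, abs(hist[s_next] / n_hist - base[s_next] / n_base))
    return worst

# Battery robot example: same (position, action), but the outcome depends
# on the hidden "how long have we been moving" variable.
logs = [("rested", "pos1", "move", "pos2")] * 10 + \
       [("moving", "pos1", "move", "stuck")] * 10
print(history_dependence(logs))  # 0.5: history clearly shifts the distribution
```

If the returned gap stays near zero across state-action pairs, history adds nothing and your state is plausibly Markov; a large gap pinpoints exactly which (s, a) pair leaks history.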

When your state is well-formed, the rest of this chapter becomes powerful: Bellman equations become correct identities, and dynamic programming becomes a reliable tool rather than a fragile approximation.

Section 2.3: Bellman expectation equations for Vπ and Qπ

Once the task is an MDP and you have a policy π(a|s), you can define value functions that measure long-term desirability. The state-value is Vπ(s) = E[ Gt | St=s ], where the return Gt = Rt+1 + γRt+2 + γ²Rt+3 + ⋯. The action-value is Qπ(s,a) = E[ Gt | St=s, At=a ].

The Bellman expectation equations (milestone) express a self-consistency: today’s value equals immediate reward plus discounted value of tomorrow, under the same policy. For Vπ:

Vπ(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]

For Qπ:

Qπ(s,a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Σ_{a'} π(a'|s') Qπ(s',a') ]

These are not “algorithms” yet; they are equations you can use to build algorithms. In tabular settings, you can solve them as a linear system (exact policy evaluation) or iterate them (iterative policy evaluation). Engineering judgment: iteration is usually simpler and numerically robust, but you need stopping criteria (e.g., max change < 1e-8) and you must handle terminal states carefully (no future value beyond terminal).

Common mistakes: mixing up which expectation belongs where (policy averages over actions; environment averages over next states), and forgetting that rewards can depend on (s,a,s'). When debugging, print a single state’s Bellman target and verify each term: probabilities sum to 1, rewards match your environment, and discounting is applied exactly once per step.
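The single-state debugging habit above might look like the following sketch, assuming a hypothetical dictionary layout with P[s][a] as (prob, s_next) pairs and R[(s, a, s_next)] rewards:

```python
def bellman_v_target(s, policy, P, R, V, gamma):
    """One state's Bellman expectation backup with built-in checks.

    Assumed (hypothetical) model layout: P[s][a] is a list of
    (prob, s_next) pairs and R[(s, a, s_next)] is the reward.
    Prints every term so you can verify the target by hand.
    """
    assert abs(sum(policy[s].values()) - 1.0) < 1e-9, "policy not normalized"
    total = 0.0
    for a, pi_a in policy[s].items():
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9, "probs must sum to 1"
        for p, s_next in P[s][a]:
            term = pi_a * p * (R[(s, a, s_next)] + gamma * V[s_next])
            print(f"  a={a}, s'={s_next}: term={term:.4f}")  # discount applied once
            total += term
    return total

P = {"s0": {"go": [(1.0, "s1")]}}
R = {("s0", "go", "s1"): 1.0}
V = {"s1": 2.0}
policy = {"s0": {"go": 1.0}}
target = bellman_v_target("s0", policy, P, R, V, gamma=0.9)
print(target)  # 1.0 + 0.9 * 2.0, i.e. close to 2.8
```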

Section 2.4: Bellman optimality equations and greedy policies

Expectation equations evaluate a fixed policy. Optimality equations answer the harder question: what is the best achievable value? Define V*(s) = max over policies π of Vπ(s), and similarly Q*(s,a). The Bellman optimality equations replace the policy expectation with a max:

V*(s) = max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Q*(s,a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ max_{a'} Q*(s',a') ]

This section supports the milestone “use Bellman optimality to reason about greedy improvement.” The core idea is policy improvement: if you have an estimate of action-values Qπ, you can build a new policy π' that is greedy with respect to it: π'(s) = argmax_a Qπ(s,a). Under standard conditions, this greedy policy is guaranteed to be no worse than π (and often better). This is the theoretical engine behind policy iteration and value iteration.

Practical takeaway: “greedy” is safe as an improvement step when your Q-values are accurate. In learning algorithms, Q-values are approximate and noisy, so pure greedy action selection can get stuck early. That is why later chapters introduce exploration strategies (e.g., ε-greedy). But the target you learn from in many methods still uses a greedy max term; that is Bellman optimality showing up inside learning updates.

  • Common mistake: using max over next-state values but evaluating under a non-greedy policy (accidentally mixing optimality and expectation forms).
  • Engineering tip: be explicit about whether your update is “on-policy” (uses the policy’s action distribution) or “off-policy” (uses a greedy max). Naming functions bellman_expectation_target vs bellman_optimality_target helps prevent subtle bugs.
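The naming tip above can be sketched as two small target functions. For illustration, assume a single deterministic transition whose next-state action-values q_next are already known; the names and simplified signatures are hypothetical:

```python
def bellman_expectation_target(q_next, policy_probs, reward, gamma):
    """On-policy target: average next action-values under pi(a'|s')."""
    return reward + gamma * sum(p * q_next[a] for a, p in policy_probs.items())

def bellman_optimality_target(q_next, reward, gamma):
    """Off-policy target: assume the greedy (max) action at s'."""
    return reward + gamma * max(q_next.values())

q_next = {"left": 1.0, "right": 3.0}
pi = {"left": 0.5, "right": 0.5}
print(bellman_expectation_target(q_next, pi, reward=0.0, gamma=0.9))  # 1.8
print(bellman_optimality_target(q_next, reward=0.0, gamma=0.9))       # 2.7
```

Putting the two forms side by side makes the difference concrete: the expectation target averages over the policy's action distribution, while the optimality target always assumes the best next action.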

When you internalize that Bellman optimality is simply “do the best action now and assume best behavior later,” you can reason about optimal strategies without running simulation, as long as your MDP model is correct.

Section 2.5: Finite MDPs, transition matrices, and sanity checks

In small, finite MDPs, you can compute values numerically and use them as ground truth. This is a milestone because it builds intuition and gives you a debugging reference before you move to sampling-based learning. The key is to represent dynamics as matrices and vectors.

Suppose |S|=n and |A|=m. For each action a, define an n×n transition matrix Pa where Pa[i,j] = P(sj|si,a). Define an expected reward vector ra of length n with ra[i] = Σ_{s'} P(s'|si,a) R(si,a,s'). For a stochastic policy π, the induced transition matrix is Pπ = Σ_a diag(π(a|s)) Pa (conceptually: average transitions by action probabilities). Then the Bellman expectation equation becomes a linear system:

Vπ = rπ + γ Pπ Vπ   ⇔   (I - γ Pπ) Vπ = rπ

In Python, you can solve this with numpy.linalg.solve for exact evaluation (assuming the matrix is well-conditioned). Or you can perform iterative evaluation: initialize V=0 and repeatedly apply V ← r + γ P V until convergence. Iteration is closer to how dynamic programming is usually implemented and mirrors later TD learning.

  • Sanity check 1: each row of each Pa must sum to 1 (within floating tolerance).
  • Sanity check 2: terminal states should have either zero reward and self-loop transitions, or be handled explicitly; either way, be consistent.
  • Sanity check 3: compare iterative vs closed-form solution for Vπ; they should match.

A practical exercise is to implement a 3-state MDP (start, middle, terminal) with two actions (safe vs risky), compute Vπ for a chosen policy, then compute V* by value iteration using the optimality update. Seeing the numbers stabilize over iterations builds the mental model you will rely on later when values are learned from samples instead of computed from known P and R.
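A sketch of this exercise with NumPy, under one assumed set of transition probabilities and rewards for the safe/risky actions (any numbers with rows summing to 1 would do):

```python
import numpy as np

# Hypothetical numbers for the 3-state exercise: 0=start, 1=middle,
# 2=terminal (self-loop, zero reward); actions 0=safe, 1=risky.
gamma = 0.9
P = np.array([
    [[0.0, 1.0, 0.0],   # safe: deterministic progress
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0],   # risky: may bounce back to start
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0]],
])
r = np.array([          # expected reward per (action, state)
    [0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0],
])
assert np.allclose(P.sum(axis=2), 1.0)  # sanity check 1: rows sum to 1

# Evaluate the always-safe policy two ways (sanity check 3).
safe = 0
V_exact = np.linalg.solve(np.eye(3) - gamma * P[safe], r[safe])
V_iter = np.zeros(3)
for _ in range(1000):
    V_iter = r[safe] + gamma * P[safe] @ V_iter
print(np.max(np.abs(V_exact - V_iter)))  # close to 0: the two solutions match

# Value iteration toward V*: apply the optimality backup until stable.
V = np.zeros(3)
for _ in range(1000):
    V = np.max(r + gamma * P @ V, axis=0)
print(V)  # roughly [3.03, 3.36, 0.]: risky is worth it from the middle state
```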

Section 2.6: Partial observability and when you need memory

Not every RL problem fits neatly into an MDP with a small, clean state. Many realistic environments are partially observable: the agent receives observations o rather than true states s, and those observations may not satisfy the Markov property. This is where many projects fail quietly: the code runs, but learning stalls or produces brittle behavior.

Classic examples: a noisy sensor, a hidden mode (e.g., “slippery floor” today vs not), or a delayed effect (a door that closes 5 seconds after you push a button). If the observation doesn’t contain the hidden variable, two identical observations can require different actions depending on past events. Your model is now a POMDP (Partially Observable MDP), and the Bellman equations on observations are not valid identities.

How do you proceed in practice? First, decide whether you can restore Markov by state augmentation. Add variables you can compute from data: elapsed time, last action, last reward, velocity, inventory, cooldown timers. In environments built from game rules, it is often possible to expose the internal state directly and keep the problem as an MDP.

If augmentation is impossible (true hidden state), you need memory. Common approaches include stacking recent observations (e.g., last 4 frames), using a recurrent network (GRU/LSTM) to build a belief-like internal state, or maintaining an explicit Bayesian belief state when the model is known. Even in tabular settings, you can create a “history state” like (ot-1, ot) to approximate Markov behavior.
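A minimal sketch of observation stacking in the tabular spirit of a history state (the class name and interface are illustrative, not a standard API):

```python
from collections import deque

class ObservationStack:
    """Approximate a Markov state by stacking the last k observations.

    Minimal sketch with an illustrative interface: call reset() with
    the first observation, push() each new one, and state() to read
    the agent's state.
    """
    def __init__(self, k):
        self.buf = deque(maxlen=k)
        self.k = k

    def reset(self, obs):
        self.buf.clear()
        for _ in range(self.k):   # pad so state() always has length k
            self.buf.append(obs)

    def push(self, obs):
        self.buf.append(obs)      # oldest observation falls off the left

    def state(self):
        return tuple(self.buf)    # hashable, so usable as a tabular state

stack = ObservationStack(k=2)
stack.reset("o0")
stack.push("o1")
print(stack.state())  # ('o0', 'o1'): the (o_{t-1}, o_t) history state
```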

  • Symptom: value estimates for the same observation have high variance and depend on untracked history.
  • Debug tactic: log short trajectories and look for “aliasing”: identical observations followed by different correct actions.
  • Practical outcome: you become deliberate about what you call a state, and you know when MDP-based dynamic programming is appropriate versus when you must switch to memory-based agents.

The key milestone here is not to avoid partial observability, but to spot it early. Once you can articulate exactly why the Markov property breaks, you can choose an engineering solution (augment, add memory, or accept approximation) instead of guessing why learning is unstable.

Chapter milestones
  • Milestone: Formalize an RL task as an MDP
  • Milestone: Derive Bellman expectation equations for V and Q
  • Milestone: Use Bellman optimality to reason about greedy improvement
  • Milestone: Implement a small finite MDP and compute values numerically
  • Milestone: Validate assumptions and spot when Markov property breaks
Chapter quiz

1. Which set of components is required to precisely specify an RL task as a Markov Decision Process (MDP) in this chapter?

Show answer
Correct answer: State space, action set, transition probabilities, reward function, and discount factor
The chapter emphasizes moving from an intuitive description to a computable specification: states, actions, transitions, rewards, and discounting.

2. What do the Bellman expectation equations enable you to do for a fixed policy?

Show answer
Correct answer: Express V and Q in terms of expected immediate reward plus discounted expected value of successor states (and compute them numerically)
Bellman expectation equations relate value functions to one-step lookahead expectations under a given policy, supporting policy evaluation and numerical computation.

3. How does Bellman optimality connect to greedy improvement in the chapter’s workflow?

Show answer
Correct answer: Choosing actions greedily with respect to value (or Q) provides a principled way to improve behavior toward optimality
Bellman optimality is used to justify why greedy choices with respect to current value estimates can lead toward optimal behavior.

4. After you implement a tiny finite MDP in Python, what is a primary reason to compute values numerically?

Show answer
Correct answer: To evaluate and compare policies using the defined transitions, rewards, and discount factor
With a concrete MDP specification, you can compute policy values, compare policies, and reason about improvements.

5. What does it typically indicate when the Markov property breaks for your chosen state representation?

Show answer
Correct answer: The state is missing information needed to predict future transitions and rewards given the current action
The chapter highlights validating assumptions and recognizing when the 'state' lacks necessary information, violating the Markov property.

Chapter 3: Planning with Dynamic Programming (DP)

So far you have framed reinforcement learning problems in terms of states, actions, rewards, and returns. In this chapter you will do something that feels almost “too easy” compared to learning from experience: you will solve an MDP by planning. Dynamic programming (DP) methods assume you already know the environment’s dynamics—its transition probabilities and reward function. When that model is available, DP becomes your most reliable tool for computing optimal policies in tabular settings.

The practical importance of DP goes beyond small toy problems. DP solutions become “gold baselines” you can trust. Later, when you build Monte Carlo and temporal-difference agents, you will use DP as a reference to detect bugs, validate reward scaling, and sanity-check exploration and learning rates. DP also teaches you the Bellman equations in an operational way: you will implement the backups and see how values propagate across the state space.

This chapter is organized around milestones you can implement directly in Python: (1) solve known MDPs with iterative policy evaluation; (2) implement policy iteration end-to-end; (3) implement value iteration and compare it with policy iteration; (4) make engineering decisions about stopping criteria and compute cost; and (5) use DP outputs as test oracles for later learning algorithms.

Practice note for Milestone: Solve known MDPs with iterative policy evaluation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Implement policy iteration end-to-end: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Implement value iteration and compare with policy iteration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Analyze convergence, complexity, and stopping criteria: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Use DP solutions as gold baselines for later learning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Why planning is different from learning in RL

Planning methods (DP) and learning methods (Monte Carlo, TD) often use similar equations, but they differ in what data they rely on. Planning assumes you know the full MDP model: the transition distribution P(s'|s,a) and expected reward R(s,a,s') (or R(s,a)). Learning assumes you can only interact with the environment to gather samples. That single difference changes your workflow and your failure modes.

In planning, you compute expectations by summing over all next states. This gives low-variance updates, deterministic given the model, and clear convergence guarantees for discounted problems. In learning, you replace sums with samples; you trade bias/variance and must manage exploration. Practically, DP is what you do when the environment is small and known (gridworlds, board-game abstractions, inventory control with known dynamics), or when you want a reference solution for debugging.

DP also changes how you think about “data.” You do not collect trajectories; you perform sweeps over the state space. A sweep is simply: for each state, update its value using the current estimates of successor states. The chapter milestones reflect that mindset: you will start with iterative policy evaluation (evaluate a fixed policy via repeated sweeps), then add improvement to get policy iteration, and finally compress evaluation and improvement into value iteration.

  • Engineering judgment: DP is ideal for tabular MDPs where you can enumerate states/actions. If you cannot, DP becomes infeasible and you move to approximate methods (which you will later).
  • Common mistake: Trying to run DP with only a simulator and no transition table. Without a model, you are doing learning, not planning.

By the end of this chapter you should be able to implement DP solvers as small, testable modules: a function to compute Bellman expectations for a policy, another to greedify a policy w.r.t. values, and wrappers for policy/value iteration with clear stopping tolerances.

Section 3.2: Iterative policy evaluation with sweeps

Iterative policy evaluation is your first milestone: solve known MDPs with iterative policy evaluation. Given a policy π(a|s) and a discount γ, the state-value function is defined by the Bellman expectation equation:

Vπ(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]

In tabular DP you compute this by repeated sweeps. Start with V(s)=0 (or any initialization), then for each state apply the right-hand side using the current V. After enough sweeps, values converge to Vπ for discounted continuing tasks and for episodic tasks with proper terminal handling.

There are two classic sweep styles:

  • Synchronous (Jacobi): compute a new table V_new from the old V, then replace at the end of the sweep. Easier to reason about and test.
  • In-place (Gauss–Seidel): update V(s) immediately, so later states in the sweep see fresher values. Often converges faster, but results can depend on state ordering.

Practical workflow: represent the model as nested dictionaries or arrays: for each (s,a), store a list of transitions (prob, s_next, reward, done). Then policy evaluation is a tight loop over states and actions. Track Δ = max_s |V_new(s)-V(s)| after each sweep and stop when Δ < θ (a small tolerance). This is your first introduction to stopping criteria that are principled yet engineering-driven.

Common mistakes include forgetting to zero out the bootstrap term when transitioning into terminal states (treat terminal as having value 0), mixing up expected reward vs sampled reward, and failing to normalize policy probabilities. A good habit is to unit-test one state’s update by hand on a tiny MDP before running full sweeps.
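A synchronous (Jacobi) sweep over the (prob, s_next, reward, done) model layout described above might look like this sketch:

```python
def policy_evaluation(model, policy, gamma=0.9, theta=1e-8):
    """Synchronous (Jacobi) iterative policy evaluation.

    Assumed layout, as in the text: model[s][a] is a list of
    (prob, s_next, reward, done) transitions and policy[s][a] is an
    action probability.
    """
    V = {s: 0.0 for s in model}
    while True:
        V_new = {}
        for s in model:
            v = 0.0
            for a, pi_a in policy[s].items():
                for prob, s_next, reward, done in model[s][a]:
                    # zero the bootstrap term on terminal transitions
                    v += pi_a * prob * (reward + (0.0 if done else gamma * V[s_next]))
            V_new[s] = v
        delta = max(abs(V_new[s] - V[s]) for s in model)
        V = V_new                     # replace the whole table per sweep
        if delta < theta:
            return V

# Tiny two-state chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
model = {
    "s0": {"go": [(1.0, "s1", 0.0, False)]},
    "s1": {"go": [(1.0, "s1", 1.0, True)]},   # done=True: s_next unused
}
policy = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}
V = policy_evaluation(model, policy)
print(V)  # {'s0': 0.9, 's1': 1.0}
```

Note how the done flag, not a special terminal value in V, is what prevents bootstrapping past termination.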

Section 3.3: Policy improvement and policy iteration

Policy iteration is the second milestone: implement policy iteration end-to-end. It combines two alternating steps: (1) evaluate the current policy to get Vπ, then (2) improve the policy by acting greedily with respect to that value function.

To improve a policy, you compute action-values under the current value function:

Qπ(s,a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]

Then set the new policy to choose actions that maximize Qπ. In deterministic form, π_new(s) = argmax_a Qπ(s,a). In stochastic form, you can split probability mass among ties. This is not a cosmetic detail: consistent tie-breaking improves reproducibility and can prevent oscillations in symmetric MDPs.

The key theoretical property is the policy improvement theorem: if you greedify w.r.t. Qπ, you get a policy that is no worse (and often better) in every state. Practically, policy iteration tends to converge in few improvement steps because evaluation makes the policy “fully informed” before changing it.

Implementation tips:

  • Exact vs truncated evaluation: You rarely need evaluation to machine precision. Use iterative evaluation with a tolerance and/or a fixed number of sweeps (modified policy iteration). This reduces compute dramatically.
  • Stopping condition: Stop when the policy is stable (no action changes in any state) or when improvements stop changing values meaningfully.
  • Data structures: Represent policy as an array of action probabilities per state, even if you start deterministic. This makes evaluation code uniform and simplifies later extensions.

As an engineering milestone, once policy iteration works, you effectively have a “solver” for any tabular MDP with a known model. Save the resulting V* and π* to disk; you will later compare learning curves of TD methods against these optima.
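Putting evaluation and greedy improvement together, here is one possible end-to-end sketch using exact evaluation via the linear system; the (A, S, S) / (A, S) array layout is an assumption:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Tabular policy iteration sketch with exact evaluation.

    Assumed layout: P has shape (A, S, S) with P[a, s, s'] transition
    probabilities; r has shape (A, S) with expected rewards.
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)          # deterministic policy: one action per state
    while True:
        # 1) exact evaluation: solve (I - gamma * P_pi) V = r_pi
        P_pi = P[pi, np.arange(S)]
        r_pi = r[pi, np.arange(S)]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # 2) greedy improvement; argmax breaks ties by smallest index
        Q = r + gamma * P @ V            # shape (A, S)
        pi_new = np.argmax(Q, axis=0)
        if np.array_equal(pi_new, pi):   # policy stable: stop
            return pi, V
        pi = pi_new

# Tiny example: from state 0, action 1 pays 1 and action 0 pays 0;
# state 1 is terminal (self-loop, zero reward).
P = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
r = np.array([[0.0, 0.0],
              [1.0, 0.0]])
pi_star, V_star = policy_iteration(P, r)
print(pi_star)  # [1 0]: the rewarding action is chosen in state 0
```

Swapping the exact solve for a few evaluation sweeps turns this into modified policy iteration with no other changes.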

Section 3.4: Value iteration and the role of max operator

Value iteration is the third milestone: implement value iteration and compare with policy iteration. Instead of fully evaluating a policy and then improving it, value iteration applies the Bellman optimality operator directly to the value function:

V(s) ← max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ]

The max is the defining feature. It turns the expectation backup into an optimality backup: you are no longer computing values for a fixed policy, you are pushing V toward V*. After convergence, you extract a greedy policy: π*(s)=argmax_a Σ_{s'} P(s'|s,a)[R+γV*(s')].
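A compact sketch of the optimality backup, again assuming transitions as an (A, S, S) tensor and expected rewards as an (A, S) array:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, theta=1e-8):
    """Optimality backups until the max change drops below theta.

    Assumed layout: P is an (A, S, S) transition tensor, r an (A, S)
    array of expected rewards.
    """
    V = np.zeros(P.shape[1])
    while True:
        V_new = np.max(r + gamma * P @ V, axis=0)   # the defining max over actions
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    pi = np.argmax(r + gamma * P @ V_new, axis=0)   # extract a greedy policy
    return V_new, pi

# Toy choice: from state 0, "leave" pays 1 and terminates; "stay" pays
# 0.5 and repeats. Staying forever is worth 0.5 / (1 - 0.9) = 5 > 1.
P = np.array([[[0.0, 1.0], [0.0, 1.0]],    # action 0: leave
              [[1.0, 0.0], [0.0, 1.0]]])   # action 1: stay
r = np.array([[1.0, 0.0],
              [0.5, 0.0]])
V_opt, pi_opt = value_iteration(P, r)
print(V_opt[0], pi_opt[0])  # close to 5.0, and the greedy action is 1 (stay)
```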

In practice, value iteration often requires more sweeps than policy iteration but each sweep is simpler—there is no separate evaluation loop nested inside the algorithm. If your evaluation step in policy iteration is expensive (many sweeps per improvement), value iteration can be faster overall. If the MDP is small and evaluation is cheap, policy iteration may converge in fewer outer iterations and be easier to debug.

Two practical comparisons you should run:

  • Sweep count: Compare total backups (state-action-next-state computations) rather than just iterations. This is a fairer measure of cost.
  • Policy quality vs time: Value iteration produces increasingly good greedy policies as V improves. You can stop early and still get a usable policy, which is useful when you need an approximate plan quickly.

Common mistake: updating V(s) using a policy probability distribution during value iteration. If you include π(a|s) you are performing policy evaluation, not optimal control. Another mistake is extracting the policy only once at the end but forgetting to handle ties consistently; in symmetric environments, different tie-breakers can produce different (but still optimal) policies.

Section 3.5: Convergence intuition, contraction mappings, tolerances

DP works because the Bellman operators for discounted MDPs are contractions. Informally, a contraction shrinks differences between value functions every time you apply the operator. For discount γ with 0 ≤ γ < 1, both the policy evaluation operator Tπ and the optimality operator T* contract in the max norm by at most a factor of γ. This is the mathematical reason iterative updates converge instead of bouncing around.

Why this matters in code: you need stopping criteria that balance runtime and accuracy. The standard approach is to monitor Δ = max_s |V_new(s)-V(s)|. When Δ becomes small, your value function is close to a fixed point. A useful engineering rule is to choose θ based on the reward scale. If rewards are in [-1, 1], tolerances like 1e-6 can be overkill; 1e-4 or 1e-3 is often enough for stable greedy policies.

Also consider the effect of γ. When γ is close to 1, contraction is weak and convergence slows; you will need more sweeps or looser tolerances. This is not a bug—your MDP is valuing far-future outcomes, so information must propagate farther through the state graph.

  • Complexity accounting: One sweep in policy evaluation is O(|S||A||S'|) if you iterate over all next states. With sparse transitions (common in gridworlds), count only non-zero transitions to keep it practical.
  • Stopping for control: For policy iteration, you can stop evaluation early (few sweeps) as long as the subsequent greedy improvement is stable. This is a standard trade-off in real implementations.

This section completes the milestone on convergence, complexity, and stopping criteria: your goal is not just to “get convergence,” but to choose a tolerance and iteration budget that gives correct decisions at reasonable cost.

Section 3.6: DP pitfalls: stochasticity, tie-breaking, and evaluation

DP is deterministic given a model, but implementations still fail in subtle ways. The most common pitfall is mishandling stochasticity. If your transition model lists multiple outcomes for the same action, you must sum over all of them, multiplying by their probabilities. Skipping low-probability outcomes or forgetting to normalize probabilities can quietly distort values and produce wrong policies that still “look plausible.”

Tie-breaking is another practical issue. In many environments, multiple actions can be equally good within tolerance. If you use argmax without care, tiny floating-point differences can cause the chosen action to flip between runs or between iterations, making policy stability checks unreliable. A robust approach is: compute all actions within ε of the max (e.g., 1e-12), then either pick the smallest-index action (deterministic) or distribute probability uniformly among ties (stochastic but controlled). Record the convention and keep it consistent.
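The robust tie-breaking convention described above might be implemented as the following sketch (function names are illustrative):

```python
import numpy as np

def greedy_actions(q_values, eps=1e-12):
    """Indices of all actions within eps of the max (the tie set)."""
    q = np.asarray(q_values, dtype=float)
    return np.flatnonzero(q >= q.max() - eps)

def greedy_policy_row(q_values, deterministic=True, eps=1e-12):
    """Deterministic: smallest-index winner. Stochastic: uniform over ties."""
    ties = greedy_actions(q_values, eps)
    probs = np.zeros(len(q_values))
    if deterministic:
        probs[ties[0]] = 1.0              # reproducible convention
    else:
        probs[ties] = 1.0 / len(ties)
    return probs

q = [1.0, 1.0 + 1e-15, 0.5]   # a floating-point "tie" between actions 0 and 1
print(greedy_policy_row(q))                       # [1. 0. 0.]
print(greedy_policy_row(q, deterministic=False))  # [0.5 0.5 0. ]
```

A raw np.argmax on q would silently pick action 1 here; the eps window makes the tie explicit and the convention auditable.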

Evaluation also means more than “compute V.” To use DP solutions as gold baselines (the final milestone), you should verify them in at least two ways:

  • Bellman residual: After convergence, compute the maximum Bellman error (residual) for V under Tπ or T*. Large residuals indicate a bug or insufficient sweeps.
  • Rollout check: Simulate episodes using the derived policy in the environment (even if you have the model) and confirm the empirical return matches V(s0) within sampling error.

Finally, be careful with terminal states and episodic tasks. A common mistake is to continue discounting past termination, effectively giving terminal states outgoing transitions. In your transition table, mark terminal transitions with done=True and ensure the backup uses γ * V(s') only when not done.
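The residual check and the done-flag rule come together in one small verification sketch (same hypothetical model layout as the evaluation section):

```python
def bellman_residual(model, V, gamma=0.9):
    """Max Bellman optimality residual of tabular V.

    Assumed layout from this chapter: model[s][a] is a list of
    (prob, s_next, reward, done) transitions. A large residual after
    "convergence" points to a bug or too few sweeps.
    """
    worst = 0.0
    for s, actions in model.items():
        backups = []
        for transitions in actions.values():
            q = 0.0
            for prob, s_next, reward, done in transitions:
                # discount only when the transition is not terminal
                q += prob * (reward + (0.0 if done else gamma * V[s_next]))
            backups.append(q)
        worst = max(worst, abs(max(backups) - V[s]))
    return worst

model = {
    "s0": {"go": [(1.0, "s1", 0.0, False)]},
    "s1": {"go": [(1.0, "s1", 1.0, True)]},
}
print(bellman_residual(model, {"s0": 0.9, "s1": 1.0}))  # 0.0 at the fixed point
```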

When DP is implemented with these details in mind, it becomes a dependable reference implementation. Later chapters will introduce sampling noise and function approximation; having a rock-solid DP baseline will keep you grounded when learning curves behave unexpectedly.

Chapter milestones
  • Milestone: Solve known MDPs with iterative policy evaluation
  • Milestone: Implement policy iteration end-to-end
  • Milestone: Implement value iteration and compare with policy iteration
  • Milestone: Analyze convergence, complexity, and stopping criteria
  • Milestone: Use DP solutions as gold baselines for later learning
Chapter quiz

1. What key assumption makes dynamic programming (DP) methods applicable for solving an MDP by planning?

Show answer
Correct answer: You already know the environment’s transition probabilities and reward function
DP planning relies on a known model (transition probabilities and reward function) to compute values and policies.

2. Why does the chapter emphasize DP solutions as “gold baselines” later in the course?

Show answer
Correct answer: They provide trusted reference outputs to validate and debug later learning agents
DP solutions can be used as test oracles to detect bugs and sanity-check choices like reward scaling and learning rates.

3. In operational terms, what does implementing Bellman backups in DP help you observe?

Show answer
Correct answer: How values propagate across the state space through repeated updates
DP makes the Bellman equations concrete by repeatedly backing up values and watching them propagate through states.

4. Which milestone best matches the goal of evaluating a fixed policy’s value function using DP?

Show answer
Correct answer: Solve known MDPs with iterative policy evaluation
Iterative policy evaluation computes the value function for a given (fixed) policy when the model is known.

5. When choosing a stopping criterion for DP planning, what trade-off is the chapter highlighting?

Show answer
Correct answer: Balancing convergence accuracy against computational cost
Stopping criteria are engineering decisions that trade more computation for tighter convergence (more accurate values/policies).

Chapter 4: Model-Free Prediction and Control (MC and TD)

So far you have seen how to solve tabular Markov decision processes (MDPs) when you can compute expectations from a known transition model. In many real problems you do not get that luxury: you only get experience. This chapter moves into the model-free setting, where learning happens from sampled trajectories (episodes or continuing streams) produced by interacting with the environment.

We will build two families of methods that form the backbone of practical reinforcement learning: Monte Carlo (MC) methods that learn from complete returns, and temporal-difference (TD) methods that learn by bootstrapping from existing value estimates. You will implement MC prediction to estimate Vπ from episodes, implement TD(0) and compare its bias/variance trade-offs against MC, use n-step returns as a bridge between them, and then transition from prediction to control by implementing SARSA. Finally, you will create an experiment harness to compare learning curves and make engineering decisions based on evidence rather than intuition.

  • Practical outcome: you will be able to train and evaluate tabular agents from raw experience without a model.
  • Engineering outcome: you will learn how to stabilize learning with step-size schedules, exploration choices, and clean experimental comparisons.

Throughout, keep the workflow consistent: (1) define a policy and value representation, (2) generate experience, (3) compute targets, (4) update estimates, (5) evaluate and iterate. Small implementation details—how you store episodes, how you seed randomness, how you average returns—matter a lot in model-free RL.

Practice note for Milestone: Estimate Vπ with Monte Carlo methods from episodes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Implement TD(0) and compare bias/variance vs MC: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Use n-step returns to bridge MC and TD: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Implement SARSA for on-policy control: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Build an experiment harness to compare learning curves: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Model-free setting and sampling trajectories
Section 4.2: Monte Carlo prediction and exploring starts
Section 4.3: Temporal-difference learning and bootstrapping
Section 4.4: TD error, learning rates, and stability heuristics
Section 4.5: n-step methods and eligibility trace intuition
Section 4.6: On-policy control with SARSA and epsilon-greedy

Section 4.1: Model-free setting and sampling trajectories

In the model-free setting, you do not know P(s'|s,a) or R(s,a) as tables you can query. Instead, you sample transitions by stepping through the environment: (s_t, a_t, r_{t+1}, s_{t+1}). Your algorithms must estimate values from these samples. This changes both the math (expectations become sample averages) and the engineering (you must build data structures to record trajectories and compute returns reliably).

Two experience regimes are common. In episodic tasks, interaction ends at a terminal state; you can store the entire episode and compute full returns. In continuing tasks, there may be no terminal state; you often use discounted returns and online updates. This chapter focuses on episodic sampling first (for Monte Carlo), then moves to online updates (for TD and control).

  • Trajectory: a list of time-indexed transitions; typically store states, actions, rewards, and dones.
  • Return: G_t = r_{t+1} + γ r_{t+2} + …, computed from the sampled rewards.
  • Policy evaluation goal: estimate Vπ(s) = Eπ[G_t | s_t=s] or Qπ(s,a) from data generated by policy π.

Common mistakes here are off-by-one reward indexing (mixing r_t vs r_{t+1}), forgetting to reset the environment between episodes, and accidentally using a different policy for acting than the one you claim to evaluate. When debugging, print the first episode’s transitions and manually compute one return to validate your bookkeeping.

Practical workflow: write a function generate_episode(env, policy, seed=None) that returns a structured episode object. Keep it deterministic under a seed, because reproducibility is essential when you later build a learning-curve harness.
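A minimal sketch of such a helper, assuming a simple environment interface where reset(seed) returns a state and step(a) returns (next_state, reward, done); Gym-style APIs differ slightly, and the Episode container is a hypothetical name:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One trajectory: parallel lists of states, actions, rewards."""
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)  # rewards[t] is r_{t+1}

def generate_episode(env, policy, seed=None, max_steps=10_000):
    """Roll out one episode under `policy`; deterministic under `seed`."""
    rng = random.Random(seed)
    s = env.reset(seed=seed)
    ep = Episode()
    for _ in range(max_steps):
        a = policy(s, rng)               # policy maps (state, rng) -> action
        s_next, r, done = env.step(a)
        ep.states.append(s)
        ep.actions.append(a)
        ep.rewards.append(r)
        if done:
            break
        s = s_next
    return ep
```

Printing the first returned Episode and hand-computing one return is a quick sanity check on the indexing.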

Section 4.2: Monte Carlo prediction and exploring starts

Monte Carlo (MC) prediction estimates values using complete sampled returns from episodes. The key idea is simple: if you observe many returns following visits to a state, the average return approximates Vπ(s). This directly supports the milestone: estimate Vπ with Monte Carlo methods from episodes.

Implementation pattern (first-visit MC): for each episode, compute G_t backwards, and for each state’s first occurrence in that episode, update an average. The incremental mean update is numerically stable and memory efficient: V[s] ← V[s] + (1/N[s]) (G − V[s]). You can also use a constant step size α to track non-stationarity, but in stationary tabular problems the sample mean is a strong baseline.

  • First-visit vs every-visit: every-visit uses all visits in an episode (more updates, sometimes more variance); first-visit reduces correlated samples.
  • When MC shines: unbiased targets (the full return) and simple code; but it needs complete episodes and can have high variance.
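The first-visit pattern with the incremental mean update can be sketched as follows; first_visit_mc is a hypothetical helper name, and episodes are assumed to arrive as (states, rewards) pairs:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte Carlo prediction with incremental means.

    `episodes` is an iterable of (states, rewards) pairs, where
    rewards[t] is the reward received after leaving states[t].
    """
    V = defaultdict(float)
    N = defaultdict(int)
    for states, rewards in episodes:
        # Compute G_t for every t by sweeping the episode backwards.
        G = 0.0
        returns = [0.0] * len(states)
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            returns[t] = G
        # Update only at each state's first occurrence in this episode.
        seen = set()
        for t, s in enumerate(states):
            if s in seen:
                continue
            seen.add(s)
            N[s] += 1
            V[s] += (returns[t] - V[s]) / N[s]   # incremental mean
    return dict(V)
```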

MC control (improving a policy from sampled returns) introduces an exploration challenge: you must ensure every state-action pair you care about is visited. One classical idea is exploring starts: begin episodes from random state-action pairs so all pairs are reachable and sampled. This is often unrealistic in real environments but very useful pedagogically and in simulators. In practice, you more often use an ε-soft policy (random actions with small probability) to guarantee ongoing exploration.

Common mistakes: forgetting that MC is on-policy unless you add importance sampling; mixing evaluation and improvement logic without enough exploration; and evaluating a changing policy with returns generated by older policies (non-stationary data). To keep things clean, start with pure prediction: fix π, generate episodes with π, and only then consider control.

Section 4.3: Temporal-difference learning and bootstrapping

Temporal-difference (TD) methods update estimates from incomplete experience by combining a sampled reward with a bootstrap estimate of the future. TD(0) for state values uses the one-step target r_{t+1} + γ V(s_{t+1}). This enables online learning: you can update after every step without waiting for the episode to end. This section supports the milestone: implement TD(0) and compare bias/variance vs MC.

The essential difference is the target. MC uses G_t, a high-variance but unbiased estimate of Vπ(s_t). TD(0) uses r_{t+1} + γ V(s_{t+1}), which depends on current estimates and is therefore biased early on, but usually has much lower variance and learns faster from limited data.

  • TD(0) update: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
  • Terminal handling: if s_{t+1} is terminal, treat V(s_{t+1}) = 0 (or do not bootstrap beyond the terminal).
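A minimal sketch of the TD(0) update with terminal masking; td0_update is a hypothetical name, and returning the TD error is a convenience for logging:

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)].

    The bootstrap term is masked when s_next is terminal, so no value
    leaks back across episode boundaries.
    """
    bootstrap = 0.0 if done else gamma * V.get(s_next, 0.0)
    delta = r + bootstrap - V.get(s, 0.0)        # TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```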

Bootstrapping is a feature and a risk. It is a feature because it propagates information quickly (a single reward can influence many preceding states via repeated updates). It is a risk because poor estimates can reinforce themselves if learning rates and exploration are mismanaged. In tabular settings TD(0) is typically stable, but you should still watch for divergence symptoms: values exploding, learning curves oscillating wildly, or results changing drastically with tiny hyperparameter tweaks.

Practical comparison experiment: fix a policy π, run MC prediction and TD(0) prediction on the same environment for a fixed number of episodes, and measure error versus a reference value (from dynamic programming or a very large MC run). You should observe MC improving slowly but steadily, and TD(0) improving sooner with some bias early on.

Section 4.4: TD error, learning rates, and stability heuristics

The TD error is the learning signal that powers TD prediction and many control algorithms: δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t) (or the analogous form for Q). Interpreting δ correctly helps you debug: positive δ means “better than expected,” negative means “worse than expected.” Logging the distribution of δ over time is often more informative than logging raw values.

Choosing the learning rate α is the most important engineering decision in tabular TD. With a constant α, learning can remain adaptive but may not fully converge; with a decaying α_t(s)=1/N(s), convergence is more likely in stationary problems but early learning may be slower. For control, constant α is common because the policy changes over time (the target is non-stationary).

  • Heuristic 1: start with α in [0.05, 0.5] for small tabular problems, then tune by looking for smooth improvement without persistent oscillation.
  • Heuristic 2: use separate step-size schedules per state (or per state-action) when visitation is highly imbalanced.
  • Heuristic 3: clip rewards (or scale them) if the environment’s reward magnitude makes updates too aggressive.
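Both step-size choices (constant α and the decaying 1/N(s) schedule) can sit behind a single interface, as a sketch; StepSize is a hypothetical name:

```python
from collections import defaultdict

class StepSize:
    """Per-state step size: constant alpha, or 1/N(s) when alpha is None."""
    def __init__(self, alpha=None):
        self.alpha = alpha
        self.counts = defaultdict(int)

    def __call__(self, s):
        # Each call counts one visit to state s.
        self.counts[s] += 1
        return self.alpha if self.alpha is not None else 1.0 / self.counts[s]
```

Passing a StepSize instance into your update loop lets you swap schedules in an ablation without touching the update rule.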

Stability is also affected by episode length and discount γ. With γ near 1, targets have longer effective horizons and variance increases; you may need smaller α. Another common mistake is mishandling terminal states: if you bootstrap from terminal values accidentally, you can leak value back into states that should not receive it.

This is where an experiment harness becomes indispensable: run multiple seeds, track mean and confidence intervals, and prefer conclusions that hold across seeds. A single lucky run can make an unstable configuration look “good.”

Section 4.5: n-step methods and eligibility trace intuition

MC and TD(0) are endpoints on a spectrum. n-step methods bridge them by using targets that look ahead n steps with sampled rewards and then bootstrap: G_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n-1} r_{t+n} + γ^n V(s_{t+n}). This supports the milestone: use n-step returns to bridge MC and TD. Setting n=1 yields TD(0); letting n reach the episode end yields MC.

Practically, n-step methods often learn faster than TD(0) because they incorporate more real rewards before bootstrapping, while still updating earlier than full MC. The trade-off is implementation complexity: you need a sliding window (queue) of recent transitions, and you must handle episode termination carefully when fewer than n steps remain.

  • Implementation tip: store recent (s, r) pairs in a deque; when its length exceeds n, compute and apply the n-step update for the oldest state.
  • Termination: after done=True, keep flushing the deque by performing the remaining updates with truncated returns (no bootstrap beyond terminal).
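An offline variant of the n-step update, applied after an episode finishes (equivalent to flushing the deque at termination); n_step_td is a hypothetical name:

```python
def n_step_td(states, rewards, V, n=3, alpha=0.1, gamma=0.99):
    """n-step TD prediction over one finished episode.

    G_t^(n) = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n}),
    truncated (no bootstrap) when t+n reaches past the terminal step.
    """
    T = len(states)
    for t in range(T):
        G, discount = 0.0, 1.0
        for k in range(t, min(t + n, T)):    # up to n sampled rewards
            G += discount * rewards[k]
            discount *= gamma
        if t + n < T:                        # bootstrap only inside episode
            G += discount * V.get(states[t + n], 0.0)
        v = V.get(states[t], 0.0)
        V[states[t]] = v + alpha * (G - v)
    return V
```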

Eligibility traces provide intuition for why n-step updates help. Instead of waiting to update a state until it becomes the “oldest” in a window, traces assign each visited state a decaying credit that determines how strongly new TD errors update it. You can think of this as blending many n-step returns at once, weighting longer-horizon credit by a parameter λ (trace decay). Even if you do not implement full TD(λ) yet, keep the mental model: recent states get more credit; earlier states get some credit, but less. This intuition becomes crucial when you move to function approximation and deep RL later.

Section 4.6: On-policy control with SARSA and epsilon-greedy

Prediction estimates values for a fixed policy; control improves the policy. SARSA is the classic on-policy TD control algorithm for learning action values Q(s,a) while behaving according to the same (exploratory) policy it is learning. This section ties directly to the milestone: implement SARSA for on-policy control.

SARSA’s update uses the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}): Q(s_t,a_t) ← Q(s_t,a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t,a_t)]. The next action a_{t+1} is sampled from the current behavior policy, so the method learns the value of what it actually does—including exploratory moves.

Exploration is typically implemented with ε-greedy: with probability ε choose a random action, otherwise choose an action that maximizes Q(s,·). Engineering judgment is required when scheduling ε. A constant ε (e.g., 0.1) keeps exploring forever; a decaying schedule (e.g., linear or exponential decay) shifts from exploration to exploitation. In small tabular tasks, a simple schedule like ε_t = max(ε_min, ε_0 * decay^episode) is usually sufficient.

  • Common mistake: selecting a_{t+1} greedily during the update while still acting ε-greedy. That silently changes the algorithm into something closer to Q-learning.
  • Diagnostics: track episode return, episode length, and the fraction of random actions taken; these reveal whether learning is stuck due to too little or too much exploration.
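A sketch of ε-greedy selection and the SARSA update with the actually-sampled a_{t+1}; both function names are hypothetical, and Q is assumed to be a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """Random action with probability eps, otherwise greedy w.r.t. Q."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy TD control: bootstrap from the sampled next action."""
    bootstrap = 0.0 if done else gamma * Q.get((s_next, a_next), 0.0)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + bootstrap - q)
```

Note that a_next must come from the same ε-greedy policy used for acting; substituting the greedy action here would silently turn this into Q-learning.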

Finally, build an experiment harness to compare learning curves across configurations (this satisfies the milestone about comparing curves). Standardize: same number of episodes, same evaluation protocol, multiple seeds, and consistent metrics. A minimal harness runs train(agent, env, episodes, seed), returns per-episode rewards, then aggregates mean and variability across seeds. With this in place, you can make grounded decisions such as “TD(0) reaches a useful policy in fewer episodes than MC control” or “a slower ε decay prevents premature convergence.”
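One possible shape for such a harness, assuming train(agent, env, episodes, seed) returns a list of per-episode returns; run_experiment and the factory arguments are hypothetical names:

```python
import statistics

def run_experiment(train, agent_factory, env_factory, episodes, seeds):
    """Run `train` across seeds; return per-episode mean and spread.

    Fresh agent and env per seed keep runs independent and comparable.
    """
    curves = [train(agent_factory(), env_factory(), episodes, seed)
              for seed in seeds]
    mean = [statistics.fmean(ep) for ep in zip(*curves)]
    spread = [statistics.stdev(ep) if len(ep) > 1 else 0.0
              for ep in zip(*curves)]
    return mean, spread
```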

Chapter milestones
  • Milestone: Estimate Vπ with Monte Carlo methods from episodes
  • Milestone: Implement TD(0) and compare bias/variance vs MC
  • Milestone: Use n-step returns to bridge MC and TD
  • Milestone: Implement SARSA for on-policy control
  • Milestone: Build an experiment harness to compare learning curves
Chapter quiz

1. Which statement best contrasts Monte Carlo (MC) prediction with TD(0) prediction in the model-free setting?

Show answer
Correct answer: MC updates value estimates from complete episode returns, while TD(0) updates by bootstrapping from the next state's current estimate.
MC learns from full returns after episodes, while TD(0) bootstraps from existing value estimates using one-step targets.

2. When comparing MC and TD(0), which bias/variance trade-off is most consistent with the chapter summary?

Show answer
Correct answer: MC tends to have lower bias but higher variance; TD(0) tends to have higher bias but lower variance due to bootstrapping.
Bootstrapping in TD(0) introduces bias but often reduces variance compared to full-return MC targets.

3. How do n-step returns "bridge" MC and TD methods?

Show answer
Correct answer: They interpolate between using only full returns (large n) and using bootstrapped one-step targets (n = 1).
n-step returns use n rewards then bootstrap, ranging from TD(0) at n=1 to MC as n approaches the episode length.

4. What change marks the transition from prediction to control in this chapter?

Show answer
Correct answer: Learning action-value estimates and improving behavior using an on-policy control method like SARSA.
Control means improving the policy; SARSA is an on-policy method that learns (and uses) action values to do so.

5. Why does the chapter emphasize building an experiment harness to compare learning curves?

Show answer
Correct answer: To make engineering decisions based on evidence by running clean, comparable evaluations (e.g., with careful randomness and averaging).
A harness enables fair comparisons across algorithms/settings and highlights how implementation details (seeding, averaging, storage) affect results.

Chapter 5: Off-Policy Learning and Q-Learning Mastery

This chapter is the pivot from “learning from what you do” to “learning from what you (or someone else) did.” That shift—off-policy learning—is what makes Q-learning so powerful in practice: you can collect data with an exploratory behavior policy, then learn a greedy control policy that targets optimal behavior. The engineering challenge is that this freedom introduces new failure modes: unstable updates, maximization bias, misleading evaluation, and brittle exploration schedules.

We will work toward a concrete milestone: implement tabular Q-learning for control in a small discrete MDP, then strengthen it with systematic tuning and diagnostics. Along the way, you will learn how to recognize and mitigate maximization bias (the reason vanilla Q-learning can overestimate values), apply basic off-policy evaluation with importance sampling, and debug common problems like divergence, reward hacking, and sparse rewards. By the end, you should be able to build a reliable training loop, choose exploration and learning-rate schedules on purpose, and evaluate progress without fooling yourself.

A practical mindset for this chapter: treat the algorithm as only one part of the system. Your results will depend just as much on exploration design, reproducibility (random seeds, logging), and validation (baselines, ablations) as on the update rule itself.

Practice note for Milestone: Implement tabular Q-learning for control: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Understand and mitigate maximization bias: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Apply off-policy evaluation basics and importance sampling: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Tune exploration schedules and learning rates systematically: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone: Diagnose failures (divergence, reward hacking, sparse rewards): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: On-policy vs off-policy and why it matters
Section 5.2: Q-learning update rule and greedy target
Section 5.3: Exploration strategies: epsilon schedules and soft policies
Section 5.4: Maximization bias, Double Q-learning idea
Section 5.5: Off-policy evaluation: importance sampling basics
Section 5.6: Debugging RL: metrics, baselines, and ablations

Section 5.1: On-policy vs off-policy and why it matters

In on-policy methods (like SARSA), the agent learns the value of the policy it actually follows. If you behave with an ε-greedy policy, your targets and updates reflect that same exploratory behavior. This creates a tight feedback loop: what you do is what you learn. On-policy learning is often more stable because the data distribution matches the policy being evaluated.

Off-policy methods break that link. You collect experience using a behavior policy μ(a|s) (often exploratory), but you learn about a different target policy π(a|s) (often greedy). Q-learning is the canonical example: it can behave ε-greedily, but its update targets the greedy action under current estimates. This difference matters because it enables sample reuse, learning from historical logs, and training from mixed data sources (self-play, demonstrations, replay buffers later in DQN).

However, off-policy learning introduces distribution mismatch: you may be updating values for actions rarely taken by the behavior policy. In tabular settings, Q-learning is still guaranteed to converge under classic assumptions (sufficient exploration, decaying step sizes), but in practice you can still get slow learning or misleading evaluation if you do not track what data you are actually collecting.

  • Practical outcome: You will maintain two mental models: behavior policy for data collection, target policy for control objectives.
  • Common mistake: Reporting performance of the behavior policy (exploratory) when you care about the greedy target policy. Always evaluate both when debugging.
  • Engineering judgment: If safety or constraints matter, off-policy learning lets you keep behavior conservative while still improving a target policy offline.

For the milestone “implement tabular Q-learning for control,” this section clarifies why the learned Q-function is aimed at optimal control even while you deliberately inject exploration into behavior.

Section 5.2: Q-learning update rule and greedy target

Tabular Q-learning learns an action-value function Q(s,a) by bootstrapping from the best next action according to current estimates. Given a transition (s, a, r, s', done), the core update is:

target = r + γ * (1 - done) * max_a' Q(s', a')
Q(s,a) ← Q(s,a) + α * (target - Q(s,a))

The “greedy target” max_a' Q(s',a') is what makes Q-learning off-policy: even if your behavior chose a random action, the update assumes the next step will act optimally (greedily). Implementation-wise, the milestone is to produce a minimal, correct control loop: initialize Q to zeros, run episodes, select actions with an exploration policy, step the environment, update Q, and repeat.

Two details make or break real implementations. First, handle terminal transitions correctly by masking out the bootstrap term when done=True; otherwise values can leak beyond episode boundaries. Second, ensure your state representation is stable and hashable (e.g., integers for discrete states, tuples for grid positions). Silent bugs often come from indexing Q incorrectly or mixing up (s,a) keys.

  • Workflow: (1) choose α, γ, and an ε schedule; (2) run training for N episodes; (3) periodically evaluate the greedy policy (no exploration); (4) save the Q table and plots.
  • Common mistake: Evaluating with ε>0 and concluding learning failed because performance is noisy. Use a separate evaluation run with ε=0.
  • Practical outcome: You can now learn an approximately optimal policy in discrete MDPs without knowing transition dynamics.
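Putting the pieces together, a minimal training loop might look like this; q_learning is a hypothetical name, and the environment is assumed to expose reset() -> state and step(a) -> (next_state, reward, done):

```python
import random

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99,
               eps=0.1, seed=0):
    """Minimal tabular Q-learning: epsilon-greedy behavior, greedy target."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy behavior policy.
            if rng.random() < eps:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q.get((s, x), 0.0))
            s_next, r, done = env.step(a)
            # Greedy target, masked at terminal transitions.
            bootstrap = 0.0 if done else gamma * max(
                Q.get((s_next, x), 0.0) for x in range(n_actions))
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + alpha * (r + bootstrap - q)
            s = s_next
    return Q
```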

When your Q-learning loop is correct, you should see learning curves that improve steadily in simple environments. If not, Section 5.6 will show how to diagnose whether the issue is exploration, step size, reward scaling, or a bug in episode termination.

Section 5.3: Exploration strategies: epsilon schedules and soft policies

Q-learning needs sufficient exploration: if you never try an action, you cannot learn its consequences. The simplest strategy is ε-greedy: with probability ε pick a random action; otherwise pick argmax_a Q(s,a). The tuning problem is that ε controls two competing goals: exploration (discovering better actions) and exploitation (refining what you already believe is good). A fixed ε is rarely ideal across the whole training run.

A systematic approach is to treat ε as a schedule. Common choices:

  • Linear decay: ε starts high (e.g., 1.0) and decays to a floor (e.g., 0.05) over K steps/episodes.
  • Exponential decay: ε_t = max(ε_min, ε_0 * decay^t), smoother early decay.
  • Two-phase: maintain high exploration until returns become non-trivial, then decay more aggressively.

Soft policies like Boltzmann/softmax exploration choose actions proportional to exp(Q(s,a)/τ), where τ (temperature) controls randomness. Softmax can be more sensitive to Q scale: if rewards are large, it becomes nearly greedy unless you increase τ. ε-greedy is often more robust to reward scaling, which is why it is popular for tabular control.
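A sketch of softmax (Boltzmann) action selection; softmax_action is a hypothetical name, and subtracting the max Q-value before exponentiating is a standard overflow guard:

```python
import math
import random

def softmax_action(q_values, tau=1.0, rng=random):
    """Boltzmann exploration: P(a) proportional to exp(Q(s,a)/tau)."""
    m = max(q_values)
    # Shift by the max so the exponentials stay in a safe range.
    prefs = [math.exp((q - m) / tau) for q in q_values]
    return rng.choices(range(len(q_values)), weights=prefs)[0]
```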

Learning rate α interacts with exploration. With high ε and high α, Q-values may fluctuate because you update from many random transitions. With low ε and high α, you may overfit early estimates and stop exploring too soon. A practical tuning workflow is to sweep a small grid: α in {0.1, 0.3, 0.5} and ε_min in {0.01, 0.05, 0.1}, keeping γ fixed, then choose based on median performance over multiple random seeds.

Practical outcome: you can now “tune exploration schedules and learning rates systematically” instead of guessing. This matters most in sparse-reward tasks, where you may need sustained exploration long enough to stumble into success at least a few times.

Section 5.4: Maximization bias, Double Q-learning idea

The max operator in Q-learning is both the feature and the bug. When Q estimates are noisy (early training, stochastic environments), taking max_a' Q(s',a') tends to select actions whose values are overestimated by chance. This creates maximization bias: Q-learning can systematically overestimate action values, which can slow learning or lead to brittle policies that chase phantom advantages.

You can often spot maximization bias by comparing estimated Q-values with empirical returns. For example, the greedy policy may appear to have high predicted value, but actual evaluation returns lag far behind, and Q-values keep inflating. Another sign is instability when you increase exploration or when rewards are stochastic: the max over noisy samples becomes more optimistic.

Double Q-learning reduces this bias by decoupling action selection from action evaluation. Maintain two tables, Q1 and Q2. On each update, randomly pick one to update. If updating Q1, use Q1 to select the best next action, but use Q2 to evaluate it:

target = r + γ * Q2(s', argmax_a' Q1(s',a'))

Symmetrically when updating Q2. Intuitively, the noise in Q1 and Q2 is less likely to agree on the same overly-optimistic action, so overestimation is reduced. This section fulfills the milestone “understand and mitigate maximization bias” with a concrete tool that stays tabular and easy to implement.
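The two-table update can be sketched as follows; double_q_update is a hypothetical name, and the Q tables are assumed to be dicts keyed by (state, action):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, done, n_actions,
                    alpha=0.1, gamma=0.99, rng=random):
    """Double Q-learning: select with one table, evaluate with the other,
    and flip a coin to decide which table receives the update."""
    if rng.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    if done:
        target = r
    else:
        best = max(range(n_actions),
                   key=lambda x: select.get((s_next, x), 0.0))
        target = r + gamma * evaluate.get((s_next, best), 0.0)
    q = select.get((s, a), 0.0)
    select[(s, a)] = q + alpha * (target - q)
```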

  • Engineering judgment: If you see persistent overestimation or unstable greedy performance, implement Double Q-learning before trying elaborate exploration tricks.
  • Common mistake: Averaging Q1 and Q2 for action selection during training can be fine, but be consistent in evaluation (e.g., use argmax over (Q1+Q2)).

Double Q-learning is also the conceptual bridge to Deep RL: Double DQN applies the same idea with neural networks and a target network. Understanding it here will make later DQN stabilization techniques feel motivated rather than magical.

Section 5.5: Off-policy evaluation: importance sampling basics

Off-policy control answers “how do I learn a better policy from exploratory data?” Off-policy evaluation (OPE) answers “how good is a target policy, using data collected by a different behavior policy?” This matters when evaluation in the real environment is expensive, risky, or impossible (think logged interaction data).

The basic tool is importance sampling. Suppose you have episodes generated by behavior policy μ, but you want to estimate the expected return of target policy π. For a trajectory τ = (s0,a0,s1,a1,...), you weight its return G(τ) by the likelihood ratio:

w(τ) = Π_t π(a_t|s_t) / μ(a_t|s_t)

Then an estimator is the average of w(τ) * G(τ). In practice, raw (ordinary) importance sampling can have huge variance because w(τ) can explode when μ assigns small probability to actions that π prefers. A common mitigation is weighted (normalized) importance sampling, dividing by the sum of weights to reduce variance (at the cost of some bias).
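Both estimators can be sketched together; is_estimates is a hypothetical name, and pi(a, s) / mu(a, s) are assumed to return action probabilities under the target and behavior policies:

```python
def is_estimates(trajectories, pi, mu):
    """Ordinary and weighted importance-sampling estimates of E_pi[G].

    Each trajectory is (steps, G): steps is a list of (s, a) pairs and
    G is the observed return under the behavior policy.
    """
    weights, returns = [], []
    for steps, G in trajectories:
        w = 1.0
        for s, a in steps:
            w *= pi(a, s) / mu(a, s)       # product of likelihood ratios
        weights.append(w)
        returns.append(G)
    total = sum(w * g for w, g in zip(weights, returns))
    ordinary = total / len(weights)
    weighted = total / sum(weights) if sum(weights) > 0 else 0.0
    return ordinary, weighted
```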

Practical guidance for tabular experiments:

  • Track μ(a|s) explicitly. With ε-greedy behavior, μ is known: the greedy action gets probability (1-ε) + ε/|A| and the others get ε/|A| (ties need careful handling).
  • Do not attempt OPE if π takes actions that μ almost never takes. Your estimates will be dominated by a few trajectories.
  • Prefer per-decision importance sampling (weighting step-by-step) when episodes are long; it can reduce variance relative to full-trajectory products.

This section meets the milestone “apply off-policy evaluation basics and importance sampling” and gives you a reality check: off-policy learning is easy to write down, but off-policy evaluation is statistically delicate. When you later use replay buffers, the same mismatch issues appear—only disguised as “training instability.”

Section 5.6: Debugging RL: metrics, baselines, and ablations

When Q-learning fails, it rarely fails loudly. You might get a flat learning curve, oscillating returns, or a policy that “works” in training but collapses at evaluation. A disciplined debugging approach turns RL from folklore into engineering. This section targets the milestone “diagnose failures (divergence, reward hacking, sparse rewards)” with a toolkit you can reuse throughout the course.

Log the right metrics. At minimum: episodic return (train and greedy-eval), episode length, the current ε value, mean/max Q-value, TD error statistics (mean absolute TD), and the fraction of actions that are greedy vs random. If Q-values explode while returns do not improve, suspect a too-large α, a bug in done handling, or maximization bias in a stochastic environment.

Use baselines. Always compare against (1) random policy, (2) a simple heuristic if available, and (3) SARSA with the same exploration schedule. If Q-learning underperforms SARSA consistently, the issue is often aggressive bootstrapping or overestimation; Double Q-learning is a targeted fix.

Ablations identify what matters. Change one thing at a time: fixed ε vs decayed ε, an α sweep, reward clipping/scaling, removing terminal masking, swapping the max with the Double Q target. Ablations prevent you from attributing gains to the wrong feature.

Common failure patterns and fixes:

  • Divergence/instability: reduce α, ensure terminal masking, verify state indexing, add Double Q-learning, or slow the ε decay so exploration reduces premature commitment to bad values.
  • Reward hacking: the agent exploits loopholes in reward design (e.g., looping to collect small rewards). Fix by redefining reward to match the true objective, adding termination penalties, or shaping carefully while checking that the optimal policy is unchanged.
  • Sparse rewards: returns stay near zero because success is rarely discovered. Fix with longer exploration (higher ε_min), optimistic initialization of Q, a curriculum (easier starts), or shaped rewards, then validate with ablations to ensure you did not change the task.

Practical outcome: you can now treat your tabular Q-learning implementation as a testbed. If you can’t make it learn reliably in a small MDP with solid logs and baselines, scaling to DQN later will be painful. Conversely, if you can diagnose and fix these issues here, you will recognize the same patterns when function approximation enters the picture.

Chapter milestones
  • Milestone: Implement tabular Q-learning for control
  • Milestone: Understand and mitigate maximization bias
  • Milestone: Apply off-policy evaluation basics and importance sampling
  • Milestone: Tune exploration schedules and learning rates systematically
  • Milestone: Diagnose failures (divergence, reward hacking, sparse rewards)
Chapter quiz

1. What is the key idea that makes Q-learning an off-policy control method in this chapter?

Show answer
Correct answer: It can learn a greedy target policy from data collected by a different exploratory behavior policy
Off-policy learning allows using trajectories from one policy (behavior) to improve another (target), enabling exploratory data collection while learning a greedy control policy.

2. Why is maximization bias a concern in vanilla Q-learning?

Show answer
Correct answer: The max over noisy value estimates can systematically overestimate action values
Taking a max over imperfect estimates tends to pick overestimated actions, creating optimistic value bias that can mislead learning.

3. In the chapter’s framing, what is the purpose of importance sampling in basic off-policy evaluation?

Show answer
Correct answer: To reweight returns so evaluation reflects a target policy despite data coming from a different behavior policy
Importance sampling adjusts for the mismatch between behavior and target policies when estimating performance from off-policy data.

4. Which practice best matches the chapter’s guidance on tuning exploration schedules and learning rates?

Show answer
Correct answer: Choose and adjust schedules systematically rather than treating them as afterthoughts
The chapter emphasizes deliberate, systematic tuning of exploration and learning-rate schedules as part of building a reliable training system.

5. Which set best reflects the chapter’s highlighted failure modes to diagnose when training goes wrong?

Show answer
Correct answer: Divergence, reward hacking, and sparse rewards
The chapter specifically calls out divergence, reward hacking, and sparse rewards as common RL failure modes needing diagnosis and mitigation.

Chapter 6: Deep Reinforcement Learning with DQN and Beyond

So far, you have learned how to compute or learn value functions in tabular settings: when states (and actions) can be enumerated and stored in arrays. Chapter 6 is where we cross the bridge to modern reinforcement learning: we replace tables with neural networks and train them with data generated by interaction. This shift unlocks problems with large or continuous observation spaces (images, sensor vectors, game RAM), but it also introduces new failure modes: instability, divergence, sensitivity to hyperparameters, and subtle bugs that “kind of work” but never reach strong performance.

This chapter is organized as a practical engineering workflow. First, you will build a neural Q-function approximator in PyTorch. Then you will assemble the canonical DQN training system: replay buffer plus target network. Next, you will implement correct targets and a training loop that respects the off-policy nature of Q-learning. Finally, you will stabilize learning with normalization, clipping, and schedules, and you will evaluate generalization and robustness across random seeds. We end with a map of what comes next (policy gradients and actor-critic), so you understand DQN’s place in the broader deep RL ecosystem.

  • Milestone: Build a neural Q-function approximator in PyTorch
  • Milestone: Train a DQN with replay buffer and target network
  • Milestone: Stabilize learning with normalization, clipping, and schedules
  • Milestone: Evaluate generalization and robustness across seeds
  • Milestone: Map next steps: policy gradients and actor-critic overview

As you read, keep a mental model of what can go wrong: (1) targets that depend on the same network you are updating, (2) correlated samples causing destructive gradient steps, (3) exploding Q-values and gradients, and (4) evaluation that accidentally measures “lucky seeds” instead of reliable learning. DQN is not hard because it is mathematically complex; it is hard because it is an engineered system with interacting parts.

Practice note for each milestone above: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Function approximation and why tabular methods break
Section 6.2: DQN components: network, replay buffer, target network
Section 6.3: Loss functions, target computation, and training loops
Section 6.4: Practical stability: gradient clipping, reward scaling, Huber loss
Section 6.5: Extensions: Double DQN, Dueling DQN, Prioritized Replay
Section 6.6: Where to go next: REINFORCE, PPO, and actor-critic intuition

Section 6.1: Function approximation and why tabular methods break

Tabular methods assume you can store one value per state (or per state-action pair). That works in Gridworld or small MDPs, but it collapses in realistic environments: the “state” might be an 84×84×4 image stack, a 17-dimensional robot observation vector, or a continuous pose. Even if the state space is technically finite, it is too large to visit often enough for a table to fill in. The practical symptom is sparse coverage: most entries are never updated, and generalization is impossible.

Function approximation solves this by learning a parameterized function, typically a neural network, that maps observations to values. For DQN, the network approximates the action-value function: Q(s, a; θ). Instead of a table row per state, we compute a vector of Q-values for all discrete actions given a state. This is the first milestone: build a neural Q-function approximator in PyTorch that takes an observation tensor and returns a Q-value per action.

  • Discrete actions: DQN outputs one Q-value per action. If you have continuous actions, you are in actor-critic territory (later in this chapter).
  • Input preprocessing: normalize observations (e.g., divide pixel values by 255.0) and ensure consistent shapes (batch dimension first).
  • Output shape: for batch size B and action count A, output should be (B, A).

Common mistakes: forgetting to add the batch dimension (leading to shape bugs), using a network with the wrong output size, and accidentally detaching tensors (breaking gradients). Another conceptual pitfall is expecting deep networks to “fix” poor exploration. Function approximation generalizes, but it cannot learn good values for regions of state space the agent never visits. So as you move beyond tables, you must treat data collection and exploration strategy as first-class design choices.
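A minimal PyTorch sketch of such a Q-network, assuming a flat observation vector and discrete actions (the class name, hidden size, and layer choices are illustrative, not prescribed by the course):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP mapping an observation vector to one Q-value per discrete action."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # output size = number of actions
        )

    def forward(self, obs):
        # obs: (B, obs_dim) with batch dimension first; returns (B, A)
        return self.net(obs)

q = QNetwork(obs_dim=4, n_actions=2)
batch = torch.zeros(32, 4)       # don't forget the batch dimension
q_values = q(batch)              # shape (32, 2)
```

A quick shape assertion like `assert q_values.shape == (batch.shape[0], n_actions)` right after construction catches the most common shape bugs before any training starts.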

Section 6.2: DQN components: network, replay buffer, target network

A working DQN agent is more than “Q-learning with a neural net.” The algorithm relies on three components that collectively stabilize learning: an online Q-network, a replay buffer, and a target network. Your second milestone is to train a DQN with replay buffer and target network, which means wiring these pieces together correctly and resisting the temptation to simplify away the parts that prevent divergence.

Online Q-network (parameters θ) is the model you update with gradient descent. The agent acts from it via an exploration policy, usually ε-greedy: with probability ε pick a random action, otherwise pick argmax_a Q(s,a;θ). Keep action selection in torch.no_grad() and switch the network to eval mode if you use dropout or batch norm (most DQN baselines avoid them, but engineering reality varies).
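A hedged sketch of that action-selection step (the function name and signature are assumptions for illustration):

```python
import random
import torch

def select_action(q_net, obs, epsilon, n_actions):
    """ε-greedy: random action with probability ε, else greedy argmax
    under the online network. No gradients are needed while acting."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(obs.unsqueeze(0))        # add batch dimension
        return int(q_values.argmax(dim=1).item())
```

Wrapping the forward pass in `torch.no_grad()` avoids building an autograd graph for every environment step, which would waste memory for no benefit.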

Replay buffer stores transitions (s, a, r, s′, done). Sampling uniformly from this buffer breaks the short-term correlation present in sequential experience and improves data efficiency: each interaction can be replayed many times. Practical buffer details matter: store observations as compact dtypes (e.g., uint8 for pixels), pre-allocate arrays for speed, and track a circular pointer for constant-time insertion.
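The pre-allocated, circular-pointer design described above might look like this in NumPy (a minimal sketch; the class and field names are illustrative):

```python
import numpy as np

class ReplayBuffer:
    """Pre-allocated circular buffer for (s, a, r, s', done) transitions."""
    def __init__(self, capacity, obs_dim):
        self.capacity = capacity
        self.obs      = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions  = np.zeros(capacity, dtype=np.int64)
        self.rewards  = np.zeros(capacity, dtype=np.float32)
        self.dones    = np.zeros(capacity, dtype=np.float32)
        self.ptr = 0     # circular insertion pointer: constant-time adds
        self.size = 0

    def add(self, s, a, r, s2, done):
        i = self.ptr
        self.obs[i], self.actions[i], self.rewards[i] = s, a, r
        self.next_obs[i], self.dones[i] = s2, float(done)
        self.ptr = (self.ptr + 1) % self.capacity   # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # uniform sampling (with replacement) breaks short-term correlation
        idx = np.random.randint(0, self.size, size=batch_size)
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])
```

For pixel observations, you would store `uint8` arrays here and convert to float only at sampling time, keeping the buffer's memory footprint small.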

Target network (parameters θ⁻) is a periodically updated copy of the online network. Targets for Q-learning depend on max_a′ Q(s′,a′), and if you compute those targets from the same network you are updating, you get a moving target that can destabilize training. The target network “slows down” target drift. In practice you either (1) hard-update every N steps (θ⁻ ← θ), or (2) soft-update with Polyak averaging (θ⁻ ← τθ + (1-τ)θ⁻).
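Both update styles are a few lines in PyTorch. This is a minimal sketch; the function names are assumptions:

```python
import torch
import torch.nn as nn

def hard_update(target_net, online_net):
    """Hard update every N steps: θ⁻ ← θ."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging every step: θ⁻ ← τθ + (1 − τ)θ⁻."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

online = nn.Linear(4, 2)
target = nn.Linear(4, 2)
hard_update(target, online)   # the two networks now match exactly
```

Remember to hard-update once at initialization so the target network starts identical to the online network rather than at its own random weights.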

  • Engineering judgement: start learning only after the buffer has a minimum number of steps (warm-up), otherwise early updates overfit tiny, unrepresentative data.
  • Batching: sample batches of transitions and convert to tensors on the correct device (CPU/GPU) once per update.
  • Done handling: store done explicitly; mistakes here silently corrupt targets.

When DQN “does nothing,” the usual culprits are (a) ε never decays (agent stays random), (b) target network never updates (stale learning signal), or (c) replay sampling returns wrong shapes/types (training runs but gradients are nonsense). Treat these components as testable units: write small assertions for shapes, dtypes, and value ranges before you chase hyperparameters.

Section 6.3: Loss functions, target computation, and training loops

At the heart of DQN is the temporal-difference target for Q-learning. For a batch of transitions, the target for each sample is:

y = r + γ * (1 - done) * max_a' Q_target(s', a'; θ⁻)

The prediction is the online network’s Q-value for the action actually taken:

q = Q_online(s; θ).gather(1, a)

Then you minimize a regression loss between q and y. This section is where many implementations quietly go wrong, because the code “runs” even when the math is subtly incorrect. First, ensure a is shaped (B,1) and is a long tensor so that gather selects the correct column. Second, compute max_a' under torch.no_grad() so gradients do not flow into the target network or the target computation.
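The target and prediction can be checked on a tiny hand-constructed batch before any real training. The numbers below are arbitrary illustration values:

```python
import torch

# Hypothetical batch of B = 3 transitions with A = 2 actions.
gamma = 0.99
r    = torch.tensor([1.0, 0.0, 2.0])
done = torch.tensor([0.0, 1.0, 0.0])
q_next = torch.tensor([[0.5, 1.5],    # Q_target(s', ·) for each sample
                       [2.0, 0.0],
                       [1.0, 3.0]])

with torch.no_grad():                  # no gradients flow into the target
    y = r + gamma * (1 - done) * q_next.max(dim=1).values

q_all = torch.tensor([[0.2, 0.8],     # Q_online(s, ·)
                      [0.4, 0.6],
                      [1.0, 1.2]])
a = torch.tensor([[1], [0], [1]])     # actions taken: shape (B, 1), dtype long
q = q_all.gather(1, a).squeeze(1)     # Q-value of the action actually taken
```

Note how the second sample's target is just its reward (0.0): the `(1 - done)` mask zeroes out the bootstrap term at terminal states.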

The training loop is typically organized around environment steps rather than episodes: collect one step of experience, push it to replay, and every k steps perform one or more gradient updates. This decoupling matters because updates are off-policy: you are learning about the greedy policy while behaving ε-greedily. A practical pattern is:

  • Interact with env for 1 step using ε-greedy action selection.
  • Store transition in replay buffer.
  • If replay has enough samples and step % update_freq == 0: sample a batch and optimize.
  • If step % target_update_freq == 0: update target network.

Common mistakes: forgetting the (1 - done) mask (causes bootstrapping past terminal states), mixing float and int dtypes (especially for rewards and dones), and using the wrong discount γ (e.g., 0.0 or 1.0 by accident). Another subtle issue is reward timing: ensure the reward r corresponds to the transition from s to s′ via action a, not the next step’s reward due to a misplaced assignment.

Practical outcome: by the end of this section, you should be able to train a basic DQN on a small discrete-control benchmark (like CartPole) and see learning curves improve beyond random. It may not be perfectly stable yet—that is the point of the next section.

Section 6.4: Practical stability: gradient clipping, reward scaling, Huber loss

Deep RL is notoriously sensitive because small distribution shifts in experience can create large changes in targets, which creates large gradients, which changes the policy, which changes the data. Your third milestone is to stabilize learning with normalization, clipping, and schedules—simple tools that dramatically reduce “training collapse.”

Huber loss (a.k.a. Smooth L1) is the default loss for many DQN implementations because it behaves like MSE near zero error but becomes linear for large errors, reducing the influence of outliers. In PyTorch, this is torch.nn.SmoothL1Loss. If you see occasional huge TD errors (common early in training), Huber often prevents a single batch from destroying progress.

Gradient clipping caps the norm (or value) of gradients before the optimizer step. A typical choice is global norm clipping at 10.0. Clipping is not a substitute for fixing bugs, but it is a reliable guardrail when Q-values spike. Implement it right before optimizer.step() with torch.nn.utils.clip_grad_norm_.
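Putting Huber loss and norm clipping together in an update step looks roughly like this (a sketch with a stand-in linear "network"; real code would use your Q-network and targets from the target network):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 2)                        # stand-in for the online Q-network
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.SmoothL1Loss()                  # Huber: quadratic near 0, linear for large errors

obs = torch.randn(8, 4)
actions = torch.zeros(8, 1, dtype=torch.long)
q = net(obs).gather(1, actions).squeeze(1)
y = torch.randn(8)                           # stand-in targets
loss = loss_fn(q, y)

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)  # clip before step
opt.step()
```

The ordering matters: clipping must happen after `backward()` and before `optimizer.step()`, otherwise it has no effect on the update.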

Reward scaling and clipping control the magnitude of targets. In some domains (notably Atari), clipping rewards to [-1, 1] historically improved stability by limiting target variance. In other domains, reward clipping can remove meaningful scale. Engineering judgement: if rewards have heavy tails or occasional large spikes, consider clipping; if the reward scale encodes task difficulty (e.g., continuous control costs), prefer normalization or careful learning rates.

  • Schedules: decay ε from 1.0 to a smaller value over many steps; consider learning-rate schedules if training oscillates.
  • Observation normalization: for vector inputs, track running mean/variance and normalize; for pixels, scale to [0,1].
  • Target update frequency: too frequent can destabilize; too infrequent can slow learning. Start with a few hundred to a few thousand environment steps.
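The running mean/variance tracking mentioned for vector inputs can be implemented with Welford's online algorithm. This is a minimal sketch (class and field names are assumptions); the key discipline is applying the same statistics to both current and next states:

```python
import numpy as np

class RunningNorm:
    """Track running mean/variance of vector observations (Welford's method)
    and normalize observations with the current statistics."""
    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)      # running sum of squared deviations
        self.count = 0
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)
```

Update the statistics only on freshly collected observations, then normalize s and s′ with the same snapshot when building a training batch.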

Common mistakes: applying normalization inconsistently between current states and next states, clipping rewards in evaluation (evaluation should reflect the true environment return), and using too-large batch sizes that hide instability until it is severe. Stability is not one magic trick; it is the cumulative effect of several small, defensible choices.

Section 6.5: Extensions: Double DQN, Dueling DQN, Prioritized Replay

Once vanilla DQN works, you can usually get noticeable gains with three extensions. Each addresses a known weakness of the baseline, and each is worth understanding even if you do not implement all of them immediately.

Double DQN reduces overestimation bias. In standard DQN, the max operator uses the same values to select and evaluate actions, which tends to inflate Q-values. Double DQN splits selection and evaluation: choose the best action using the online network, but evaluate it using the target network:

a* = argmax_a Q_online(s', a; θ)
y = r + γ * (1 - done) * Q_target(s', a*; θ⁻)

This change is small in code but often big in stability, especially when rewards are noisy.
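In tensor form the Double DQN target is only a couple of lines different from the vanilla target (illustrative values below):

```python
import torch

gamma = 0.99
r    = torch.tensor([1.0, 0.0])
done = torch.tensor([0.0, 1.0])
q_online_next = torch.tensor([[0.9, 0.1],   # online network selects a*...
                              [0.2, 0.8]])
q_target_next = torch.tensor([[0.5, 2.0],   # ...target network evaluates it
                              [1.0, 3.0]])

with torch.no_grad():
    a_star = q_online_next.argmax(dim=1, keepdim=True)   # selection: online net
    q_eval = q_target_next.gather(1, a_star).squeeze(1)  # evaluation: target net
    y = r + gamma * (1 - done) * q_eval
```

Compare with vanilla DQN, which would instead take `q_target_next.max(dim=1).values`, coupling selection and evaluation in the same (noisy) estimates.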

Dueling DQN changes the network architecture to separately estimate a state value V(s) and an advantage A(s,a), then combine them into Q(s,a). Intuition: in many states, the choice of action barely matters; learning V(s) can be easier than learning each Q(s,a) independently. Practically, dueling heads can speed learning on tasks with many similar actions.
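A common way to combine the two streams subtracts the mean advantage, so V and A are identifiable (otherwise a constant could shift freely between them). A minimal sketch, with illustrative names:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Combine V(s) and A(s,a) into Q(s,a) = V + A - mean(A)."""
    def __init__(self, in_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(in_dim, 1)         # V(s): one scalar per state
        self.adv = nn.Linear(in_dim, n_actions)   # A(s,a): one value per action

    def forward(self, h):
        v = self.value(h)                          # (B, 1)
        a = self.adv(h)                            # (B, A)
        return v + a - a.mean(dim=1, keepdim=True) # (B, A)
```

With the mean subtracted, the per-state average of Q over actions equals V(s), which gives you a cheap sanity check on the architecture.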

Prioritized Experience Replay samples transitions with probability proportional to their TD error magnitude, focusing updates on “surprising” experiences. This can improve sample efficiency, but it introduces bias that must be corrected with importance-sampling weights. Engineering judgement: prioritized replay adds complexity (data structures, hyperparameters α and β). Use it when you are bottlenecked on environment interactions, not when you are still debugging the basic loop.
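The sampling probabilities and importance-sampling weights can be sketched without the full sum-tree data structure (this naive O(N) version is for intuition only; the function name and constants are assumptions):

```python
import numpy as np

def priority_sample(td_errors, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices with probability ∝ |TD error|^α and return
    importance-sampling weights (N·p_i)^(-β), normalized by the max."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = (np.abs(td_errors) + 1e-6) ** alpha   # small constant avoids zero priority
    p /= p.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=p)
    w = (len(td_errors) * p[idx]) ** (-beta)  # corrects the sampling bias
    return idx, w / w.max()                   # normalize weights for stability
```

The weights `w` multiply each sample's loss term, down-weighting transitions that were oversampled; β is typically annealed toward 1 over training.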

  • Robustness across seeds: when you add extensions, rerun multiple seeds. Improvements that appear on one seed may vanish on average.
  • Keep baselines: always compare against your working DQN; extensions should not break learning.

This section pairs naturally with the milestone of evaluating generalization and robustness across seeds. Deep RL curves are noisy; good practice is to report mean and variability across seeds, and to save checkpoints so you can inspect regressions when an “improvement” silently worsens stability.

Section 6.6: Where to go next: REINFORCE, PPO, and actor-critic intuition

DQN is a value-based method for discrete actions. Its core idea is: learn Q(s,a), then act greedily (with exploration). This breaks down when actions are continuous (you cannot argmax over infinite actions) or when stochastic policies are preferable (e.g., for entropy, exploration, or multi-modal behavior). The next family of methods optimizes the policy directly.

REINFORCE is the simplest policy-gradient algorithm: sample trajectories, compute returns, and push policy parameters in the direction that increases log-probability of actions that led to higher return. It is conceptually clean but high variance; in practice you almost always add a baseline (often a value function) to reduce variance.
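The core REINFORCE update with a baseline reduces to one loss expression. A toy sketch with hand-picked numbers (in practice, `log_probs` comes from the policy network and returns from sampled trajectories):

```python
import torch

# Toy batch: log-probabilities of the taken actions and their episode returns.
log_probs = torch.tensor([-0.2, -1.5, -0.7], requires_grad=True)
returns   = torch.tensor([2.0, -1.0, 0.5])
baseline  = returns.mean()          # simple variance-reducing baseline

# Minimizing this loss ascends the policy gradient: it increases the
# log-probability of actions whose return beats the baseline.
loss = -(log_probs * (returns - baseline)).mean()
loss.backward()
```

Actions with above-baseline returns get negative gradients on their log-probabilities (pushed up), below-baseline actions get pushed down, and the baseline itself does not bias the gradient in expectation.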

Actor-critic combines an actor (policy) and a critic (value function). The critic estimates how good states or actions are, and the actor uses that estimate to update the policy with lower variance than pure REINFORCE. This is the bridge between what you already know (value learning) and policy optimization.

PPO (Proximal Policy Optimization) is a widely used actor-critic method that adds a practical constraint: policy updates should not change the policy too much in one step. PPO’s “clipped objective” is an engineering solution to a common deep RL issue: large, destructive policy updates. Compared to DQN, PPO tends to be easier to apply across many continuous-control tasks, but it is on-policy and can be less sample-efficient.

  • How to choose: use DQN-style methods for discrete actions and when replay-based sample efficiency matters; consider PPO/actor-critic for continuous actions or when stable policy learning is the priority.
  • What carries over: seeds, logging, normalization, clipping, and careful evaluation remain essential.

Practical outcome: by finishing this chapter, you should be able to implement and debug a DQN training system, stabilize it with standard tricks, and evaluate it responsibly across seeds. You should also understand why the field often transitions from DQN to actor-critic methods as tasks become more complex—and you will be ready to make that transition without losing the disciplined workflow you built here.

Chapter milestones
  • Milestone: Build a neural Q-function approximator in PyTorch
  • Milestone: Train a DQN with replay buffer and target network
  • Milestone: Stabilize learning with normalization, clipping, and schedules
  • Milestone: Evaluate generalization and robustness across seeds
  • Milestone: Map next steps: policy gradients and actor-critic overview
Chapter quiz

1. What key capability does Chapter 6 gain by replacing tabular value functions with neural networks?

Show answer
Correct answer: Handling large or continuous observation spaces by learning from interaction-generated data
Neural function approximation enables learning in large/continuous observation spaces (e.g., images, sensor vectors) using data from interaction.

2. Why does the canonical DQN system include both a replay buffer and a target network?

Show answer
Correct answer: To reduce instability from correlated samples and from targets depending on the same network being updated
Replay buffers help break correlations in samples, and target networks help keep training targets from shifting too rapidly with the online network.

3. Which training-loop concern is emphasized as essential when implementing DQN targets?

Show answer
Correct answer: Ensuring the training loop respects the off-policy nature of Q-learning
DQN is based on Q-learning, which is off-policy, so targets and updates must be implemented in a way consistent with off-policy learning.

4. Which set best matches the stabilization techniques highlighted in the chapter?

Show answer
Correct answer: Normalization, clipping, and schedules
The chapter specifically calls out normalization, clipping, and schedules as practical tools to stabilize deep RL training.

5. What is the main reason the chapter stresses evaluating across multiple random seeds?

Show answer
Correct answer: To check generalization and robustness rather than accidentally measuring performance from “lucky seeds”
Deep RL can be sensitive to randomness; evaluating across seeds helps ensure learning is reliable rather than a fluke.