Reinforcement Learning — Beginner
Build RL agents from first principles to deep Q-networks in Python.
Reinforcement learning (RL) is the toolkit for building agents that learn by trial and error—optimizing decisions over time rather than predicting labels. This book-style course takes you from first principles to a working Deep Q-Network (DQN) pipeline in Python, with a strong emphasis on clarity, clean experimentation, and practical debugging. You’ll start with the basic agent–environment loop and end with modern deep RL building blocks like replay buffers and target networks.
The progression is intentional: you’ll first learn to think in RL (returns, policies, value), then formalize problems as Markov decision processes (MDPs), then solve small MDPs with planning (dynamic programming), then move to model-free learning (Monte Carlo and temporal-difference), then master Q-learning, and finally scale up to deep function approximation with DQN.
Across the six chapters, you’ll repeatedly practice the same professional workflow: define the task, choose an algorithm, implement a minimal baseline, run controlled experiments, and diagnose failure modes. You’ll learn why RL can be unstable, how to interpret learning curves, and how to make improvements that are actually attributable to a change (not randomness).
Each chapter reads like a compact technical chapter with milestones and sub-sections that build directly on what came before. You won’t jump into deep networks until you can explain what a value function is and why the Bellman target looks the way it does. By the time you reach DQN, concepts like off-policy learning, bootstrapping, and maximization bias will already be familiar—so the deep RL “magic” becomes a set of understandable design choices.
If you’re ready to learn reinforcement learning step by step—without skipping fundamentals—start here and follow the path chapter by chapter.
By the end, you’ll be able to implement and evaluate core RL algorithms, choose appropriate baselines, and reason about why an agent is (or isn’t) learning. Most importantly, you’ll have a durable framework for learning more advanced methods like PPO and actor-critic approaches with confidence.
Senior Machine Learning Engineer (Reinforcement Learning)
Dr. Maya Chen is a senior machine learning engineer specializing in reinforcement learning for decision-making systems and simulation-based optimization. She has shipped RL prototypes in robotics and operations research workflows and mentors engineers on reliable experimentation and evaluation.
Reinforcement Learning (RL) is the branch of machine learning focused on learning by interaction: an agent takes actions in an environment, receives feedback (rewards), and gradually improves its behavior. This chapter builds the mental model you will use for the rest of the course: translating messy real tasks into a clean agent–environment loop, computing returns with discounting, defining policies and value functions, and developing the “Bellman intuition” that underpins dynamic programming, Monte Carlo, temporal-difference learning, and Deep Q-Networks.
Along the way, you’ll also establish practical engineering habits—reproducible seeds, lightweight logging, and simple plots—so you can tell whether an improvement is real or accidental. You will set up a small Python toolkit and implement a first toy environment with a Gymnasium-style API, because the fastest way to understand RL is to run the loop yourself and observe learning curves.
By the end of this chapter you should be able to look at a task (playing a game, controlling a robot, recommending content, scheduling jobs) and express it as an MDP-style loop with states, actions, rewards, and termination conditions. That framing is the foundation for everything that follows: dynamic programming on tabular MDPs, Monte Carlo and TD learning, SARSA and Q-learning, and eventually DQN.
Practice note for Milestone: Translate real tasks into agent–environment loops: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Compute returns and understand discounting: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Define policies, value functions, and the Bellman intuition: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Set up your RL Python toolkit and first toy environment: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Establish evaluation habits (seeds, logging, plots): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Supervised learning gives you a fixed dataset of input–output pairs and asks you to learn a mapping. Reinforcement learning rarely has that luxury. Instead, the data you learn from is generated by your current policy. When you change the policy, you change the distribution of future experiences. This feedback loop makes RL feel less like “fit a function” and more like “run an experiment, observe outcomes, revise the experiment.”
Three differences matter most in practice. First, RL deals with delayed consequences: an action can hurt you much later, so you need a way to assign credit over time. Second, RL involves exploration: you must try actions that may be worse short-term to discover better long-term strategies. Third, RL learning is often non-stationary because your agent’s behavior changes, and with it the data you see.
Throughout this course, you will keep returning to one question: “What does the agent control, what does the environment control, and what data will my current policy generate?” Framing the loop correctly is the first milestone and the biggest multiplier on your debugging speed.
An RL problem is an agent–environment loop. At time step t, the environment provides a state (or observation) st. The agent chooses an action at. The environment transitions to a new state st+1 and emits a reward rt+1. This repeats until the episode ends (or forever in continuing tasks). Your job is to make this loop explicit for any real task you encounter.
In many textbooks, the environment is formalized as a Markov Decision Process (MDP): transitions depend only on the current state and action, not the full history. That “Markov property” is not a philosophical statement—it’s a modeling choice. If your state omits important context (e.g., velocity in control, inventory in scheduling), your environment will appear non-Markov and learning will be harder. When in doubt, enrich the state with the missing information or use history-based observations later.
Milestone: translate a task into the loop by writing down (1) what the agent observes, (2) what it can do, (3) when it receives reward, and (4) what ends an episode. If you cannot define these unambiguously, you don’t yet have an RL problem—you have a product requirement. The act of specifying the loop is often where hidden assumptions and edge cases surface.
RL tasks come in two flavors. Episodic tasks have a natural terminal state: a game ends, a robot falls, a trading day closes. Continuing tasks run indefinitely: thermostat control, server autoscaling, ongoing recommendation. This distinction matters because it changes how we measure success and how we compute the objective the agent should maximize.
The central quantity is the return, the total reward accumulated from a time step onward. In episodic settings, the undiscounted return is often written as Gt = rt+1 + rt+2 + … + rT. In continuing or long-horizon settings, we typically use a discount factor γ (gamma), with 0 ≤ γ < 1, to define Gt = rt+1 + γ rt+2 + γ² rt+3 + …. Discounting makes the infinite sum finite and emphasizes near-term outcomes.
Milestone: compute returns and understand discounting. In code, that means being able to take a trajectory of rewards and compute Gt for each step, and being able to explain how changing γ changes what “good behavior” means. This is the first step toward value functions and Bellman equations.
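The return computation can be sketched as a small function. This is a minimal illustration, assuming rewards are stored as a list where rewards[k] is the reward received after the action at step k:

```python
def compute_returns(rewards, gamma):
    """Compute the discounted return G_t for every step of a trajectory.

    rewards[k] is the reward received after the action at step k, so
    G_k = rewards[k] + gamma * rewards[k+1] + gamma^2 * rewards[k+2] + ...
    Iterating backward turns the naive O(T^2) sum into a single O(T) pass.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for k in reversed(range(len(rewards))):
        g = rewards[k] + gamma * g
        returns[k] = g
    return returns

# Example: three rewards with gamma = 0.5.
# G_2 = 4, G_1 = 2 + 0.5*4 = 4, G_0 = 1 + 0.5*4 = 3.
print(compute_returns([1.0, 2.0, 4.0], gamma=0.5))  # [3.0, 4.0, 4.0]
```

Try varying gamma on the same reward list: with γ near 0 the early returns barely see the final reward, which is exactly the "near-term emphasis" described above.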
A policy tells the agent what to do. In the simplest case, it’s a function mapping states to actions. More generally, a stochastic policy maps states to a distribution over actions: π(a|s). Stochasticity is not just for randomness; it can be essential for exploration and for optimality in some environments.
A value function measures how good it is to be in a state (or take an action in a state) under a policy. The state-value function is Vπ(s) = E[Gt | st=s, π]. The action-value function is Qπ(s,a) = E[Gt | st=s, at=a, π]. Values are predictions of future return; policies use those predictions to choose actions.
The “Bellman intuition” is that values satisfy a self-consistency relationship: today’s value equals today’s reward plus discounted value of tomorrow. In expectation form, V(s) is roughly E[r + γ V(s′)]. This idea powers dynamic programming (value iteration, policy iteration) when you know the model, and it also motivates learning rules like temporal-difference updates when you don’t.
Milestone: define policies and value functions clearly enough that you could implement tabular dynamic programming for a tiny MDP. You will do exactly that in the next chapters: first with known transitions (planning), then with sampled experience (learning).
Reward design is where RL meets reality. A reward is a proxy for your true objective, and proxies can be exploited. If you reward “move forward,” an agent may spin its wheels in place to trigger sensors. If you reward “time alive,” it may learn to hide. The goal is not to design a perfect reward—it’s to design one that leads to the behavior you want under optimization pressure.
Credit assignment is the challenge of figuring out which earlier actions caused a later outcome. Sparse terminal rewards (win/lose) are clean but hard to learn from; dense shaping rewards provide guidance but risk changing the optimal policy if not done carefully. A practical approach is iterative: start with a reward that matches the task definition, add minimal shaping that reflects progress, then verify the agent isn’t finding loopholes.
This course will repeatedly revisit exploration and tuning. For now, adopt an engineering mindset: write down the intended behavior, list plausible failure modes, and add diagnostics (episode length, reward components, success rate). These habits make later algorithms—SARSA, Q-learning, DQN—much easier to evaluate honestly.
RL becomes concrete when you can run the loop. Use Python 3.10+ and create an isolated environment (venv or conda). Install core packages you will rely on throughout: NumPy for arrays, matplotlib for plots, and a Gymnasium-compatible API for environments (either Gymnasium itself or simple custom environments that mimic its interface). Even if you later move to deep learning, the same structure persists.
A Gymnasium-style environment typically exposes reset(seed=...) and step(action). reset returns an initial observation (and info). step returns (obs, reward, terminated, truncated, info). Your agent loop collects these into trajectories. For a first toy environment, implement something tiny you can reason about, such as a 1D random-walk to a goal. Keep the state space small so you can print tables and verify behavior.
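A minimal sketch of such a toy environment follows. The class name and reward scheme here are illustrative choices, not a Gymnasium built-in; the point is the reset/step interface:

```python
import random

class RandomWalkEnv:
    """Tiny 1D walk with a Gymnasium-style API.

    States are integer positions 0..n-1; the agent starts in the middle.
    Action 0 moves left, action 1 moves right. Reaching position n-1
    yields reward +1 and terminates; position 0 terminates with reward 0.
    """
    def __init__(self, n=7):
        self.n = n
        self.pos = None
        self.rng = random.Random()

    def reset(self, seed=None):
        if seed is not None:
            self.rng.seed(seed)
        self.pos = self.n // 2
        return self.pos, {}  # observation, info

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        terminated = self.pos in (0, self.n - 1)
        reward = 1.0 if self.pos == self.n - 1 else 0.0
        return self.pos, reward, terminated, False, {}  # obs, reward, terminated, truncated, info

# Run one episode with a uniformly random policy.
env = RandomWalkEnv()
obs, info = env.reset(seed=0)
done, total = False, 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(env.rng.randint(0, 1))
    total += reward
    done = terminated or truncated
print("episode return:", total)
```

Because the state is a single integer, you can print a value table after a few hundred episodes and check it by hand, which is exactly the kind of verification this chapter recommends.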
Milestone: establish evaluation habits. From day one, run at least 3–10 seeds for any claim you make, and keep your plots simple and comparable. RL variance is real; disciplined evaluation is how you avoid “hero demos” that can’t be reproduced. With this toolkit and loop in place, you are ready to implement your first baseline agents in the next chapter and improve them methodically.
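One way to make the multi-seed habit concrete is to aggregate per-episode returns across seeds before plotting. In this sketch, run_trial is a stand-in for your own training loop (it just produces noisy scores), while summarize_runs is the reusable part:

```python
import numpy as np

def summarize_runs(curves):
    """Aggregate learning curves from several seeds.

    curves: array-like of shape (n_seeds, n_episodes), one row per seed.
    Returns the per-episode mean and standard deviation across seeds,
    which is what you would plot as a line with a shaded band.
    """
    curves = np.asarray(curves, dtype=float)
    return curves.mean(axis=0), curves.std(axis=0)

def run_trial(seed, n_episodes=5):
    # Stand-in for a real training run: returns noisy per-episode scores.
    rng = np.random.default_rng(seed)
    return [ep + rng.normal(0, 0.5) for ep in range(n_episodes)]

curves = [run_trial(seed) for seed in range(5)]
mean, std = summarize_runs(curves)
print("mean:", np.round(mean, 2))
print("std: ", np.round(std, 2))
```

A claim like "algorithm B beats algorithm A" is only credible when B's mean curve sits above A's band, not when one lucky seed does.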
1. Which framing best captures a real task as a reinforcement learning problem in this chapter?
2. What does it mean to compute a return with discounting?
3. In the chapter’s core definitions, what is a policy?
4. What is the purpose of developing “Bellman intuition” in this chapter?
5. Which practice best helps you tell whether a reported improvement is real rather than accidental, according to the chapter?
Reinforcement learning becomes much easier to reason about once you commit to a clean mathematical model of “what the world is” and “what the agent can do.” This chapter builds that model using Markov Decision Processes (MDPs) and then derives the Bellman equations, which are the backbone of dynamic programming, policy evaluation, and most value-based RL algorithms.
In practice, the biggest upgrade you will make as an RL engineer is to move from an intuitive description (“the agent navigates a grid and gets points”) to a precise specification you can compute with: a state space, an action set, transition probabilities, a reward function, and a discount factor. Once you have those pieces, you can compute the value of a policy numerically, compare policies, and reason about how greedy improvement leads toward optimal behavior.
This chapter’s milestones map directly onto a real workflow: formalize the RL task as an MDP; derive Bellman expectation equations for V and Q; use Bellman optimality to justify greedy improvement; implement a tiny MDP in Python and compute values; and finally, learn to recognize when your “state” is missing information and the Markov property breaks.
Practice note for Milestone: Formalize an RL task as an MDP: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Derive Bellman expectation equations for V and Q: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Use Bellman optimality to reason about greedy improvement: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Implement a small finite MDP and compute values numerically: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Validate assumptions and spot when Markov property breaks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
An MDP is the standard formal object for many RL problems. You specify a tuple (S, A, P, R, γ): a set of states S, a set of actions A, transition probabilities P(s'|s,a), a reward function R(s,a,s'), and a discount factor γ.
To hit the first milestone (“formalize an RL task as an MDP”), start from the environment description and force yourself to write each component explicitly. For a gridworld, S might be cell coordinates; A could be {up, down, left, right}; P could be deterministic or “slippery” with probability of unintended motion; R might be -1 per step and +10 at goal; and γ might be 0.99 to prioritize reaching the goal quickly while still valuing future outcomes.
Engineering judgment: keep the model as small as possible while staying faithful to the task. Huge state spaces make computation expensive; overly compressed state spaces can break the Markov property (covered in Section 2.2). Common mistakes include mixing up rewards and returns (reward is immediate; return is accumulated), or forgetting to define what happens at terminal states. A practical habit is to document terminal conditions and transitions out of terminal states (usually self-loop with zero reward) so your evaluation code doesn’t contain hidden exceptions.
Finally, interpret γ as a knob: lower values make learning and evaluation more stable and myopic; values close to 1 make the agent far-sighted but can increase variance and slow convergence. In continuing tasks without natural termination, γ is also what keeps the sum of future rewards finite.
The Markov property is the assumption that makes MDPs work: given the current state s and action a, the distribution of the next state and reward is independent of the past. Informally: the state must summarize everything from history that matters for the future.
This is not just theory; it is a design constraint on your state representation. If your state omits a variable that affects dynamics or reward, your value estimates will look noisy, inconsistent, or “non-learnable,” because the same state-action pair leads to different outcomes depending on hidden history. For example, imagine a robot with a battery that drains. If the state includes only position but not battery level, then the transition probability to “stuck” depends on how long you have been moving. That violates Markov, and your MDP model is wrong even if your code runs.
To validate Markov assumptions (a key milestone), run a simple diagnostic: collect many transitions from the same observed state-action pair and check if the empirical distribution of next states/rewards changes when you condition on earlier history. In tabular environments you can do this with counts; in continuous environments you can approximate with bins or a learned classifier that tries to predict next outcomes from history beyond s. If history improves prediction, your state is missing information.
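In a tabular setting, that diagnostic can be a few lines of counting. This sketch assumes transitions are logged as (prev_state, state, action, next_state) tuples; the function name and data are illustrative:

```python
from collections import Counter, defaultdict

def markov_check(transitions):
    """Compare next-state distributions with and without extra history.

    transitions: iterable of (prev_state, state, action, next_state).
    If the state is Markov, conditioning on prev_state should not change
    the empirical distribution of next_state given (state, action).
    """
    by_sa = defaultdict(Counter)    # keyed by (state, action)
    by_hist = defaultdict(Counter)  # keyed by (prev_state, state, action)
    for prev, s, a, s_next in transitions:
        by_sa[(s, a)][s_next] += 1
        by_hist[(prev, s, a)][s_next] += 1
    report = {}
    for (prev, s, a), counts in by_hist.items():
        total = sum(counts.values())
        base = by_sa[(s, a)]
        base_total = sum(base.values())
        # Max absolute difference in next-state probabilities.
        states = set(counts) | set(base)
        gap = max(abs(counts[x] / total - base[x] / base_total) for x in states)
        report[(prev, s, a)] = gap
    return report

# A non-Markov example: from state "B", the outcome depends on where we came from.
data = [("A", "B", "go", "WIN")] * 10 + [("C", "B", "go", "LOSE")] * 10
gaps = markov_check(data)
print(gaps)  # large gaps (~0.5) flag hidden history dependence
```

With limited data some gap is expected from sampling noise, so treat this as a screening tool: large, systematic gaps are the signal that your state is missing information.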
When your state is well-formed, the rest of this chapter becomes powerful: Bellman equations become correct identities, and dynamic programming becomes a reliable tool rather than a fragile approximation.
Once the task is an MDP and you have a policy π(a|s), you can define value functions that measure long-term desirability. The state-value is Vπ(s) = E[ Gt | St=s ], where the return Gt = Rt+1 + γRt+2 + γ²Rt+3 + …. The action-value is Qπ(s,a) = E[ Gt | St=s, At=a ].
The Bellman expectation equations (milestone) express a self-consistency: today’s value equals immediate reward plus discounted value of tomorrow, under the same policy. For Vπ:
Vπ(s) = Σa π(a|s) Σs' P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
For Qπ:
Qπ(s,a) = Σs' P(s'|s,a) [ R(s,a,s') + γ Σa' π(a'|s') Qπ(s',a') ]
These are not “algorithms” yet; they are equations you can use to build algorithms. In tabular settings, you can solve them as a linear system (exact policy evaluation) or iterate them (iterative policy evaluation). Engineering judgment: iteration is usually simpler and numerically robust, but you need stopping criteria (e.g., max change < 1e-8) and you must handle terminal states carefully (no future value beyond terminal).
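Iterative policy evaluation with those cautions might look like the sketch below. It assumes a dict-of-dicts model where P[s][a] is a list of (prob, next_state, reward) triples and terminal states self-loop with zero reward; the two-state example is illustrative:

```python
def policy_evaluation(P, policy, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Iteratively apply the Bellman expectation backup for a fixed policy.

    P: dict mapping state -> action -> list of (prob, next_state, reward).
    policy: dict mapping state -> dict of action -> probability.
    Stops when the largest value change in a sweep falls below tol.
    """
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        delta = 0.0
        for s in P:
            v = sum(
                pi_a * p * (r + gamma * V[s2])
                for a, pi_a in policy[s].items()
                for p, s2, r in P[s][a]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    return V

# Two-state chain: from "s0", action "go" reaches terminal "end" with reward 1.
P = {
    "s0": {"go": [(1.0, "end", 1.0)]},
    "end": {"go": [(1.0, "end", 0.0)]},  # terminal self-loop, zero reward
}
policy = {"s0": {"go": 1.0}, "end": {"go": 1.0}}
print(policy_evaluation(P, policy, gamma=0.9))  # V(s0)=1.0, V(end)=0.0
```

Note that the policy averages over actions (the pi_a factor) and the environment averages over next states (the p factor), which is exactly the separation of expectations described above.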
Common mistakes: mixing up which expectation belongs where (policy averages over actions; environment averages over next states), and forgetting that rewards can depend on (s,a,s'). When debugging, print a single state’s Bellman target and verify each term: probabilities sum to 1, rewards match your environment, and discounting is applied exactly once per step.
Expectation equations evaluate a fixed policy. Optimality equations answer the harder question: what is the best achievable value? Define V*(s) = maxπ Vπ(s), and similarly Q*(s,a). The Bellman optimality equations replace the policy expectation with a max:
V*(s) = maxa Σs' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Q*(s,a) = Σs' P(s'|s,a) [ R(s,a,s') + γ maxa' Q*(s',a') ]
This section supports the milestone “use Bellman optimality to reason about greedy improvement.” The core idea is policy improvement: if you have an estimate of action-values Qπ, you can build a new policy π' that is greedy with respect to it: π'(s) = argmaxa Qπ(s,a). Under standard conditions, this greedy policy is guaranteed to be no worse than π (and often better). This is the theoretical engine behind policy iteration and value iteration.
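The greedy improvement step itself is one line per state. A minimal sketch, assuming Q is stored as a dict of dicts:

```python
def greedy_policy(Q):
    """Build a deterministic policy that is greedy with respect to Q.

    Q: dict mapping state -> dict mapping action -> estimated value.
    Returns a dict mapping each state to its argmax action.
    """
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

Q = {"s0": {"left": 0.2, "right": 0.7}, "s1": {"left": 1.1, "right": 0.4}}
print(greedy_policy(Q))  # {'s0': 'right', 's1': 'left'}
```

Alternating evaluation (compute Qπ) with this improvement step is the whole of policy iteration; the guarantee above says each alternation can only hold or raise the value.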
Practical takeaway: “greedy” is safe as an improvement step when your Q-values are accurate. In learning algorithms, Q-values are approximate and noisy, so pure greedy action selection can get stuck early. That is why later chapters introduce exploration strategies (e.g., ε-greedy). But the target you learn from in many methods still uses a greedy max term; that is Bellman optimality showing up inside learning updates.
Naming the two targets distinctly in code (bellman_expectation_target vs bellman_optimality_target) helps prevent subtle bugs. When you internalize that Bellman optimality is simply “do the best action now and assume best behavior later,” you can reason about optimal strategies without running a simulation, as long as your MDP model is correct.
In small, finite MDPs, you can compute values numerically and use them as ground truth. This is a milestone because it builds intuition and gives you a debugging reference before you move to sampling-based learning. The key is to represent dynamics as matrices and vectors.
Suppose |S|=n and |A|=m. For each action a, define an n×n transition matrix Pa where Pa[i,j] = P(sj|si,a). Define an expected reward vector ra of length n with ra[i] = Σs' P(s'|si,a) R(si,a,s'). For a stochastic policy π, the induced transition matrix is Pπ = Σa diag(π(a|s)) Pa (conceptually: average transitions by action probabilities). Then the Bellman expectation equation becomes a linear system:
Vπ = rπ + γ Pπ Vπ, which rearranges to (I − γ Pπ) Vπ = rπ
In Python, you can solve this with numpy.linalg.solve for exact evaluation (assuming the matrix is well-conditioned). Or you can perform iterative evaluation: initialize V = 0 and repeatedly apply V ← r + γ P V until convergence. Iteration is closer to how dynamic programming is usually implemented and mirrors later TD learning.
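A sketch of the exact route, using an illustrative two-state chain (state 0 moves to an absorbing state 1 with reward 1):

```python
import numpy as np

def evaluate_policy_exactly(P_pi, r_pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for the exact policy values.

    P_pi: (n, n) transition matrix induced by the policy (rows sum to 1).
    r_pi: (n,) expected one-step reward under the policy.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# State 0 moves to state 1 (reward 1); state 1 is absorbing with reward 0.
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])
print(evaluate_policy_exactly(P_pi, r_pi, gamma=0.9))  # [1. 0.]
```

For γ < 1 and a proper stochastic matrix, (I − γ Pπ) is invertible, so the solve is well-defined; conditioning only becomes a concern as γ approaches 1.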
A practical exercise is to implement a 3-state MDP (start, middle, terminal) with two actions (safe vs risky), compute Vπ for a chosen policy, then compute V* by value iteration using the optimality update. Seeing the numbers stabilize over iterations builds the mental model you will rely on later when values are learned from samples instead of computed from known P and R.
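One possible version of that exercise is sketched below. The state indices, rewards, and probabilities are illustrative choices, not prescribed values; the structure (safe action advances slowly, risky action gambles on a jump to terminal) is what matters:

```python
import numpy as np

# 3 states: 0 = start, 1 = middle, 2 = terminal. 2 actions: 0 = safe, 1 = risky.
# P[a][s] lists (prob, next_state, reward). The terminal state self-loops.
P = {
    0: {  # safe: always advance one state, modest reward at the end
        0: [(1.0, 1, 0.0)],
        1: [(1.0, 2, 1.0)],
        2: [(1.0, 2, 0.0)],
    },
    1: {  # risky: may jump straight to terminal with a big reward, or fall back
        0: [(0.5, 2, 3.0), (0.5, 0, 0.0)],
        1: [(0.5, 2, 3.0), (0.5, 0, 0.0)],
        2: [(1.0, 2, 0.0)],
    },
}

def value_iteration(P, n_states, gamma=0.9, tol=1e-10):
    """Apply the Bellman optimality backup until values stabilize."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[a][s]) for a in P)
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration(P, n_states=3)
greedy = [max(P, key=lambda a: sum(p * (r + 0.9 * V_star[s2]) for p, s2, r in P[a][s]))
          for s in range(3)]
print("V*:", np.round(V_star, 3), "greedy actions:", greedy)
```

Printing V after each sweep (instead of only at convergence) shows value propagating backward from the terminal state, which is the intuition this milestone is meant to build.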
Not every RL problem fits neatly into an MDP with a small, clean state. Many realistic environments are partially observable: the agent receives observations o rather than true states s, and those observations may not satisfy the Markov property. This is where many projects fail quietly: the code runs, but learning stalls or produces brittle behavior.
Classic examples: a noisy sensor, a hidden mode (e.g., “slippery floor” today vs not), or a delayed effect (a door that closes 5 seconds after you push a button). If the observation doesn’t contain the hidden variable, two identical observations can require different actions depending on past events. Your model is now a POMDP (Partially Observable MDP), and the Bellman equations on observations are not valid identities.
How do you proceed in practice? First, decide whether you can restore Markov by state augmentation. Add variables you can compute from data: elapsed time, last action, last reward, velocity, inventory, cooldown timers. In environments built from game rules, it is often possible to expose the internal state directly and keep the problem as an MDP.
If augmentation is impossible (true hidden state), you need memory. Common approaches include stacking recent observations (e.g., last 4 frames), using a recurrent network (GRU/LSTM) to build a belief-like internal state, or maintaining an explicit Bayesian belief state when the model is known. Even in tabular settings, you can create a “history state” like (ot-1, ot) to approximate Markov behavior.
The key milestone here is not to avoid partial observability, but to spot it early. Once you can articulate exactly why the Markov property breaks, you can choose an engineering solution (augment, add memory, or accept approximation) instead of guessing why learning is unstable.
1. Which set of components is required to precisely specify an RL task as a Markov Decision Process (MDP) in this chapter?
2. What do the Bellman expectation equations enable you to do for a fixed policy?
3. How does Bellman optimality connect to greedy improvement in the chapter’s workflow?
4. After you implement a tiny finite MDP in Python, what is a primary reason to compute values numerically?
5. What does it typically indicate when the Markov property breaks for your chosen state representation?
So far you have framed reinforcement learning problems in terms of states, actions, rewards, and returns. In this chapter you will do something that feels almost “too easy” compared to learning from experience: you will solve an MDP by planning. Dynamic programming (DP) methods assume you already know the environment’s dynamics—its transition probabilities and reward function. When that model is available, DP becomes your most reliable tool for computing optimal policies in tabular settings.
The practical importance of DP goes beyond small toy problems. DP solutions become “gold baselines” you can trust. Later, when you build Monte Carlo and temporal-difference agents, you will use DP as a reference to detect bugs, validate reward scaling, and sanity-check exploration and learning rates. DP also teaches you the Bellman equations in an operational way: you will implement the backups and see how values propagate across the state space.
This chapter is organized around milestones you can implement directly in Python: (1) solve known MDPs with iterative policy evaluation; (2) implement policy iteration end-to-end; (3) implement value iteration and compare it with policy iteration; (4) make engineering decisions about stopping criteria and compute cost; and (5) use DP outputs as test oracles for later learning algorithms.
Practice note for Milestone: Solve known MDPs with iterative policy evaluation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Implement policy iteration end-to-end: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Implement value iteration and compare with policy iteration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Analyze convergence, complexity, and stopping criteria: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Use DP solutions as gold baselines for later learning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Planning methods (DP) and learning methods (Monte Carlo, TD) often use similar equations, but they differ in what data they rely on. Planning assumes you know the full MDP model: the transition distribution P(s'|s,a) and expected reward R(s,a,s') (or R(s,a)). Learning assumes you can only interact with the environment to gather samples. That single difference changes your workflow and your failure modes.
In planning, you compute expectations by summing over all next states. This gives low-variance updates, deterministic given the model, and clear convergence guarantees for discounted problems. In learning, you replace sums with samples; you trade bias/variance and must manage exploration. Practically, DP is what you do when the environment is small and known (gridworlds, board-game abstractions, inventory control with known dynamics), or when you want a reference solution for debugging.
DP also changes how you think about “data.” You do not collect trajectories; you perform sweeps over the state space. A sweep is simply: for each state, update its value using the current estimates of successor states. The chapter milestones reflect that mindset: you will start with iterative policy evaluation (evaluate a fixed policy via repeated sweeps), then add improvement to get policy iteration, and finally compress evaluation and improvement into value iteration.
By the end of this chapter you should be able to implement DP solvers as small, testable modules: a function to compute Bellman expectations for a policy, another to greedify a policy w.r.t. values, and wrappers for policy/value iteration with clear stopping tolerances.
Iterative policy evaluation is your first milestone: solve known MDPs with iterative policy evaluation. Given a policy π(a|s) and a discount γ, the state-value function is defined by the Bellman expectation equation:
Vπ(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
In tabular DP you compute this by repeated sweeps. Start with V(s)=0 (or any initialization), then for each state apply the right-hand side using the current V. After enough sweeps, values converge to Vπ for discounted continuing tasks and for episodic tasks with proper terminal handling.
There are two classic sweep styles:
- Synchronous sweeps: compute V_new from the old V, then replace at the end of the sweep. Easier to reason about and test.
- In-place sweeps: overwrite V(s) immediately, so later states in the sweep see fresher values. Often converges faster, but results can depend on state ordering.

Practical workflow: represent the model as nested dictionaries or arrays: for each (s,a), store a list of transitions (prob, s_next, reward, done). Then policy evaluation is a tight loop over states and actions. Track Δ = max_s |V_new(s)-V(s)| after each sweep and stop when Δ < θ (a small tolerance). This is your first introduction to stopping criteria that are principled yet engineering-driven.
Common mistakes include forgetting to zero out the bootstrap term when transitioning into terminal states (treat terminal as having value 0), mixing up expected reward vs sampled reward, and failing to normalize policy probabilities. A good habit is to unit-test one state’s update by hand on a tiny MDP before running full sweeps.
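The workflow described here can be sketched as a small evaluator. The two-state model below is an illustrative assumption (not from the text), but the model layout follows the (prob, s_next, reward, done) convention used throughout the chapter:

```python
# Iterative policy evaluation on a tiny, hypothetical 2-state MDP.
# Model layout: model[s][a] = [(prob, s_next, reward, done), ...]
model = {
    0: {0: [(0.8, 1, 1.0, True),     # reach terminal state 1, reward 1
            (0.2, 0, 0.0, False)],   # slip: stay in state 0
        1: [(1.0, 0, 0.0, False)]},  # stay put
    1: {0: [(1.0, 1, 0.0, True)]},   # terminal self-loop
}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}  # pi(a|s)

def policy_evaluation(model, policy, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:                      # one in-place sweep
            v_new = 0.0
            for a, pa in policy[s].items():
                for prob, s2, r, done in model[s][a]:
                    # zero out the bootstrap when transitioning into a terminal
                    v_new += pa * prob * (r + (0.0 if done else gamma * V[s2]))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                    # stop when the sweep barely moves V
            return V

V = policy_evaluation(model, policy)
```

For this MDP the fixed point can be checked by hand: V(0) satisfies V(0) = 0.4 + 0.54·V(0), i.e., V(0) = 0.4/0.46 ≈ 0.87, and V(1) = 0 because it is terminal. Exactly the kind of one-state hand check the text recommends.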
Policy iteration is the second milestone: implement policy iteration end-to-end. It combines two alternating steps: (1) evaluate the current policy to get Vπ, then (2) improve the policy by acting greedily with respect to that value function.
To improve a policy, you compute action-values under the current value function:
Qπ(s,a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
Then set the new policy to choose actions that maximize Qπ. In deterministic form, π_new(s) = argmax_a Qπ(s,a). In stochastic form, you can split probability mass among ties. This is not a cosmetic detail: consistent tie-breaking improves reproducibility and can prevent oscillations in symmetric MDPs.
The key theoretical property is the policy improvement theorem: if you greedify w.r.t. Vπ, you get a policy that is no worse (and often better) in every state. Practically, policy iteration tends to converge in few improvement steps because evaluation makes the policy “fully informed” before changing it.
Implementation tips:
- Warm-start each evaluation phase from the previous value function rather than from zeros; the policy usually changes little between improvements, so this saves many sweeps.
- Terminate the outer loop when the policy is stable (no state changes its greedy action), not only when values stop moving.
- Break argmax ties with a fixed convention (e.g., smallest action index) so the stability check is reliable across runs.
As an engineering milestone, once policy iteration works, you effectively have a “solver” for any tabular MDP with a known model. Save the resulting V* and π* to disk; you will later compare learning curves of TD methods against these optima.
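A minimal solver along these lines might look as follows. The 2-state model is an illustrative assumption; evaluation reuses the previous V as a warm start and ties break to the smallest action index, per the tips above:

```python
# Policy iteration sketch on a hypothetical 2-state MDP.
# Model layout: model[s][a] = [(prob, s_next, reward, done), ...]
GAMMA = 0.9
model = {
    0: {0: [(1.0, 1, 1.0, True)],    # reach terminal, reward 1
        1: [(1.0, 0, 0.0, False)]},  # stay put, reward 0
    1: {0: [(1.0, 1, 0.0, True)]},
}

def evaluate(policy, V=None, theta=1e-8):
    V = dict(V) if V else {s: 0.0 for s in model}   # warm start when available
    while True:
        delta = 0.0
        for s in model:
            v = sum(pa * prob * (r + (0.0 if done else GAMMA * V[s2]))
                    for a, pa in policy[s].items()
                    for prob, s2, r, done in model[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def q_values(s, V):
    return {a: sum(prob * (r + (0.0 if done else GAMMA * V[s2]))
                   for prob, s2, r, done in model[s][a])
            for a in model[s]}

def policy_iteration():
    policy = {s: {a: 1.0 / len(model[s]) for a in model[s]} for s in model}
    V = None
    while True:
        V = evaluate(policy, V)
        stable = True
        for s in model:
            q = q_values(s, V)
            best = min(a for a in q if q[a] == max(q.values()))  # smallest-index tie-break
            new = {a: (1.0 if a == best else 0.0) for a in model[s]}
            if new != policy[s]:
                stable = False
            policy[s] = new
        if stable:                      # stop on policy stability, not value change
            return policy, V

pi_star, V_star = policy_iteration()
```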
Value iteration is the third milestone: implement value iteration and compare with policy iteration. Instead of fully evaluating a policy and then improving it, value iteration applies the Bellman optimality operator directly to the value function:
V(s) ← max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ]
The max is the defining feature. It turns the expectation backup into an optimality backup: you are no longer computing values for a fixed policy, you are pushing V toward V*. After convergence, you extract a greedy policy: π*(s)=argmax_a Σ_{s'} P(s'|s,a)[R+γV*(s')].
In practice, value iteration often requires more sweeps than policy iteration but each sweep is simpler—there is no separate evaluation loop nested inside the algorithm. If your evaluation step in policy iteration is expensive (many sweeps per improvement), value iteration can be faster overall. If the MDP is small and evaluation is cheap, policy iteration may converge in fewer outer iterations and be easier to debug.
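A compact value-iteration sketch, on the same illustrative 2-state model used elsewhere in this chapter (the model itself is an assumption, not prescribed by the text):

```python
# Value iteration: apply the Bellman optimality backup until values stop moving.
GAMMA, THETA = 0.9, 1e-8
model = {
    0: {0: [(1.0, 1, 1.0, True)], 1: [(1.0, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)]},
}

def backup(s, a, V):
    return sum(prob * (r + (0.0 if done else GAMMA * V[s2]))
               for prob, s2, r, done in model[s][a])

def value_iteration():
    V = {s: 0.0 for s in model}
    sweeps = 0
    while True:
        delta = 0.0
        for s in model:
            v = max(backup(s, a, V) for a in model[s])  # optimality backup: max over actions
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        sweeps += 1
        if delta < THETA:
            break
    # extract the greedy policy once, with smallest-index tie-breaking
    policy = {s: min(a for a in model[s]
                     if backup(s, a, V) == max(backup(s, b, V) for b in model[s]))
              for s in model}
    return V, policy, sweeps

V_star, pi_star, sweeps = value_iteration()
```

Note there is no π(a|s) anywhere in the backup; the max replaces the policy average, which is exactly the difference from policy evaluation.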
Two practical comparisons you should run:
- Cost to convergence: count sweeps (and wall-clock time) each algorithm needs to reach the same greedy policy on the same MDP.
- Anytime behavior: track the quality of the greedy policy extracted as V improves. You can stop early and still get a usable policy, which is useful when you need an approximate plan quickly.

Common mistake: updating V(s) using a policy probability distribution during value iteration. If you include π(a|s) you are performing policy evaluation, not optimal control. Another mistake is extracting the policy only once at the end but forgetting to handle ties consistently; in symmetric environments, different tie-breakers can produce different (but still optimal) policies.
DP works because the Bellman operators for discounted MDPs are contractions. Informally, a contraction shrinks differences between value functions every time you apply the operator. For discount γ with 0 ≤ γ < 1, both the policy evaluation operator Tπ and the optimality operator T* contract in the max norm by at most a factor of γ. This is the mathematical reason iterative updates converge instead of bouncing around.
Why this matters in code: you need stopping criteria that balance runtime and accuracy. The standard approach is to monitor Δ = max_s |V_new(s)-V(s)|. When Δ becomes small, your value function is close to a fixed point. A useful engineering rule is to choose θ based on the reward scale. If rewards are in [-1, 1], tolerances like 1e-6 can be overkill; 1e-4 or 1e-3 is often enough for stable greedy policies.
Also consider the effect of γ. When γ is close to 1, contraction is weak and convergence slows; you will need more sweeps or looser tolerances. This is not a bug—your MDP is valuing far-future outcomes, so information must propagate farther through the state graph.
Each full sweep costs O(|S||A||S'|) if you iterate over all next states. With sparse transitions (common in gridworlds), count only non-zero transitions to keep it practical.

This section completes the milestone on convergence, complexity, and stopping criteria: your goal is not just to “get convergence,” but to choose a tolerance and iteration budget that gives correct decisions at reasonable cost.
DP is deterministic given a model, but implementations still fail in subtle ways. The most common pitfall is mishandling stochasticity. If your transition model lists multiple outcomes for the same action, you must sum over all of them, multiplying by their probabilities. Skipping low-probability outcomes or forgetting to normalize probabilities can quietly distort values and produce wrong policies that still “look plausible.”
Tie-breaking is another practical issue. In many environments, multiple actions can be equally good within tolerance. If you use argmax without care, tiny floating-point differences can cause the chosen action to flip between runs or between iterations, making policy stability checks unreliable. A robust approach is: compute all actions within ε of the max (e.g., 1e-12), then either pick the smallest-index action (deterministic) or distribute probability uniformly among ties (stochastic but controlled). Record the convention and keep it consistent.
Evaluation also means more than “compute V.” To use DP solutions as gold baselines (the final milestone), you should verify them in at least two ways:
- Bellman residual check: measure the residual of V under Tπ or T*. Large residuals indicate a bug or insufficient sweeps.
- Rollout check: simulate episodes under the policy and confirm that the average sampled return from the start state matches V(s0) within sampling error.

Finally, be careful with terminal states and episodic tasks. A common mistake is to continue discounting past termination, effectively giving terminal states outgoing transitions. In your transition table, mark terminal transitions with done=True and ensure the backup uses γ * V(s') only when not done.
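Both checks can be scripted against a solved model. The 2-state model and the claimed V*/π* below are illustrative assumptions; in practice you would load the values your solver saved:

```python
import random

# Two sanity checks for a DP baseline on a hypothetical 2-state model.
GAMMA = 0.9
model = {
    0: {0: [(1.0, 1, 1.0, True)],    # reach terminal state 1, reward 1
        1: [(1.0, 0, 0.0, False)]},  # stay in state 0, reward 0
    1: {0: [(1.0, 1, 0.0, True)]},
}
V_star = {0: 1.0, 1: 0.0}   # claimed optimal values
pi_star = {0: 0, 1: 0}      # claimed optimal (deterministic) policy

def backup(s, a, V):
    return sum(p * (r + (0.0 if done else GAMMA * V[s2]))
               for p, s2, r, done in model[s][a])

# Check 1: Bellman optimality residual should be ~0 at V*.
residual = max(abs(max(backup(s, a, V_star) for a in model[s]) - V_star[s])
               for s in model)

# Check 2: average sampled return from s0 under pi* should match V*(s0).
def rollout(s, rng):
    G, discount = 0.0, 1.0
    while True:
        u, cum = rng.random(), 0.0
        for p, s2, r, done in model[s][pi_star[s]]:
            cum += p                     # sample the next transition by its probability
            if u <= cum:
                G += discount * r
                if done:
                    return G
                s, discount = s2, discount * GAMMA
                break

rng = random.Random(0)
mean_return = sum(rollout(0, rng) for _ in range(1000)) / 1000
```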
When DP is implemented with these details in mind, it becomes a dependable reference implementation. Later chapters will introduce sampling noise and function approximation; having a rock-solid DP baseline will keep you grounded when learning curves behave unexpectedly.
1. What key assumption makes dynamic programming (DP) methods applicable for solving an MDP by planning?
2. Why does the chapter emphasize DP solutions as “gold baselines” later in the course?
3. In operational terms, what does implementing Bellman backups in DP help you observe?
4. Which milestone best matches the goal of evaluating a fixed policy’s value function using DP?
5. When choosing a stopping criterion for DP planning, what trade-off is the chapter highlighting?
So far you have seen how to solve tabular Markov decision processes (MDPs) when you can compute expectations from a known transition model. In many real problems you do not get that luxury: you only get experience. This chapter moves into the model-free setting, where learning happens from sampled trajectories (episodes or continuing streams) produced by interacting with the environment.
We will build two families of methods that form the backbone of practical reinforcement learning: Monte Carlo (MC) methods that learn from complete returns, and temporal-difference (TD) methods that learn by bootstrapping from existing value estimates. You will implement MC prediction to estimate Vπ from episodes, implement TD(0) and compare its bias/variance trade-offs against MC, use n-step returns as a bridge between them, and then transition from prediction to control by implementing SARSA. Finally, you will create an experiment harness to compare learning curves and make engineering decisions based on evidence rather than intuition.
Throughout, keep the workflow consistent: (1) define a policy and value representation, (2) generate experience, (3) compute targets, (4) update estimates, (5) evaluate and iterate. Small implementation details—how you store episodes, how you seed randomness, how you average returns—matter a lot in model-free RL.
Practice note for Milestone: Estimate Vπ with Monte Carlo methods from episodes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Implement TD(0) and compare bias/variance vs MC: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Use n-step returns to bridge MC and TD: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Implement SARSA for on-policy control: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Build an experiment harness to compare learning curves: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
In the model-free setting, you do not know P(s'|s,a) or R(s,a) as tables you can query. Instead, you sample transitions by stepping through the environment: (s_t, a_t, r_{t+1}, s_{t+1}). Your algorithms must estimate values from these samples. This changes both the math (expectations become sample averages) and the engineering (you must build data structures to record trajectories and compute returns reliably).
Two experience regimes are common. In episodic tasks, interaction ends at a terminal state; you can store the entire episode and compute full returns. In continuing tasks, there may be no terminal state; you often use discounted returns and online updates. This chapter focuses on episodic sampling first (for Monte Carlo), then moves to online updates (for TD and control).
Three pieces of bookkeeping recur throughout:
- Episode storage: record states, actions, rewards, and dones.
- Return computation: G_t = r_{t+1} + γ r_{t+2} + …, computed from the sampled rewards.
- Estimation targets: Vπ(s) = Eπ[G_t | s_t=s] or Qπ(s,a), estimated from data generated by policy π.

Common mistakes here are off-by-one reward indexing (mixing r_t vs r_{t+1}), forgetting to reset the environment between episodes, and accidentally using a different policy for acting than the one you claim to evaluate. When debugging, print the first episode’s transitions and manually compute one return to validate your bookkeeping.
Practical workflow: write a function generate_episode(env, policy, seed=None) that returns a structured episode object. Keep it deterministic under a seed, because reproducibility is essential when you later build a learning-curve harness.
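A minimal sketch of that function, assuming a toy environment with a `reset()`/`step(action)` interface (the `CoinWalkEnv` class and its dynamics are illustrative inventions, not part of the text):

```python
import random
from collections import namedtuple

Episode = namedtuple("Episode", ["states", "actions", "rewards", "dones"])

class CoinWalkEnv:
    """Hypothetical toy env: action 0 ends the episode with reward 1,
    action 1 keeps it going with reward 0 (capped at 10 steps)."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        done = (action == 0) or (self.t >= 10)
        reward = 1.0 if action == 0 else 0.0
        return 0, reward, done

def generate_episode(env, policy, seed=None):
    rng = random.Random(seed)           # seeded RNG => reproducible episodes
    s = env.reset()
    states, actions, rewards, dones = [s], [], [], []
    done = False
    while not done:
        a = policy(s, rng)
        s, r, done = env.step(a)
        actions.append(a)
        rewards.append(r)
        dones.append(done)
        states.append(s)
    return Episode(states, actions, rewards, dones)

ep = generate_episode(CoinWalkEnv(), lambda s, rng: rng.choice([0, 1]), seed=0)
```

Because the policy draws from the seeded RNG, the same seed always reproduces the same episode, which is exactly what the learning-curve harness later relies on.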
Monte Carlo (MC) prediction estimates values using complete sampled returns from episodes. The key idea is simple: if you observe many returns following visits to a state, the average return approximates Vπ(s). This directly supports the milestone: estimate Vπ with Monte Carlo methods from episodes.
Implementation pattern (first-visit MC): for each episode, compute G_t backwards, and for each state’s first occurrence in that episode, update an average. The incremental mean update is numerically stable and memory efficient: V[s] ← V[s] + (1/N[s]) (G − V[s]). You can also use a constant step size α to track non-stationarity, but in stationary tabular problems the sample mean is a strong baseline.
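The pattern can be sketched as follows. To keep the example self-contained and deterministic, the episode generator is hard-coded to a two-state chain (an illustrative assumption); in practice it would sample from your environment:

```python
# First-visit Monte Carlo prediction sketch.
def sample_episode():
    # states visited, and the reward observed after leaving each state:
    # 0 -> 1 (reward 0), then 1 -> terminal (reward 1)
    return [0, 1], [0.0, 1.0]

def mc_prediction(num_episodes, gamma=0.9):
    V, N = {}, {}
    for _ in range(num_episodes):
        states, rewards = sample_episode()
        # index of each state's first occurrence in this episode
        first_visit = {s: t for t, s in reversed(list(enumerate(states)))}
        G, returns = 0.0, [0.0] * len(states)
        for t in reversed(range(len(states))):   # compute G_t backwards
            G = rewards[t] + gamma * G
            returns[t] = G
        for s, t in first_visit.items():         # incremental mean update
            N[s] = N.get(s, 0) + 1
            V[s] = V.get(s, 0.0) + (returns[t] - V.get(s, 0.0)) / N[s]
    return V

V = mc_prediction(500)
```

With gamma = 0.9 the true values here are V(1) = 1.0 and V(0) = 0.9, so the estimates can be verified against a hand calculation.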
MC control (improving a policy from sampled returns) introduces an exploration challenge: you must ensure every state-action pair you care about is visited. One classical idea is exploring starts: begin episodes from random state-action pairs so all pairs are reachable and sampled. This is often unrealistic in real environments but very useful pedagogically and in simulators. In practice, you more often use an ε-soft policy (random actions with small probability) to guarantee ongoing exploration.
Common mistakes: forgetting that MC is on-policy unless you add importance sampling; mixing evaluation and improvement logic without enough exploration; and evaluating a changing policy with returns generated by older policies (non-stationary data). To keep things clean, start with pure prediction: fix π, generate episodes with π, and only then consider control.
Temporal-difference (TD) methods update estimates from incomplete experience by combining a sampled reward with a bootstrap estimate of the future. TD(0) for state values uses the one-step target r_{t+1} + γ V(s_{t+1}). This enables online learning: you can update after every step without waiting for the episode to end. This section supports the milestone: implement TD(0) and compare bias/variance vs MC.
The essential difference is the target. MC uses G_t, a high-variance but unbiased estimate of Vπ(s_t). TD(0) uses r_{t+1} + γ V(s_{t+1}), which depends on current estimates and is therefore biased early on, but usually has much lower variance and learns faster from limited data.
The tabular TD(0) update is:

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]

If s_{t+1} is terminal, treat V(s_{t+1}) = 0 (i.e., do not bootstrap beyond the terminal). Bootstrapping is a feature and a risk. It is a feature because it propagates information quickly (a single reward can influence many preceding states via repeated updates). It is a risk because poor estimates can reinforce themselves if learning rates and exploration are mismanaged. In tabular settings TD(0) is typically stable, but you should still watch for divergence symptoms: values exploding, learning curves oscillating wildly, or results changing drastically with tiny hyperparameter tweaks.
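A minimal TD(0) prediction sketch. The deterministic three-state chain and the constants are illustrative assumptions chosen so the result is checkable by hand:

```python
# TD(0) prediction on a hypothetical chain: state 0 -> state 1 (reward 0)
# -> terminal (reward 1). True values with gamma=0.9: V(1)=1.0, V(0)=0.9.
GAMMA, ALPHA = 0.9, 0.1

def td0_prediction(episodes):
    V = {0: 0.0, 1: 0.0}
    for _ in range(episodes):
        # (s, r, s_next, done) transitions of one episode
        for s, r, s2, done in [(0, 0.0, 1, False), (1, 1.0, None, True)]:
            target = r + (0.0 if done else GAMMA * V[s2])  # no bootstrap past terminal
            V[s] += ALPHA * (target - V[s])                # move V toward the TD target
    return V

V = td0_prediction(2000)
```

The update happens after every step, with no need to wait for the episode's return; that is the online property the text describes.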
Practical comparison experiment: fix a policy π, run MC prediction and TD(0) prediction on the same environment for a fixed number of episodes, and measure error versus a reference value (from dynamic programming or a very large MC run). You should observe MC improving slowly but steadily, and TD(0) improving sooner with some bias early on.
The TD error is the learning signal that powers TD prediction and many control algorithms: δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t) (or the analogous form for Q). Interpreting δ correctly helps you debug: positive δ means “better than expected,” negative means “worse than expected.” Logging the distribution of δ over time is often more informative than logging raw values.
Choosing the learning rate α is the most important engineering decision in tabular TD. With a constant α, learning can remain adaptive but may not fully converge; with a decaying α_t(s)=1/N(s), convergence is more likely in stationary problems but early learning may be slower. For control, constant α is common because the policy changes over time (the target is non-stationary).
A reasonable starting range is α in [0.05, 0.5] for small tabular problems; then tune by looking for smooth improvement without persistent oscillation. Stability is also affected by episode length and discount γ. With γ near 1, targets have longer effective horizons and variance increases; you may need smaller α. Another common mistake is mishandling terminal states: if you bootstrap from terminal values accidentally, you can leak value back into states that should not receive it.
This is where an experiment harness becomes indispensable: run multiple seeds, track mean and confidence intervals, and prefer conclusions that hold across seeds. A single lucky run can make an unstable configuration look “good.”
MC and TD(0) are endpoints on a spectrum. n-step methods bridge them by using targets that look ahead n steps with sampled rewards and then bootstrap: G_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n-1} r_{t+n} + γ^n V(s_{t+n}). This supports the milestone: use n-step returns to bridge MC and TD. Setting n=1 yields TD(0); letting n reach the episode end yields MC.
Practically, n-step methods often learn faster than TD(0) because they incorporate more real rewards before bootstrapping, while still updating earlier than full MC. The trade-off is implementation complexity: you need a sliding window (queue) of recent transitions, and you must handle episode termination carefully when fewer than n steps remain.
Implementation sketch:
- Keep recent (s, r) pairs in a deque; when its length exceeds n, compute and apply the n-step update for the oldest state.
- When an episode ends with done=True, keep flushing the deque by performing the remaining updates with truncated returns (no bootstrap beyond terminal).

Eligibility traces provide intuition for why n-step updates help. Instead of waiting to update a state until it becomes the “oldest” in a window, traces assign each visited state a decaying credit that determines how strongly new TD errors update it. You can think of this as blending many n-step returns at once, weighting longer-horizon credit by a parameter λ (trace decay). Even if you do not implement full TD(λ) yet, keep the mental model: recent states get more credit; earlier states get some credit, but less. This intuition becomes crucial when you move to function approximation and deep RL later.
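The deque-based pattern can be sketched on a deterministic chain. The four-state chain (0 → 1 → 2 → terminal, reward 1 on the last step) and the constants are illustrative assumptions; true values with gamma=0.9 are V(2)=1, V(1)=0.9, V(0)=0.81:

```python
from collections import deque

# n-step TD prediction sketch with a sliding window of (state, reward) pairs.
GAMMA, ALPHA, N = 0.9, 0.2, 2

def nstep_td(episodes):
    V = {0: 0.0, 1: 0.0, 2: 0.0}
    for _ in range(episodes):
        transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]  # (s, r, s_next)
        window = deque()                         # recent (state, reward) pairs

        def flush(bootstrap_state):
            # n-step (or truncated) return for the oldest state in the window
            G = sum((GAMMA ** i) * r_i for i, (_, r_i) in enumerate(window))
            if bootstrap_state is not None:
                G += (GAMMA ** len(window)) * V[bootstrap_state]
            s0, _ = window.popleft()
            V[s0] += ALPHA * (G - V[s0])

        for s, r, s2 in transitions:
            window.append((s, r))
            if len(window) == N:
                flush(s2)                        # n-step target bootstraps at s2
        while window:                            # episode ended: flush with
            flush(None)                          # truncated returns, no bootstrap
    return V

V = nstep_td(3000)
```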
Prediction estimates values for a fixed policy; control improves the policy. SARSA is the classic on-policy TD control algorithm for learning action values Q(s,a) while behaving according to the same (exploratory) policy it is learning. This section ties directly to the milestone: implement SARSA for on-policy control.
SARSA’s update uses the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}): Q(s_t,a_t) ← Q(s_t,a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t,a_t)]. The next action a_{t+1} is sampled from the current behavior policy, so the method learns the value of what it actually does—including exploratory moves.
Exploration is typically implemented with ε-greedy: with probability ε choose a random action, otherwise choose an action that maximizes Q(s,·). Engineering judgment is required when scheduling ε. A constant ε (e.g., 0.1) keeps exploring forever; a decaying schedule (e.g., linear or exponential decay) shifts from exploration to exploitation. In small tabular tasks, a simple schedule like ε_t = max(ε_min, ε_0 * decay^episode) is usually sufficient.
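A self-contained SARSA sketch with ε-greedy behavior. The four-state corridor environment (move right from state 0 to reach the terminal goal state 3, reward 1) and all constants are illustrative assumptions:

```python
import random

# SARSA on a hypothetical corridor. Actions: 0 = left, 1 = right.
GAMMA, ALPHA, EPS = 0.95, 0.3, 0.1
GOAL = 3

def step(s, a):
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    done = (s2 == GOAL)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(Q, s, rng):
    if rng.random() < EPS:
        return rng.randrange(2)                      # explore
    return max((0, 1), key=lambda a: Q[(s, a)])      # exploit (ties -> action 0)

def sarsa(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, rng)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2, rng)              # on-policy: sample the next action
            target = r + (0.0 if done else GAMMA * Q[(s2, a2)])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()
greedy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in range(GOAL)]
```

Note that a_{t+1} is sampled from the same ε-greedy policy used for acting; taking a max over Q(s_{t+1}, ·) here instead would quietly turn this into Q-learning.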
A common bug is selecting a_{t+1} greedily during the update while still acting ε-greedy; that silently changes the algorithm into something closer to Q-learning.

Finally, build an experiment harness to compare learning curves across configurations (this satisfies the milestone about comparing curves). Standardize: same number of episodes, same evaluation protocol, multiple seeds, and consistent metrics. A minimal harness runs train(agent, env, episodes, seed), returns per-episode rewards, then aggregates mean and variability across seeds. With this in place, you can make grounded decisions such as “TD(0) reaches a useful policy in fewer episodes than MC control” or “a slower ε decay prevents premature convergence.”
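The aggregation part of such a harness can be sketched in a few lines. The `train(seed, episodes)` stub below is a dummy stand-in with a simplified signature (an assumption, not the text's `train(agent, env, episodes, seed)` shape); any learner returning per-episode rewards would plug in:

```python
import random
import statistics

# Minimal multi-seed learning-curve harness.
def train(seed, episodes):
    rng = random.Random(seed)
    # dummy learner: reward drifts upward with noise, purely for illustration
    return [min(1.0, ep / episodes) + rng.uniform(-0.1, 0.1)
            for ep in range(episodes)]

def run_experiment(train_fn, seeds, episodes):
    curves = [train_fn(seed, episodes) for seed in seeds]
    # aggregate across seeds, per episode index
    mean = [statistics.fmean(vals) for vals in zip(*curves)]
    spread = [statistics.stdev(vals) for vals in zip(*curves)]
    return mean, spread

mean, spread = run_experiment(train, seeds=range(5), episodes=100)
```

Plotting `mean` with a band of ±`spread` is the standard way to judge whether an improvement holds across seeds rather than in one lucky run.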
1. Which statement best contrasts Monte Carlo (MC) prediction with TD(0) prediction in the model-free setting?
2. When comparing MC and TD(0), which bias/variance trade-off is most consistent with the chapter summary?
3. How do n-step returns "bridge" MC and TD methods?
4. What change marks the transition from prediction to control in this chapter?
5. Why does the chapter emphasize building an experiment harness to compare learning curves?
This chapter is the pivot from “learning from what you do” to “learning from what you (or someone else) did.” That shift—off-policy learning—is what makes Q-learning so powerful in practice: you can collect data with an exploratory behavior policy, then learn a greedy control policy that targets optimal behavior. The engineering challenge is that this freedom introduces new failure modes: unstable updates, maximization bias, misleading evaluation, and brittle exploration schedules.
We will work toward a concrete milestone: implement tabular Q-learning for control in a small discrete MDP, then strengthen it with systematic tuning and diagnostics. Along the way, you will learn how to recognize and mitigate maximization bias (the reason vanilla Q-learning can overestimate values), apply basic off-policy evaluation with importance sampling, and debug common problems like divergence, reward hacking, and sparse rewards. By the end, you should be able to build a reliable training loop, choose exploration and learning-rate schedules on purpose, and evaluate progress without fooling yourself.
A practical mindset for this chapter: treat the algorithm as only one part of the system. Your results will depend just as much on exploration design, reproducibility (random seeds, logging), and validation (baselines, ablations) as on the update rule itself.
Practice note for Milestone: Implement tabular Q-learning for control: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Understand and mitigate maximization bias: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Apply off-policy evaluation basics and importance sampling: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Tune exploration schedules and learning rates systematically: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Diagnose failures (divergence, reward hacking, sparse rewards): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
In on-policy methods (like SARSA), the agent learns the value of the policy it actually follows. If you behave with an ε-greedy policy, your targets and updates reflect that same exploratory behavior. This creates a tight feedback loop: what you do is what you learn. On-policy learning is often more stable in certain settings because the data distribution matches the policy being evaluated.
Off-policy methods break that link. You collect experience using a behavior policy b(a|s) (often exploratory), but you learn about a different target policy π(a|s) (often greedy). Q-learning is the canonical example: it can behave ε-greedily, but its update targets the greedy action under current estimates. This difference matters because it enables sample reuse, learning from historical logs, and training from mixed data sources (self-play, demonstrations, replay buffers later in DQN).
However, off-policy learning introduces distribution mismatch: you may be updating values for actions rarely taken by the behavior policy. In tabular settings, Q-learning is still guaranteed to converge under classic assumptions (sufficient exploration, decaying step sizes), but in practice you can still get slow learning or misleading evaluation if you do not track what data you are actually collecting.
For the milestone “implement tabular Q-learning for control,” this section clarifies why the learned Q-function is aimed at optimal control even while you deliberately inject exploration into behavior.
Tabular Q-learning learns an action-value function Q(s,a) by bootstrapping from the best next action according to current estimates. Given a transition (s, a, r, s', done), the core update is:
target = r + γ * (1 - done) * max_a' Q(s', a')
Q(s,a) ← Q(s,a) + α * (target - Q(s,a))
The “greedy target” max_a' Q(s',a') is what makes Q-learning off-policy: even if your behavior chose a random action, the update assumes the next step will act optimally (greedily). Implementation-wise, the milestone is to produce a minimal, correct control loop: initialize Q to zeros, run episodes, select actions with an exploration policy, step the environment, update Q, and repeat.
Two details make or break real implementations. First, handle terminal transitions correctly by masking out the bootstrap term when done=True; otherwise values can leak beyond episode boundaries. Second, ensure your state representation is stable and hashable (e.g., integers for discrete states, tuples for grid positions). Silent bugs often come from indexing Q incorrectly or mixing up (s,a) keys.
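The update and both details above can be sketched in a few lines of Python. This is a minimal illustration, not a full training loop: it assumes a `defaultdict` Q-table keyed by hashable (state, action) pairs and hypothetical integer states.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, done, n_actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update; masks the bootstrap term on terminal steps."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    target = r + gamma * (1.0 - float(done)) * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Q[(state, action)] -> value; states must be stable and hashable (ints, tuples)
Q = defaultdict(float)
# one transition: in state 0, action 1 yielded reward 1.0 and a terminal state 2
q_update(Q, s=0, a=1, r=1.0, s_next=2, done=True, n_actions=2)
```

Note how `done=True` zeroes the bootstrap term, so no value leaks past the episode boundary; with zero-initialized values the update above leaves Q[(0, 1)] at α · r = 0.1.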
When your Q-learning loop is correct, you should see learning curves that improve steadily in simple environments. If not, Section 5.6 will show how to diagnose whether the issue is exploration, step size, reward scaling, or a bug in episode termination.
Q-learning needs sufficient exploration: if you never try an action, you cannot learn its consequences. The simplest strategy is ε-greedy: with probability ε pick a random action; otherwise pick argmax_a Q(s,a). The tuning problem is that ε controls two competing goals—exploration (discovering better actions) and exploitation (refining what you already believe is good). A fixed ε is rarely ideal across the whole training run.
A systematic approach is to treat ε as a schedule. Common choices: start high (e.g., ε = 1.0) and decay linearly to a floor ε_min over a fixed number of steps; decay multiplicatively each episode (ε ← max(ε_min, ε · decay)); or hold ε constant until evaluation returns plateau, then step it down. Keeping a nonzero floor matters: driving ε all the way to zero can freeze a suboptimal policy before rare-but-better actions are ever tried.
Soft policies like Boltzmann/softmax exploration choose actions proportional to exp(Q(s,a)/τ), where τ (temperature) controls randomness. Softmax can be more sensitive to Q scale: if rewards are large, it becomes nearly greedy unless you increase τ. ε-greedy is often more robust to reward scaling, which is why it is popular for tabular control.
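Both exploration strategies fit in a few lines. A minimal sketch, assuming Q-values arrive as a plain list indexed by action (the max-subtraction in softmax is a standard numerical-stability trick):

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability ε pick a uniformly random action; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_probs(q_values, tau=1.0):
    """Boltzmann exploration: P(a) ∝ exp(Q(a)/τ). Subtracting the max
    before exponentiating avoids overflow when Q-values are large."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 0.5]
greedy = epsilon_greedy(q, epsilon=0.0)  # ε = 0 → always the argmax action
probs = softmax_probs(q, tau=0.5)        # lower τ → closer to greedy
```

Try rescaling `q` by 100: the ε-greedy choice is unchanged, while the softmax distribution collapses onto the argmax—exactly the scale sensitivity described above.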
Learning rate α interacts with exploration. With high ε and high α, Q-values may fluctuate because you update from many random transitions. With low ε and high α, you may overfit early estimates and stop exploring too soon. A practical tuning workflow is to sweep a small grid: α in {0.1, 0.3, 0.5} and ε_min in {0.01, 0.05, 0.1}, keeping γ fixed, then choose based on median performance over multiple random seeds.
Practical outcome: you can now “tune exploration schedules and learning rates systematically” instead of guessing. This matters most in sparse-reward tasks, where you may need sustained exploration long enough to stumble into success at least a few times.
The max operator in Q-learning is both the feature and the bug. When Q estimates are noisy (early training, stochastic environments), taking max_a' Q(s',a') tends to select actions whose values are overestimated by chance. This creates maximization bias: Q-learning can systematically overestimate action values, which can slow learning or lead to brittle policies that chase phantom advantages.
You can often spot maximization bias by comparing estimated Q-values with empirical returns. For example, the greedy policy may appear to have high predicted value, but actual evaluation returns lag far behind, and Q-values keep inflating. Another sign is instability when you increase exploration or when rewards are stochastic: the max over noisy samples becomes more optimistic.
Double Q-learning reduces this bias by decoupling action selection from action evaluation. Maintain two tables, Q1 and Q2. On each update, randomly pick one to update. If updating Q1, use Q1 to select the best next action, but use Q2 to evaluate it:
target = r + γ * Q2(s', argmax_a' Q1(s',a'))
Symmetrically when updating Q2. Intuitively, the noise in Q1 and Q2 is less likely to agree on the same overly-optimistic action, so overestimation is reduced. This section fulfills the milestone “understand and mitigate maximization bias” with a concrete tool that stays tabular and easy to implement.
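The two-table update can be sketched directly from the rule above. A minimal illustration, assuming the same `defaultdict` Q-table convention as the tabular Q-learning loop:

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, s, a, r, s_next, done, n_actions,
                    alpha=0.1, gamma=0.99, rng=random):
    """Randomly pick one table to update; select argmax with it,
    evaluate that action with the other table."""
    if rng.random() < 0.5:
        sel, ev = Q1, Q2   # updating Q1: select with Q1, evaluate with Q2
    else:
        sel, ev = Q2, Q1   # symmetric case
    if done:
        target = r
    else:
        a_star = max(range(n_actions), key=lambda b: sel[(s_next, b)])
        target = r + gamma * ev[(s_next, a_star)]
    sel[(s, a)] += alpha * (target - sel[(s, a)])

Q1, Q2 = defaultdict(float), defaultdict(float)
double_q_update(Q1, Q2, s=0, a=0, r=1.0, s_next=1, done=False, n_actions=2)
```

At act time you would typically be greedy with respect to Q1 + Q2 (or their mean), so both tables inform behavior even though each update touches only one.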
Double Q-learning is also the conceptual bridge to Deep RL: Double DQN applies the same idea with neural networks and a target network. Understanding it here will make later DQN stabilization techniques feel motivated rather than magical.
Off-policy control answers “how do I learn a better policy from exploratory data?” Off-policy evaluation (OPE) answers “how good is a target policy, using data collected by a different behavior policy?” This matters when evaluation in the real environment is expensive, risky, or impossible (think logged interaction data).
The basic tool is importance sampling. Suppose you have episodes generated by behavior policy μ, but you want to estimate the expected return of target policy π. For a trajectory τ = (s0,a0,s1,a1,...), you weight its return G(τ) by the likelihood ratio:
w(τ) = ∏_t π(a_t|s_t) / μ(a_t|s_t)
Then an estimator is the average of w(τ) * G(τ). In practice, raw (ordinary) importance sampling can have huge variance because w(τ) can explode when μ assigns small probability to actions that π prefers. A common mitigation is weighted (normalized) importance sampling, dividing by the sum of weights to reduce variance (at the cost of some bias).
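Both estimators are one-liners once the per-episode likelihood ratios are computed. A minimal sketch, assuming each logged episode has already been reduced to a (weight, return) pair:

```python
def is_estimates(episodes):
    """episodes: list of (w, G) pairs, where w = prod_t pi(a_t|s_t)/mu(a_t|s_t)
    and G is the episode's observed return.
    Returns (ordinary IS, weighted IS) estimates of the target policy's value."""
    n = len(episodes)
    total_w = sum(w for w, _ in episodes)
    ordinary = sum(w * g for w, g in episodes) / n               # unbiased, high variance
    weighted = (sum(w * g for w, g in episodes) / total_w        # biased, lower variance
                if total_w > 0 else 0.0)
    return ordinary, weighted

# two hypothetical logged episodes: (likelihood ratio, observed return)
ois, wis = is_estimates([(2.0, 1.0), (0.5, 0.0)])
```

Here ordinary IS gives (2.0 · 1.0 + 0.5 · 0.0) / 2 = 1.0, while the weighted estimator gives 2.0 / 2.5 = 0.8—the normalization pulls the estimate toward the data, trading bias for variance exactly as described above.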
Practical guidance for tabular experiments: log the behavior policy's action probabilities alongside each trajectory so the ratios are computable later; avoid behavior policies that put near-zero probability on actions the target policy prefers, since those ratios explode; prefer the weighted estimator when episodes are long; and report estimates over many episodes, because a handful of large weights can dominate the average.
This section meets the milestone “apply off-policy evaluation basics and importance sampling” and gives you a reality check: off-policy learning is easy to write down, but off-policy evaluation is statistically delicate. When you later use replay buffers, the same mismatch issues appear—only disguised as “training instability.”
When Q-learning fails, it rarely fails loudly. You might get a flat learning curve, oscillating returns, or a policy that “works” in training but collapses at evaluation. A disciplined debugging approach turns RL from folklore into engineering. This section targets the milestone “diagnose failures (divergence, reward hacking, sparse rewards)” with a toolkit you can reuse throughout the course.
Log the right metrics. At minimum: episodic return (train and greedy-eval), episode length, ε value, mean/max Q-value, TD error statistics (mean absolute TD), and fraction of actions that are greedy vs random. If Q-values explode while returns do not improve, suspect too-large α, a bug in done handling, or maximization bias in a stochastic environment.
Use baselines. Always compare against (1) random policy, (2) a simple heuristic if available, and (3) SARSA with the same exploration schedule. If Q-learning underperforms SARSA consistently, the issue is often aggressive bootstrapping or overestimation; Double Q-learning is a targeted fix.
Ablations identify what matters. Change one thing at a time: fixed ε vs decayed ε, α sweep, reward clipping/scaling, removing terminal masking, swapping max with the Double Q target. Ablations prevent you from attributing gains to the wrong feature.
Common failure patterns and fixes: flat returns in sparse-reward tasks (extend or strengthen exploration before touching anything else); exploding Q-values (lower α, verify done masking, check reward scale); oscillating returns (decay ε and α, and average over several seeds before drawing conclusions); and reward hacking, where returns climb while behavior degenerates (watch actual rollouts, not just curves, and fix the reward specification).
Practical outcome: you can now treat your tabular Q-learning implementation as a testbed. If you can’t make it learn reliably in a small MDP with solid logs and baselines, scaling to DQN later will be painful. Conversely, if you can diagnose and fix these issues here, you will recognize the same patterns when function approximation enters the picture.
1. What is the key idea that makes Q-learning an off-policy control method in this chapter?
2. Why is maximization bias a concern in vanilla Q-learning?
3. In the chapter’s framing, what is the purpose of importance sampling in basic off-policy evaluation?
4. Which practice best matches the chapter’s guidance on tuning exploration schedules and learning rates?
5. Which set best reflects the chapter’s highlighted failure modes to diagnose when training goes wrong?
So far, you have learned how to compute or learn value functions in tabular settings: when states (and actions) can be enumerated and stored in arrays. Chapter 6 is where we cross the bridge to modern reinforcement learning: we replace tables with neural networks and train them with data generated by interaction. This shift unlocks problems with large or continuous observation spaces (images, sensor vectors, game RAM), but it also introduces new failure modes: instability, divergence, sensitivity to hyperparameters, and subtle bugs that “kind of work” but never reach strong performance.
This chapter is organized as a practical engineering workflow. First, you will build a neural Q-function approximator in PyTorch. Then you will assemble the canonical DQN training system: replay buffer plus target network. Next, you will implement correct targets and a training loop that respects the off-policy nature of Q-learning. Finally, you will stabilize learning with normalization, clipping, and schedules, and you will evaluate generalization and robustness across random seeds. We end with a map of what comes next (policy gradients and actor-critic), so you understand DQN’s place in the broader deep RL ecosystem.
As you read, keep a mental model of what can go wrong: (1) targets that depend on the same network you are updating, (2) correlated samples causing destructive gradient steps, (3) exploding Q-values and gradients, and (4) evaluation that accidentally measures “lucky seeds” instead of reliable learning. DQN is not hard because it is mathematically complex; it is hard because it is an engineered system with interacting parts.
Practice note for Milestone: Build a neural Q-function approximator in PyTorch: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Train a DQN with replay buffer and target network: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Stabilize learning with normalization, clipping, and schedules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Evaluate generalization and robustness across seeds: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Map next steps: policy gradients and actor-critic overview: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Tabular methods assume you can store one value per state (or per state-action pair). That works in Gridworld or small MDPs, but it collapses in realistic environments: the “state” might be an 84×84×4 image stack, a 17-dimensional robot observation vector, or a continuous pose. Even if the state space is technically finite, it is too large to visit often enough for a table to fill in. The practical symptom is sparse coverage: most entries are never updated, and generalization is impossible.
Function approximation solves this by learning a parameterized function, typically a neural network, that maps observations to values. For DQN, the network approximates the action-value function: Q(s, a; θ). Instead of a table row per state, we compute a vector of Q-values for all discrete actions given a state. This is the first milestone: build a neural Q-function approximator in PyTorch that takes an observation tensor and returns a Q-value per action.
Common mistakes: forgetting to add the batch dimension (leading to shape bugs), using a network with the wrong output size, and accidentally detaching tensors (breaking gradients). Another conceptual pitfall is expecting deep networks to “fix” poor exploration. Function approximation generalizes, but it cannot learn good values for regions of state space the agent never visits. So as you move beyond tables, you must treat data collection and exploration strategy as first-class design choices.
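A minimal sketch of such an approximator, assuming a flat observation vector and discrete actions (the architecture and sizes are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP mapping an observation vector to one Q-value per discrete action."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # output size = number of actions
        )

    def forward(self, obs):
        return self.net(obs)  # shape (B, n_actions)

q_net = QNetwork(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)           # note the explicit batch dimension
with torch.no_grad():             # no gradients needed for action selection
    q_values = q_net(obs)         # shape (1, 2)
    action = q_values.argmax(dim=1).item()
```

The shape assertions in the test below are exactly the kind of small checks that catch the batch-dimension and output-size bugs mentioned above before they become training mysteries.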
A working DQN agent is more than “Q-learning with a neural net.” The algorithm relies on three components that collectively stabilize learning: an online Q-network, a replay buffer, and a target network. Your second milestone is to train a DQN with replay buffer and target network, which means wiring these pieces together correctly and resisting the temptation to simplify away the parts that prevent divergence.
Online Q-network (parameters θ) is the model you update with gradient descent. It chooses actions via an exploration policy, usually ε-greedy: with probability ε pick a random action, otherwise pick argmax_a Q(s,a;θ). Keep action selection in torch.no_grad() and switch the network to eval mode if you use dropout or batch norm (most DQN baselines avoid them, but engineering reality varies).
Replay buffer stores transitions (s, a, r, s′, done). Sampling uniformly from this buffer breaks the short-term correlation present in sequential experience and improves data efficiency: each interaction can be replayed many times. Practical buffer details matter: store observations as compact dtypes (e.g., uint8 for pixels), pre-allocate arrays for speed, and track a circular pointer for constant-time insertion.
Target network (parameters θ⁻) is a periodically updated copy of the online network. Targets for Q-learning depend on max_a′ Q(s′,a′), and if you compute those targets from the same network you are updating, you get a moving target that can destabilize training. The target network “slows down” target drift. In practice you either (1) hard-update every N steps (θ⁻ ← θ), or (2) soft-update with Polyak averaging (θ⁻ ← τθ + (1-τ)θ⁻).
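The replay buffer's practical details (compact dtypes, pre-allocation, circular pointer) can be sketched as follows. This is a minimal NumPy version for vector observations, not a production implementation:

```python
import numpy as np

class ReplayBuffer:
    """Pre-allocated circular buffer for (s, a, r, s', done) transitions."""
    def __init__(self, capacity, obs_dim):
        self.capacity, self.ptr, self.size = capacity, 0, 0
        self.obs      = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions  = np.zeros(capacity, dtype=np.int64)
        self.rewards  = np.zeros(capacity, dtype=np.float32)
        self.dones    = np.zeros(capacity, dtype=np.float32)

    def push(self, s, a, r, s_next, done):
        i = self.ptr
        self.obs[i], self.actions[i], self.rewards[i] = s, a, r
        self.next_obs[i], self.dones[i] = s_next, float(done)
        self.ptr = (self.ptr + 1) % self.capacity       # constant-time circular insert
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=np.random):
        idx = rng.randint(0, self.size, size=batch_size)  # uniform over stored items
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])

buf = ReplayBuffer(capacity=1000, obs_dim=4)
for t in range(10):
    buf.push(np.ones(4) * t, t % 2, 1.0, np.ones(4) * (t + 1), t == 9)
batch = buf.sample(batch_size=8)
```

For pixel observations you would store uint8 arrays and convert to float only at sampling time; the circular-pointer logic stays the same.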
Handle done explicitly when computing targets; mistakes here silently corrupt them. When DQN “does nothing,” the usual culprits are (a) ε never decays (agent stays random), (b) the target network never updates (stale learning signal), or (c) replay sampling returns wrong shapes/types (training runs but gradients are nonsense). Treat these components as testable units: write small assertions for shapes, dtypes, and value ranges before you chase hyperparameters.
At the heart of DQN is the temporal-difference target for Q-learning. For a batch of transitions, the target for each sample is:
y = r + γ * (1 - done) * max_a' Q_target(s', a'; θ⁻)
The prediction is the online network’s Q-value for the action actually taken:
q = Q_online(s; θ).gather(1, a)
Then you minimize a regression loss between q and y. This section is where many implementations quietly go wrong, because the code “runs” even when the math is subtly incorrect. First, ensure a is shaped (B,1) and is a long tensor so that gather selects the correct column. Second, compute max_a' under torch.no_grad() so gradients do not flow into the target network or the target computation.
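Putting the target, gather, and loss together, a minimal sketch (the two `nn.Linear` layers stand in for real online and target networks; batch contents are dummies):

```python
import torch
import torch.nn as nn

def dqn_loss(q_online, q_target, s, a, r, s_next, done, gamma=0.99):
    """Huber loss between online Q(s,a) and the masked bootstrap target."""
    with torch.no_grad():  # no gradients flow through the target computation
        next_q = q_target(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * next_q
    # a must be long and shaped (B,1) so gather selects the taken action's column
    q = q_online(s).gather(1, a.long().view(-1, 1)).squeeze(1)
    return nn.functional.smooth_l1_loss(q, y)

net = nn.Linear(4, 2)   # stand-in online network: 4-dim obs, 2 actions
tgt = nn.Linear(4, 2)   # stand-in target network
B = 32
loss = dqn_loss(net, tgt,
                s=torch.randn(B, 4), a=torch.randint(0, 2, (B,)),
                r=torch.randn(B), s_next=torch.randn(B, 4),
                done=torch.zeros(B))
```

Note that `done` enters as a float tensor so the `(1.0 - done)` mask broadcasts cleanly; mixing bool/int dtypes here is one of the silent bugs discussed below.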
The training loop is typically organized around environment steps rather than episodes: collect one step of experience, push it to replay, and every k steps perform one or more gradient updates. This decoupling matters because updates are off-policy: you are learning about the greedy policy while behaving ε-greedily. A practical pattern is: warm up by filling the buffer with random actions before any updates; then, on each environment step, store the transition and decay ε; every k steps, sample a batch, compute targets, and take one gradient step; and every N steps, sync the target network.
Common mistakes: forgetting the (1 - done) mask (causes bootstrapping past terminal states), mixing float and int dtypes (especially for rewards and dones), and using the wrong discount γ (e.g., 0.0 or 1.0 by accident). Another subtle issue is reward timing: ensure the reward r corresponds to the transition from s to s′ via action a, not the next step’s reward due to a misplaced assignment.
Practical outcome: by the end of this section, you should be able to train a basic DQN on a small discrete-control benchmark (like CartPole) and see learning curves improve beyond random. It may not be perfectly stable yet—that is the point of the next section.
Deep RL is notoriously sensitive because small distribution shifts in experience can create large changes in targets, which creates large gradients, which changes the policy, which changes the data. Your third milestone is to stabilize learning with normalization, clipping, and schedules—simple tools that dramatically reduce “training collapse.”
Huber loss (a.k.a. Smooth L1) is the default loss for many DQN implementations because it behaves like MSE near zero error but becomes linear for large errors, reducing the influence of outliers. In PyTorch, this is torch.nn.SmoothL1Loss. If you see occasional huge TD errors (common early in training), Huber often prevents a single batch from destroying progress.
Gradient clipping caps the norm (or value) of gradients before the optimizer step. A typical choice is global norm clipping at 10.0. Clipping is not a substitute for fixing bugs, but it is a reliable guardrail when Q-values spike. Implement it right before optimizer.step() with torch.nn.utils.clip_grad_norm_.
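Both guardrails slot into a standard optimizer step. A minimal sketch with a stand-in network and dummy targets, just to show the ordering (loss → backward → clip → step):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)                      # stand-in for the online Q-network
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.SmoothL1Loss()                # Huber loss

q = net(torch.randn(8, 4))
target = torch.randn(8, 2)                 # dummy regression targets for illustration

optimizer.zero_grad()
loss = loss_fn(q, target)
loss.backward()
# clip the global gradient norm right before the optimizer step
grad_norm = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
optimizer.step()
```

`clip_grad_norm_` returns the pre-clipping norm, which is worth logging: a sudden spike in that curve is an early warning of the Q-value blowups this section is guarding against.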
Reward scaling and clipping control the magnitude of targets. In some domains (notably Atari), clipping rewards to [-1, 1] historically improved stability by limiting target variance. In other domains, reward clipping can remove meaningful scale. Engineering judgement: if rewards have heavy tails or occasional large spikes, consider clipping; if the reward scale encodes task difficulty (e.g., continuous control costs), prefer normalization or careful learning rates.
Common mistakes: applying normalization inconsistently between current states and next states, clipping rewards in evaluation (evaluation should reflect the true environment return), and using too-large batch sizes that hide instability until it is severe. Stability is not one magic trick; it is the cumulative effect of several small, defensible choices.
Once vanilla DQN works, you can usually get noticeable gains with three extensions. Each addresses a known weakness of the baseline, and each is worth understanding even if you do not implement all of them immediately.
Double DQN reduces overestimation bias. In standard DQN, the max operator uses the same values to select and evaluate actions, which tends to inflate Q-values. Double DQN splits selection and evaluation: choose the best action using the online network, but evaluate it using the target network:
a* = argmax_a Q_online(s',a;θ)
y = r + γ * (1 - done) * Q_target(s', a*; θ⁻)
This change is small in code but often big in stability, especially when rewards are noisy.
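The decoupling is a two-line change to the target computation. A minimal sketch (the `nn.Linear` layers are stand-ins for real online and target networks):

```python
import torch
import torch.nn as nn

def double_dqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    """Select a* with the online network, evaluate it with the target network."""
    with torch.no_grad():
        a_star = q_online(s_next).argmax(dim=1, keepdim=True)   # selection
        next_q = q_target(s_next).gather(1, a_star).squeeze(1)  # evaluation
        return r + gamma * (1.0 - done) * next_q

net, tgt = nn.Linear(4, 2), nn.Linear(4, 2)
y = double_dqn_target(net, tgt, r=torch.zeros(16),
                      s_next=torch.randn(16, 4), done=torch.ones(16))
```

Compare with the vanilla target, where `q_target` performs both the argmax and the evaluation; that coupling is exactly what lets correlated noise inflate the max.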
Dueling DQN changes the network architecture to separately estimate a state value V(s) and an advantage A(s,a), then combine them into Q(s,a). Intuition: in many states, the choice of action barely matters; learning V(s) can be easier than learning each Q(s,a) independently. Practically, dueling heads can speed learning on tasks with many similar actions.
Prioritized Experience Replay samples transitions with probability proportional to their TD error magnitude, focusing updates on “surprising” experiences. This can improve sample efficiency, but it introduces bias that must be corrected with importance-sampling weights. Engineering judgement: prioritized replay adds complexity (data structures, hyperparameters α and β). Use it when you are bottlenecked on environment interactions, not when you are still debugging the basic loop.
This section pairs naturally with the milestone of evaluating generalization and robustness across seeds. Deep RL curves are noisy; good practice is to report mean and variability across seeds, and to save checkpoints so you can inspect regressions when an “improvement” silently worsens stability.
DQN is a value-based method for discrete actions. Its core idea is: learn Q(s,a), then act greedily (with exploration). This breaks down when actions are continuous (you cannot argmax over infinite actions) or when stochastic policies are preferable (e.g., for entropy, exploration, or multi-modal behavior). The next family of methods optimizes the policy directly.
REINFORCE is the simplest policy-gradient algorithm: sample trajectories, compute returns, and push policy parameters in the direction that increases log-probability of actions that led to higher return. It is conceptually clean but high variance; in practice you almost always add a baseline (often a value function) to reduce variance.
Actor-critic combines an actor (policy) and a critic (value function). The critic estimates how good states or actions are, and the actor uses that estimate to update the policy with lower variance than pure REINFORCE. This is the bridge between what you already know (value learning) and policy optimization.
PPO (Proximal Policy Optimization) is a widely used actor-critic method that adds a practical constraint: policy updates should not change the policy too much in one step. PPO’s “clipped objective” is an engineering solution to a common deep RL issue: large, destructive policy updates. Compared to DQN, PPO tends to be easier to apply across many continuous-control tasks, but it is on-policy and can be less sample-efficient.
Practical outcome: by finishing this chapter, you should be able to implement and debug a DQN training system, stabilize it with standard tricks, and evaluate it responsibly across seeds. You should also understand why the field often transitions from DQN to actor-critic methods as tasks become more complex—and you will be ready to make that transition without losing the disciplined workflow you built here.
1. What key capability does Chapter 6 gain by replacing tabular value functions with neural networks?
2. Why does the canonical DQN system include both a replay buffer and a target network?
3. Which training-loop concern is emphasized as essential when implementing DQN targets?
4. Which set best matches the stabilization techniques highlighted in the chapter?
5. What is the main reason the chapter stresses evaluating across multiple random seeds?