Reinforcement Learning — Intermediate
Train, debug, and ship RL game agents in OpenAI Gym end to end.
This course is a short, technical, book-style path to building game AI using OpenAI Gym environments and modern reinforcement learning (RL) workflows. You’ll start by learning the Gym/Gymnasium API and building a clean experiment template you can reuse. Then you’ll progress from tabular methods (great for discrete games) to deep RL (for larger state spaces) and policy gradients (for continuous control). By the end, you’ll know how to train agents, debug instability, and evaluate results with the rigor needed to trust performance claims.
The emphasis is on “ship-ready” skills: reproducible runs, solid baselines, careful evaluation, and tooling that prevents wasted training time. Each chapter is structured like a book chapter with milestones and internal sections so you can follow a coherent progression and revisit topics as references.
Across the chapters, you’ll implement several agent families and the training infrastructure around them:
Chapter 1 establishes the API and tooling so everything you build later is consistent and testable. Chapter 2 gives you strong intuition with tabular RL and discrete games—this makes later deep RL behavior easier to interpret. Chapter 3 transitions to DQN and stability techniques used in real projects. Chapter 4 adds policy gradients so you can handle continuous control and stochastic decision-making. Chapter 5 turns your work into an engineering pipeline—wrappers, reward design, sweeps, checkpoints. Chapter 6 focuses on evaluation, robustness, and packaging your agent so results are reliable and usable.
If you’re ready to train your first Gym agent and build toward deep RL workflows, Register free to access the course. Want to compare options first? You can also browse all courses on Edu AI.
After completing this course, you’ll be able to choose an RL approach that matches an environment’s action space and observation complexity, implement it cleanly, and validate improvements with trustworthy evaluation. The result is a repeatable process for building game AI agents you can iterate on confidently.
Reinforcement Learning Engineer & Applied Researcher
Dr. Maya Chen builds RL systems for simulation-driven decision making, from game agents to robotics prototypes. She has led applied ML teams shipping training pipelines, evaluation harnesses, and reproducible experimentation workflows. Her teaching focuses on practical intuition, clean implementations, and reliable debugging techniques.
Before you write a learning algorithm, you need an environment you can trust. Reinforcement Learning (RL) experiments are unusually sensitive to small changes: library versions, random seeds, episode limits, and even how you log metrics can change conclusions. This chapter builds the “lab bench” you will use for the rest of the course: a working Gym/Gymnasium run loop, a clear understanding of observations and actions, and an experiment template that makes results reproducible and debuggable.
We will work with two families of environments: classic control (e.g., CartPole) and toy text tasks. Classic control gives fast iterations and visual intuition; toy text highlights discrete state/action patterns you’ll later exploit in tabular methods. Along the way, you’ll install the toolkit, run environments end-to-end, create a random-agent baseline with score tracking, and learn how to record and replay episodes for visual debugging. The goal is practical fluency: you should be able to open a new environment, inspect its spaces, run it deterministically, and produce an experiment folder that another person can reproduce.
One engineering mindset to adopt early: treat environment interaction code as production code. If your training loop “sort of works,” debugging later will be painful because learning curves already have high variance. Keep your environment interface clean, log everything you need, and establish baselines before you optimize.
Practice note for Install the RL toolkit and verify a working Gym run loop: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run and inspect classic control tasks and toy text environments: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a reproducible experiment template (configs, seeds, logging): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create your first baseline: random agent + score tracking: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint and replay episodes for visual debugging: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Historically, “OpenAI Gym” was the standard RL environment API. Today, the actively maintained fork is Gymnasium. Many tutorials still say import gym, while modern code often uses import gymnasium as gym. Conceptually, they do the same job: a catalog of environments with a standard interface so your agent code can swap games without rewriting the loop.
Think of an environment as a contract between “the world” and your agent. The contract is defined by a few methods and properties: env.reset() starts a new episode and returns an initial observation; env.step(action) advances the environment by one time step given an action; env.observation_space and env.action_space describe valid inputs/outputs; and optional render() displays a frame for humans.
Installation is boring but important: pin versions to avoid silent behavior changes. In a fresh virtual environment, install Gymnasium and common extras. For classic control, you typically need an extra dependency set (and system packages on some OSes). Your first verification step should not be “train a model,” but “run five episodes and print returns.” If that fails, fix it before proceeding.
- Create an environment with gym.make("CartPole-v1") and confirm that reset() and step() work without errors.

Finally, recognize that environments can be parameterized. Time limits, rendering modes, and reward variants can change the task. In research or team work, always write down the exact environment ID and version, plus any wrappers or configuration you applied. Your agent is only as comparable as the environment contract you ran it against.
Every Gym/Gymnasium environment declares two spaces: observation_space and action_space. These are not just metadata; they are how you prevent invalid actions, shape neural networks correctly, and design preprocessing. You should make it a habit to print them at the start of every run.
The two most common space types in game AI are Discrete(n) and Box(low, high, shape, dtype). A Discrete action space means the agent chooses an integer action in [0, n-1] (e.g., left/right). A Box action space means the agent outputs a continuous vector within bounds (e.g., steering angle and throttle). Observations follow the same pattern: toy text tasks often use Discrete observations (a state index), while classic control often uses a Box observation (a vector of floats like position and velocity).
This matters immediately for algorithm choice. Tabular Q-learning and SARSA (coming in a later chapter) require discrete state/action representations, or a discretization strategy. Deep methods like DQN typically assume Discrete actions but can handle Box observations by feeding them into a neural network. Policy gradient methods are often used when the action space is Box, because the policy can naturally parameterize continuous distributions.
- Sample valid actions with action = env.action_space.sample() and valid observations with obs = env.observation_space.sample().
- Don't treat a Box observation as unbounded. Use the declared bounds to reason about normalization, clipping, and numeric stability.

Toy text environments are especially good for building intuition about state indexing and dynamics. Even when the observation is an integer, you still need to understand what it represents (a cell in a grid, a room in a map, etc.). Don’t guess—inspect the environment documentation, and if possible, print decoded states. This habit pays off later when you debug reward shaping or unexpected agent behavior.
The heart of RL engineering is the step loop. In Gymnasium, reset() returns (obs, info), and step(action) returns (obs, reward, terminated, truncated, info). This is a key difference from older Gym, which used a single done flag. You should treat terminated as “the task ended naturally” (success/failure terminal condition) and truncated as “the episode was cut off” (time limit or external constraint).
A correct run loop checks terminated or truncated to end an episode. This is not pedantry: mixing them can distort learning curves and evaluation. If your environment has a time limit wrapper, it will often set truncated=True at the step limit. For evaluation, you may want to report both: success rate (often tied to termination conditions) and average return (affected by truncation).
Start by running environments without any learning, just to ensure you understand their dynamics. Run a few episodes on a classic control task (like CartPole) and a toy text task, printing per-episode return and length. Add info printing sparingly; it’s useful for debugging but can become noisy.
- Don't ignore truncated. This can cause episodes to “never end” in your code if you only check terminated.

Finally, rendering: many environments support render_mode="human" for interactive display or render_mode="rgb_array" for frames you can save. Rendering inside the training loop can drastically slow experiments; prefer rendering only during evaluation or when recording a small number of episodes for debugging.
Reproducibility is not optional in RL. Learning curves are noisy, and without careful control you can “prove” anything. The first tool is seeding: initialize all random number generators so that reruns produce the same sequence of stochastic choices—at least as much as the environment and hardware allow.
In Gymnasium, you typically seed at reset(seed=...) for environment dynamics. You should also seed your action sampling (if using numpy/python random) and your ML framework (e.g., PyTorch). A minimal reproducible template sets a single integer seed in config, then derives secondary seeds for train/eval splits if needed.
Determinism has limits. Some environments contain hidden nondeterminism; GPU operations can be nondeterministic depending on kernels; multi-threading can reorder operations. The practical goal is not mathematical determinism everywhere, but controlled variability: the ability to rerun a specific experiment and get the same result distribution, and the ability to average across multiple seeds for robust comparisons.
Build a habit of logging: environment ID, library versions, seed(s), episode limit, and any wrappers. When something looks “too good,” reproducibility metadata is how you determine whether it’s a real improvement or an accidental change in settings. This chapter’s experiment template will make these fields first-class, not an afterthought.
Wrappers are one of the most powerful features of the Gym ecosystem: they let you modify observations, rewards, actions, episode lengths, and logging without changing the underlying environment. In game AI projects, wrappers are where you put “environment engineering” so your agent code stays clean and reusable.
Common wrapper uses include: clipping or normalizing observations, stacking frames for visual tasks, discretizing continuous actions (with care), and shaping rewards. Even simple monitoring is typically implemented as a wrapper: record episode returns and lengths, write them to disk, and optionally capture videos. Gymnasium provides utilities like RecordVideo (and monitoring tools) that can automatically save episode footage to a directory—crucial for visual debugging when an agent exploits a loophole in your reward function.
When recording, decide on a policy: record only evaluation episodes (e.g., every N training iterations), or record the first K episodes for sanity checks. Recording every episode can create huge files and slow training. Use deterministic evaluation seeds so videos are comparable across runs.
- Record with render_mode="rgb_array" and a video recorder wrapper, then replay saved videos when investigating failures.

Wrappers also help you standardize observation preprocessing. For example, even for vector observations you might want to cast dtypes consistently (float32), clip outliers, or concatenate additional signals. Keep wrappers small and composable, and document them in your experiment config so others can reproduce the exact environment pipeline.
Before implementing Q-learning, DQN, or policy gradients, you need a baseline. The simplest baseline is a random agent: sample actions from env.action_space and track episode returns. This baseline is deceptively valuable. It verifies your loop, your logging, and your metrics, and it provides a sanity check: if your learned agent performs worse than random, something is wrong (or the reward is misleading).
Score tracking should be treated as part of the experiment, not console spam. Store per-episode return, episode length, and termination/truncation counts. Save a summary (mean return over last 100 episodes, etc.) so you can compare runs. Even in Chapter 1, start practicing evaluation discipline: separate training episodes (where exploration happens) from evaluation episodes (fixed policy, no learning) where you measure performance.
A practical experiment folder structure keeps outputs organized and reproducible. A common pattern is one directory per run containing the config (environment ID and version, seeds, hyperparameters, wrappers), per-episode metric logs, checkpoints, and any recorded videos.
Checkpointing is not only for neural networks. For debugging, it’s useful to save enough information to “replay” an episode: the seed, the action sequence, and (optionally) the observed transitions. When a bug appears—an unexpected reward spike, a sudden performance collapse—you can reload the exact episode and step through it. This practice turns RL from guesswork into engineering.
By the end of this chapter, you should have a working template that: (1) installs and runs Gym/Gymnasium environments, (2) inspects spaces, (3) executes a correct step loop, (4) controls seeds, (5) logs metrics to disk, and (6) records and replays episodes. That foundation will let you focus on learning algorithms in later chapters instead of fighting tooling.
1. Why does Chapter 1 emphasize building a trusted, reproducible Gym run loop before implementing any learning algorithm?
2. What is the main purpose of using both classic control and toy text environments in this chapter?
3. Which set of practices best matches the chapter’s idea of a reproducible experiment template?
4. Why does the chapter recommend creating a random-agent baseline with score tracking early on?
5. What is the practical debugging value of checkpointing and replaying episodes?
In Chapter 1 you learned how to step through a Gym/Gymnasium environment and collect transitions. Now we turn that stream of experience into a working game-playing agent using tabular reinforcement learning. “Tabular” means we explicitly store values in arrays or dictionaries keyed by state (or state–action) pairs. This is the right tool when the environment has a small, discrete state space (e.g., FrozenLake, Taxi, Blackjack, small grid worlds), and it is also the best way to build correct intuition before you scale up to deep networks.
This chapter is deliberately engineering-focused: you will frame a Gym environment as a Markov Decision Process (MDP), implement two classic learning algorithms (Q-learning and SARSA), tune exploration schedules to improve sample efficiency, and export a learned policy for evaluation episodes. Along the way we’ll emphasize reproducibility (seeds, deterministic evaluation), and the common pitfalls that make early RL code “seem to work” while actually learning the wrong thing.
We will use Gymnasium-style APIs in the examples. If you’re using classic Gym, the ideas are identical; only the return signatures of reset() and step() differ slightly.
Practice note for Model an MDP from a Gym environment (states, actions, rewards): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement ε-greedy Q-learning and train on a discrete task: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement SARSA and compare learning behavior to Q-learning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Tune exploration schedules and measure sample efficiency: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Export a learned policy and run evaluation episodes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Model an MDP from a Gym environment (states, actions, rewards): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement ε-greedy Q-learning and train on a discrete task: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement SARSA and compare learning behavior to Q-learning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Tune exploration schedules and measure sample efficiency: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Every Gym environment can be viewed as an MDP defined by (S, A, P, R, γ): states, actions, transition dynamics, rewards, and discount factor. In practice you rarely know P explicitly; you sample it by calling env.step(action). The key engineering step is deciding what you will treat as “state” for tabular learning. If the observation is already a small integer (e.g., Discrete(n)), you can index directly. If the observation is a tuple (like Taxi’s row/col/passenger/destination) you can often map it to a single integer via the environment’s built-in encoding or a custom hashing function.
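One way to build such an encoding: np.ravel_multi_index gives a deterministic tuple-to-index mapping. The (row, col, passenger, destination) dimensions below mirror Taxi-style state purely for illustration; encode and decode are hypothetical helper names:

```python
# Deterministic tuple -> integer state index for tabular methods.
import numpy as np

DIMS = (5, 5, 5, 4)   # rows, cols, passenger locations, destinations

def encode(obs_tuple):
    """Flatten a structured observation into one table index."""
    return int(np.ravel_multi_index(obs_tuple, DIMS))

def decode(index):
    """Invert encode(); useful for printing human-readable states."""
    return tuple(int(x) for x in np.unravel_index(index, DIMS))

s = encode((3, 1, 2, 0))
assert decode(s) == (3, 1, 2, 0)     # round-trips exactly
assert 0 <= s < np.prod(DIMS)        # fits a Q table of shape (500, nA)
```

Unit-testing the round trip like this catches encoding bugs before they silently corrupt your Q table.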
Gymnasium exposes the information you need to model the MDP:
- env.observation_space tells you whether the observation is discrete or needs encoding.
- env.action_space.n gives the number of discrete actions for tabular Q tables.
- reset(seed=...) and action_space.seed(...) support reproducible experiments.

Value functions translate “game scoring over time” into something we can optimize. The state-value function V(s) is the expected discounted return from state s, and the action-value function Q(s,a) is the expected discounted return from taking action a in s and following a policy thereafter. In games, sparse rewards are common (e.g., +1 for reaching the goal, 0 otherwise). Tabular methods cope well as long as the state space is manageable and you explore enough.
Practical judgement: prefer Q(s,a) for control (learning to act), because selecting actions via argmax_a Q(s,a) is straightforward. Keep γ consistent with the task: for episodic games with a time limit, values like 0.95–0.99 are common. Setting γ too low makes the agent shortsighted; too high can slow learning when rewards are very delayed.
Common mistake: treating high-dimensional observations (images, continuous vectors) as tabular states. If observation_space is not Discrete (or a small MultiDiscrete), tabular methods will explode in size and effectively never revisit the same state. In that case you need function approximation (Chapter 3), but for now choose environments designed for discrete tabular learning.
The Bellman equations are the “accounting identity” of RL: the value of a situation equals immediate reward plus the discounted value of what comes next. For action-values, the optimal Bellman equation is:
Q*(s, a) = E[ r + γ · max_{a'} Q*(s', a') ]
In game terms, imagine you are choosing a move now (a) and you assume you will play optimally afterward. The max represents “best possible future play.” This is why Q-learning is often described as learning from imagined optimal futures, even if your current behavior is exploratory or messy.
This equation also clarifies what “learning” does in code: each step you get a sample (s,a,r,s') and you adjust your table entry Q[s,a] toward a target that mixes the observed reward and your current estimate of future best play. You are not computing exact expectations; you are doing stochastic approximation with a learning rate α. Over many samples, the noisy targets average out.
Engineering judgement: the Bellman target can be unstable if you bootstrap from bad estimates too aggressively. This is why the learning rate matters and why you should watch learning curves over multiple random seeds. Also be careful with terminal transitions: if the episode ends at s', there is no “future,” so your target should drop the bootstrap term (treat max Q(s',·) as 0). Failing to handle terminal states is a classic bug that silently prevents convergence.
When you interpret rewards as “game score,” remember that the agent maximizes discounted return, not necessarily the final score you intuitively care about. If the environment gives a living penalty or time penalty, the agent may learn to finish quickly even with lower final reward. That’s not wrong—it’s exactly what the MDP specifies. If your learned behavior surprises you, first re-check the reward function and termination conditions rather than immediately tuning hyperparameters.
Q-learning is the standard starting point for discrete control because it is simple and off-policy: you can behave with an exploratory policy while learning the greedy policy implied by your Q table. A minimal training loop needs: a Q table, an action-selection rule (ε-greedy), and the Q-learning update.
Core update for a non-terminal transition:
td_target = r + gamma * Q[s_next].max()
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error
In Gymnasium, step returns (obs, reward, terminated, truncated, info). Treat the transition as terminal if terminated or truncated is true. That distinction matters: terminated means the environment’s terminal condition; truncated means a time limit. For learning, both usually mean “no bootstrap beyond this step” unless you explicitly want time-limit bootstrapping (an advanced topic).
- For Discrete(nS) observations and Discrete(nA) actions, use Q = np.zeros((nS, nA), dtype=np.float32).
- Call env.reset(seed=seed) at the start of each run, and seed env.action_space too.

Pitfall: mixing up observation types. Some environments return obs as an integer state index; others return numpy arrays or tuples. If you do Q[obs] and obs is an array, you’ll get wrong indexing or errors. Make state handling explicit: if the observation space is not a plain Discrete, create a deterministic encoding function and unit-test it on a few samples.
Pitfall: evaluating during training with ε still active. Q-learning can look “bad” if you keep exploring in evaluation episodes. Separate concerns: during training you use ε-greedy; during evaluation set ε=0 and run greedy actions only. Also log both episodic return and episode length. In sparse-reward tasks, length can reveal that the agent is learning to avoid failure (or exploit time limits) even before rewards increase.
SARSA (State–Action–Reward–State–Action) looks similar to Q-learning but updates toward the value of the action you actually take next, not the greedy best action. The update is:
td_target = r + gamma * Q[s_next, a_next]
where a_next is selected by your current behavior policy (often ε-greedy). This makes SARSA an on-policy algorithm: it learns the value of the policy it is executing. In game AI terms, SARSA learns to be good while still making occasional exploratory mistakes, and it internalizes the consequences of those mistakes.
When does that matter? In “cliff-like” environments where a single wrong move causes catastrophic loss, Q-learning can learn a risky policy because its target assumes perfect greedy behavior after the current step. SARSA tends to learn safer paths when ε is non-zero, because it expects that exploration might push it into danger. This difference is not philosophical; it shows up in learning curves and final behavior.
Implementation detail: SARSA requires choosing a_next before you compute the update, which slightly changes your step loop:
1. Select a from s (ε-greedy).
2. Step the environment to get s_next, r, done.
3. Select a_next from s_next (ε-greedy) unless done.
4. Update Q[s,a] toward r + gamma * Q[s_next, a_next] (or just r if done).
5. Set s = s_next, a = a_next and continue.

Common mistake: forgetting to handle terminal states and still sampling a_next, which can index garbage or incorrectly bootstrap. Another mistake: comparing SARSA and Q-learning with different ε schedules. If you want a fair comparison of learning behavior, run both with the same alpha, gamma, and identical exploration schedule, and average results over multiple seeds.
Practical outcome: SARSA is a great diagnostic tool. If Q-learning “learns” but produces brittle behavior, implement SARSA and see whether on-policy learning stabilizes training. If SARSA is stable but underperforms, it often points to an exploration schedule that is too aggressive late in training.
Exploration is the difference between an agent that memorizes a lucky trajectory and an agent that reliably wins. In tabular RL, ε-greedy is the default: with probability ε choose a random action; otherwise choose argmax_a Q[s,a]. The engineering question is not “should we explore?” but “how much, for how long, and how do we measure sample efficiency?”
A practical schedule starts with high exploration and decays it. Two common choices:
- Linear decay: anneal ε from eps_start to eps_end over N steps/episodes. Easy to reason about and reproduce.
- Exponential decay: ε(t) = eps_end + (eps_start - eps_end)*exp(-t/τ). Decays quickly at first, then slowly.

Measure sample efficiency by plotting average return versus environment steps (not episodes). This avoids misleading comparisons when different runs produce different episode lengths. Also log ε over time so you can correlate “performance jumps” with exploration turning down.
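Both schedules as small, testable functions (the defaults for eps_start, eps_end, N, and tau are illustrative knobs):

```python
# Two ε schedules: linear anneal with a floor, and exponential decay.
import math

def linear_eps(t, eps_start=1.0, eps_end=0.05, N=10_000):
    """Linear decay from eps_start to eps_end over N steps, then flat."""
    frac = min(t / N, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exp_eps(t, eps_start=1.0, eps_end=0.05, tau=2_000):
    """Exponential decay: fast at first, asymptoting to eps_end."""
    return eps_end + (eps_start - eps_end) * math.exp(-t / tau)

assert abs(linear_eps(0) - 1.0) < 1e-9       # starts fully exploratory
assert abs(linear_eps(10_000) - 0.05) < 1e-9  # reaches the floor at N
assert abs(exp_eps(0) - 1.0) < 1e-9           # same starting point
```

Keeping the schedule as a pure function of t makes it trivial to log ε alongside returns.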
Softmax (Boltzmann) exploration is another option: choose actions with probability proportional to exp(Q[s,a]/T) where T is a temperature. High T is exploratory; low T is near-greedy. Softmax can be smoother than ε-greedy in environments with many actions, but it is sensitive to Q scale: if rewards are large, exp(Q/T) can overflow or collapse to near-deterministic too early. If you use softmax, normalize or clip Q values, or tune temperature carefully.
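A numerically stable version of this rule; subtracting the maximum logit before exponentiating is the standard trick to avoid overflow (softmax_action is an illustrative helper name):

```python
# Boltzmann action selection with a temperature, numerically stabilized.
import numpy as np

def softmax_action(q_values, temperature, rng):
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    logits -= logits.max()          # exp of non-positive values cannot overflow
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
_, p_hot = softmax_action([1.0, 2.0, 3.0], temperature=10.0, rng=rng)  # near-uniform
_, p_cold = softmax_action([1.0, 2.0, 3.0], temperature=0.1, rng=rng)  # near-greedy
# A naive exp(Q/T) would overflow on Q values this large; this version stays finite.
_, p_big = softmax_action([1e6, 2e6], temperature=1.0, rng=rng)
```

The temperature plays the role ε plays in ε-greedy: high T explores, low T exploits.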
Common mistakes:
- With tied Q values, argmax may always pick the first action, reducing effective exploration. Randomly break ties when exploiting.

A robust workflow is: pick a baseline ε schedule, train for a fixed step budget, evaluate greedily every K steps, and then adjust decay based on whether improvement stalls early (too little exploration) or remains noisy late (too much exploration). This tuning discipline carries directly into DQN later, where exploration schedules are even more consequential.
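A small randomized-argmax helper of this kind (random_argmax is an illustrative name):

```python
# Randomized argmax: among all maximal actions, pick uniformly.
import numpy as np

def random_argmax(q_row, rng):
    best = np.flatnonzero(q_row == q_row.max())   # indices of all tied maxima
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q = np.array([1.0, 1.0, 0.5])                     # actions 0 and 1 are tied
picks = {random_argmax(q, rng) for _ in range(100)}
# np.argmax(q) would always return 0 here; random_argmax visits both ties.
```

This matters most early in training, when a freshly zero-initialized Q table makes every action a tie.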
Training curves are not evaluation. During training you inject randomness (ε-greedy) and continuously change the policy. Evaluation should answer a narrower question: “How well does the learned policy perform when executed greedily?” To do that, freeze learning (alpha=0 or no updates), set ε=0, and run a batch of episodes with controlled seeds.
A solid protocol for discrete tabular agents includes:
- a fixed number of evaluation episodes with fixed, documented seeds;
- greedy action selection (ε = 0) with learning disabled;
- reporting mean return, episode length, and their spread across episodes.
Exporting a learned policy is straightforward with tabular methods: you can store either the full Q table (for later fine-tuning) or the greedy policy pi[s] = argmax_a Q[s,a]. For portability, save as a NumPy file (.npy) or a compact JSON (state → action). When you reload, ensure the state encoding is identical; mismatched encodings are a frequent source of “it worked yesterday” failures.
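A sketch of both export formats, using a toy two-state table so the round trip can be checked (file names and the temporary directory are illustrative):

```python
# Export sketch: full Q table as .npy, greedy policy as JSON.
import json
import tempfile
from pathlib import Path
import numpy as np

Q = np.array([[0.1, 0.9], [0.8, 0.2]], dtype=np.float32)  # toy 2-state table

out = Path(tempfile.mkdtemp())
np.save(out / "q_table.npy", Q)                 # keeps values for fine-tuning
policy = {int(s): int(np.argmax(Q[s])) for s in range(Q.shape[0])}
(out / "policy.json").write_text(json.dumps(policy))       # compact state -> action

# Reload and verify: the state encoding must be identical on both sides.
Q2 = np.load(out / "q_table.npy")
policy2 = {int(k): v for k, v in
           json.loads((out / "policy.json").read_text()).items()}
assert (Q2 == Q).all() and policy2 == policy
```

JSON keys are always strings, so the reload converts them back to integer state indices; forgetting that conversion is a classic "it worked yesterday" failure.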
For evaluation runs, consider using wrappers like RecordEpisodeStatistics to capture returns and lengths, and RecordVideo for a small number of qualitative rollouts. Keep video separate from metrics: recording can slow environments and change timing, which matters in some tasks.
Finally, treat truncation carefully. If episodes are truncated due to a time limit, high returns might reflect “survival” rather than goal-reaching. Report both return and success rate when the environment provides a success signal (sometimes in info), and consider adding task-specific metrics such as steps-to-goal. With these protocols, your tabular agents become reliable experimental baselines—and a trustworthy foundation for the deep RL methods that follow.
1. Why is tabular reinforcement learning an appropriate approach for environments like FrozenLake, Taxi, or small grid worlds?
2. When modeling a Gym environment as an MDP for tabular RL, which components must you explicitly identify from interaction with the environment?
3. What key understanding outcome does the chapter emphasize when comparing Q-learning and SARSA?
4. What is the main reason the chapter stresses tuning exploration schedules (e.g., how ε changes over time)?
5. Which evaluation practice best matches the chapter’s guidance on avoiding misleading conclusions about a learned policy?
Tabular Q-learning works well when the state space is small and enumerable. Game environments quickly break that assumption: observations may be vectors (positions, velocities), grids, or pixels, and the number of distinct states becomes effectively infinite. A Deep Q-Network (DQN) replaces the Q-table with a neural network that estimates Q(s, a) for every available action from an observation s. This chapter builds DQN in the way you will actually engineer it: start with a neural Q-network, then add experience replay, then stabilize training with a target network, and finally learn to evaluate action-values and detect overestimation.
DQN is not “just Q-learning with a neural net.” The combination of bootstrapping (targets depend on the network), off-policy learning (data from older policies), and function approximation (neural networks) is notoriously unstable. Good DQN implementations are mostly about controlling that instability: ensuring the data distribution is well-behaved, targets are not moving too fast, and exploration doesn’t collapse.
Throughout this chapter, assume a discrete-action Gym/Gymnasium environment (for example CartPole-v1 or a small gridworld). Your workflow should be reproducible: set seeds, log metrics per episode, keep evaluation separate from training, and treat hyperparameters as versioned configuration rather than “tweaks.”
The goal is not only to get a working agent, but to develop engineering judgment: when training curves look wrong, you should know which component is the likely cause and what to change first.
Practice note for Build a neural Q-network and replace the Q-table: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add experience replay and stabilize training: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Introduce a target network and reduce divergence: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement action-value evaluation and overestimation checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Train a DQN agent end-to-end and benchmark performance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
In tabular Q-learning, you store Q[s, a] explicitly and update it with the Bellman target. With high-dimensional observations, you cannot index a table, so you approximate Q(s, a) with a function Qθ(s, a) parameterized by neural network weights θ. Practically, your network will take an observation tensor and output a vector of size n_actions, where entry i is the estimated return for action i.
The core target remains: y = r + γ * max_a' Q(s', a') for non-terminal transitions. The difference is how you apply it: you compute a predicted value for the chosen action Qθ(s, a), then minimize a regression loss between that scalar and the target y. This is the DQN learning step. You will implement it with a deep learning framework (PyTorch is typical): forward pass, compute loss, backprop, optimizer step.
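The target computation is easy to get subtly wrong, so here is a NumPy sketch of just that piece (in a real DQN step you would compute it under `torch.no_grad()` and feed it to a loss against Qθ(s, a); the function name is illustrative):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q(s', a').

    next_q_values: array [batch, n_actions] from the Q-network.
    dones: 1.0 for terminal transitions, which zero out the bootstrap term.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    bootstrap = np.max(next_q_values, axis=1)   # greedy value at s'
    return rewards + gamma * (1.0 - dones) * bootstrap
```

The `(1 - done)` factor is the terminal-handling rule the replay-buffer section relies on: for terminals, y = r.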
Engineering judgment: treat the network as a drop-in replacement for a table only at the interface level. Internally, the network introduces new failure modes (scale sensitivity, gradient explosion, feature dependence). Two practical habits help immediately: (1) normalize or at least clip observations when ranges are extreme; (2) clip rewards to a small range when using pixel-based or sparse-reward games, because large reward magnitudes can produce unstable Q scales.
Common mistake: training “online” on consecutive transitions without replay. With tabular methods this may still work; with neural approximators it often diverges because consecutive samples are highly correlated and targets move too quickly. You will fix this in Section 3.3, but keep it in mind: a naive neural Q-learning loop is usually unstable even in simple tasks.
For vector observations (like CartPole-v1), a small multilayer perceptron (MLP) is enough. A practical starting point is two hidden layers of 128 units with ReLU activations. Your output layer should be linear (no activation) with shape [batch, n_actions]. Avoid making the network too large at first: bigger networks can overfit the replay buffer early and produce sharper, less stable Q estimates.
For grid-like observations (e.g., a 2D board state), a shallow convolutional network (ConvNet) can help. Still, keep your first implementation simple: one or two convolution layers followed by a small fully connected head. The point of this chapter is DQN mechanics (replay, targets, evaluation), not squeezing performance through architecture search.
Practical interface details:
- Cast observations to float32; if using pixels, scale to [0, 1] and consider grayscale and frame stacking via wrappers.
- Select actions with argmax(Q(s, ·)) during exploitation.

Engineering judgment: initialization and activation choices matter more than you might expect. ReLU is standard; if you observe “dead” neurons (loss stops improving and Q-values become flat), try LeakyReLU or reduce the learning rate. Also consider using LayerNorm for non-stationary observation scales in some game tasks, but introduce it only if you have evidence (unstable gradients or sensitivity to observation magnitude).
Common mistake: using a softmax on outputs because it “feels like a policy.” DQN outputs action-values, not probabilities. Softmax will distort the Bellman regression problem and typically breaks learning.
Experience replay is the practical upgrade that makes DQN trainable. Instead of updating on the most recent transition, you store transitions (s, a, r, s', done) in a buffer and sample random minibatches. This breaks temporal correlations and allows you to reuse older experiences, improving sample efficiency and stability.
A robust replay buffer implementation is mostly careful bookkeeping. Use a fixed-capacity ring buffer so memory does not grow unbounded. Store observations efficiently (NumPy arrays are common) and convert sampled batches to tensors only when training. Include done flags so terminal transitions zero out the bootstrap term: for terminals, y = r; otherwise y = r + γ * max Q(s', ·).
Sampling strategy is an engineering choice. Uniform sampling is simplest and should be your default until you diagnose a problem. If learning is slow because rewards are sparse, prioritized replay can help by sampling “surprising” transitions more often, but it adds bias corrections (importance sampling weights) and more moving parts to debug. In a course setting, you should first get a clean uniform replay implementation that logs: buffer size, average TD error, and loss.
Common mistake: storing references to mutable arrays (especially when environments reuse buffers). Always store a copy of observations when pushing to the buffer, or you may train on overwritten states and see inexplicable instability.
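A minimal fixed-capacity ring buffer sketch, following the bookkeeping rules above (preallocated NumPy storage, done flags, and assignment into the arrays, which copies the data and so guards against environments that reuse observation buffers). The class name and layout are one reasonable choice, not a library API.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-capacity ring buffer for (s, a, r, s', done) transitions."""

    def __init__(self, capacity, obs_dim, seed=0):
        self.capacity = capacity
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)
        self.idx = 0       # next write position
        self.size = 0      # number of valid entries
        self.rng = np.random.default_rng(seed)

    def push(self, s, a, r, s_next, done):
        # Assigning into the preallocated array copies the values, so a
        # later in-place mutation of `s` by the environment cannot corrupt
        # stored transitions.
        self.obs[self.idx] = s
        self.next_obs[self.idx] = s_next
        self.actions[self.idx] = a
        self.rewards[self.idx] = r
        self.dones[self.idx] = float(done)
        self.idx = (self.idx + 1) % self.capacity   # ring: overwrite oldest
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        i = self.rng.integers(0, self.size, size=batch_size)  # uniform
        return (self.obs[i], self.actions[i], self.rewards[i],
                self.next_obs[i], self.dones[i])
```

Batches come back as NumPy arrays; convert to tensors only at training time, as recommended above.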
Even with replay, DQN can diverge because the target depends on the same network you are updating. The fix is a target network: maintain a second set of parameters θ^- (often called q_target) used only to compute the bootstrap term. Your online network θ (called q_online) is optimized; the target network is updated more slowly.
The target becomes: y = r + γ * (1 - done) * max_a' Q_{θ^-}(s', a'). This reduces the “chasing a moving target” problem. You then minimize the loss between Q_{θ}(s, a) and y. In practice, use Huber loss (a.k.a. smooth L1) rather than MSE: it is less sensitive to occasional large TD errors and often prevents unstable gradients early in training.
Two update schemes are common:
- Hard update: every N environment steps (e.g., 1,000–10,000), copy parameters: θ^- ← θ.
- Soft (Polyak) update: θ^- ← τθ + (1-τ)θ^- with small τ (e.g., 0.005).
Compute targets under no_grad() so you do not backprop through the target network.

Action-value evaluation matters here: log both max_a Q(s, a) and the average Q for sampled states. If Q-values drift upward without corresponding reward improvements, you may have overestimation (addressed next section) or unstable targets (target update too frequent, learning rate too high).
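Both update schemes are a few lines each. This sketch operates on plain dictionaries of NumPy arrays standing in for network parameters; with PyTorch you would iterate over `state_dict()` entries instead.

```python
import numpy as np

def hard_update(target_params, online_params):
    """theta^- <- theta: copy every parameter exactly."""
    for name in target_params:
        target_params[name] = online_params[name].copy()

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta^- <- tau*theta + (1 - tau)*theta^-."""
    for name in target_params:
        target_params[name] = (tau * online_params[name]
                               + (1.0 - tau) * target_params[name])
```

With soft updates, small τ means the target drifts slowly toward the online network, which is exactly the "targets not moving too fast" property the section is after.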
Common mistake: updating the target network too often (e.g., every step) while also using high learning rates. This collapses the benefit of having a separate target and can make training as unstable as the naive version.
DQN failure is usually visible in logs before it is visible in returns. The two most common patterns are (1) Q-values explode while episode return stagnates or degrades, and (2) exploration collapses early, producing a policy that repeats a suboptimal behavior forever.
Exploding Qs often come from a combination of bootstrapping and function approximation: a slight overestimate at s' becomes a larger target at s, and the error compounds. To detect this, track statistics of predicted Q on a fixed evaluation batch of states sampled from the replay buffer. If the mean/max Q grows steadily while rewards do not, act immediately: lower learning rate, increase target update interval, use Huber loss, clip gradients (e.g., max norm 10), and consider reward clipping.
Overestimation checks: DQN uses max over noisy estimates, which biases values upward. A practical diagnostic is to compare Q estimates against empirical discounted returns from short rollouts (not perfect, but can reveal massive inflation). Another check is to log the gap between the best and second-best action-values; if the gap becomes huge early, the network may be confidently wrong. The standard algorithmic fix is Double DQN (select action with online net, evaluate with target net), but even if you don’t implement it yet, recognizing overestimation is important engineering skill.
Dead exploration usually comes from an epsilon schedule that decays too fast, or from training beginning before the replay buffer contains diverse behavior. If epsilon drops quickly, the policy becomes near-greedy on a partially trained network and stops discovering better trajectories. Fixes: increase epsilon_decay_steps, keep a higher minimum epsilon (e.g., 0.05–0.1 for simple tasks), and delay learning until after a warm-up collection phase.
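The fixes above (warm-up phase, slower decay, nonzero floor) combine naturally into one schedule function. This is a sketch with illustrative default values, not a prescribed configuration:

```python
def linear_epsilon(step, warmup_steps=1_000, decay_steps=50_000,
                   eps_start=1.0, eps_end=0.05):
    """Fully random during warm-up collection, then linear decay to a floor."""
    if step < warmup_steps:
        return eps_start
    frac = min(1.0, (step - warmup_steps) / decay_steps)
    return eps_start + frac * (eps_end - eps_start)
```

Plotting this schedule alongside episode return (as suggested in the benchmarking section) makes a too-fast decay immediately visible.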
Common mistake: evaluating performance using the same epsilon-greedy policy as training. Your benchmark should use a separate evaluation loop with a fixed low epsilon (often 0.0) and no learning, run across multiple seeds for reliability.
To train a DQN end-to-end and benchmark it, you need a disciplined configuration and a clear view of training curves. A minimal but practical setup for a vector-observation environment (like CartPole-v1) might look like: discount γ=0.99, replay capacity 100,000, batch size 64, learning rate 1e-3 or 5e-4, target hard update every 2,000 steps, gradient clipping at 10, and an epsilon schedule from 1.0 to 0.05 over 50,000–200,000 steps. Use an Adam optimizer and Huber loss.
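Treating these hyperparameters as "versioned configuration rather than tweaks" can be as simple as a frozen dataclass serialized next to each run's checkpoints. The field names below are one possible layout; the values mirror the starting point in the text.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class DQNConfig:
    """Versioned hyperparameters for a vector-observation DQN run."""
    gamma: float = 0.99
    replay_capacity: int = 100_000
    batch_size: int = 64
    learning_rate: float = 5e-4
    target_update_steps: int = 2_000
    grad_clip_norm: float = 10.0
    eps_start: float = 1.0
    eps_end: float = 0.05
    eps_decay_steps: int = 100_000

cfg = DQNConfig()
# Write this alongside checkpoints so every run is reproducible and diffable.
config_json = json.dumps(asdict(cfg), indent=2)
```

Because the dataclass is frozen, a config cannot be silently mutated mid-run; changing a value forces you to create (and log) a new config.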
Read your curves like an engineer. Plot (a) episode return (train and eval), (b) TD loss, (c) mean/max Q on a fixed batch, and (d) epsilon. Healthy learning typically shows eval return rising while loss decreases or stabilizes; Q-values should increase only to a scale consistent with achievable discounted returns. For example, if the maximum episode length is 500 with reward 1 per step, the undiscounted return caps at 500 and the discounted value under γ=0.99 caps near 1/(1-γ) = 100; Q-values orders of magnitude beyond that are a red flag.
Benchmarking performance means comparing against baselines and controlling randomness. Always run multiple seeds (at least 3; ideally 5–10), log wall-clock time and environment steps, and report mean and variability of evaluation return. If you change one ingredient (e.g., target update cadence), keep everything else fixed and rerun: this is the simplest form of ablation study and is how you build confidence that an improvement is real.
Practical outcome: by the end of this chapter’s implementation, you should have a DQN agent that (1) uses a neural Q-network, (2) trains from a replay buffer, (3) stabilizes bootstrapped targets with a target network, and (4) produces interpretable training logs that let you diagnose overestimation and exploration issues quickly.
1. Why does tabular Q-learning become impractical in many game environments described in Chapter 3?
2. In DQN, what replaces the Q-table and what does it output?
3. What is the main training-stability purpose of adding an experience replay buffer in DQN?
4. Why does Chapter 3 introduce a separate target network in addition to the online Q-network?
5. According to the chapter, why is DQN training notoriously unstable compared with a simple supervised learning setup?
So far, you have built value-based agents that choose from a small, discrete menu of actions. Many “game AI” problems are not like that. Steering a car, aiming a turret with analog rotation, controlling a character’s movement vector, or tuning a camera’s yaw/pitch are naturally continuous. Discretizing continuous actions is sometimes workable, but it often explodes the action count, wastes samples, and produces jerky behavior. Policy gradient methods solve this by directly learning a parameterized policy that outputs a distribution over actions, allowing smooth control and principled exploration.
This chapter implements REINFORCE (the classic Monte Carlo policy gradient) and then adds a learned baseline/value function to reduce variance. You will also implement a Gaussian policy for continuous action spaces, stabilize updates with advantage normalization and gradient clipping, and learn how to diagnose failure modes using entropy and KL drift. Along the way, you will develop engineering judgment: when policy gradients are the right tool, how to pick sane hyperparameters, and how to debug learning dynamics without guessing.
The mental model to keep: value-based methods (like DQN) learn “how good is each action” and derive a policy by argmax. Policy gradient methods learn the policy directly—“what action distribution should I sample from”—and adjust it in the direction that increases expected return. This shift is what makes continuous control practical.
Practice note for Implement REINFORCE with a stochastic policy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add a baseline/value function to reduce variance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Handle continuous actions with Gaussian policies: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Normalize advantages and stabilize updates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Compare policy gradients vs DQN on suitable environments: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
In continuous control, the action space is not a small set but an interval (or multiple intervals), such as steering in [-1, 1] and throttle in [0, 1]. A DQN-style approach would require discretizing each dimension. With two dimensions and just 11 bins each, you already have 121 discrete actions; with three dimensions it becomes 1331. This “curse of discretization” increases network output size, slows learning, and can prevent fine-grained behaviors that feel natural in games.
Policy-based methods avoid this by modeling a distribution over continuous actions and sampling from it. The policy can express “try slightly left” or “apply gentle throttle” without creating thousands of discrete buttons. This is also better aligned with exploration: instead of epsilon-greedy (which jumps to unrelated actions), stochastic policies explore locally around the current behavior, which is critical in physics-heavy environments.
Engineering judgment: start by asking whether discretization changes the nature of the task. If discretizing makes the game feel “grid-like” (steering snaps, aiming is stepwise), policy gradients will typically be simpler and higher quality.
REINFORCE learns a stochastic policy π_θ(a|s) by maximizing expected return. The key identity is the log-derivative trick: for the objective J(θ) = E[Σ r], the gradient is ∇_θ J = E[∇_θ log π_θ(a|s) · G], where G is the return from that timestep. In practice, you sample trajectories with the current policy, compute returns, and do gradient ascent on log-probability weighted by return.
Workflow for a minimal REINFORCE loop:
- Sample one or more complete episodes with the current stochastic policy, storing each step’s log_prob and reward.
- Compute the discounted return G_t for every timestep, never crossing episode boundaries.
- Form the loss -Σ_t log_prob_t · G_t (minimizing it performs gradient ascent on expected return) and take an optimizer step.
Common mistakes: (1) forgetting to detach the return tensor (it should not backprop through the reward computation), (2) mixing up signs (maximize return but minimize loss), (3) using an incorrect log_prob (ensure it matches the distribution you sampled from), and (4) computing returns across episode boundaries when you batch episodes together.
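The returns-to-go computation is where mistake (4) usually happens, so it is worth having a reference implementation that only ever sees one episode at a time:

```python
def discounted_returns(rewards, gamma):
    """Returns-to-go G_t = r_t + gamma * G_{t+1}, computed backward
    over a single episode. Call once per episode so returns never
    leak across episode boundaries when episodes are batched."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

When batching several episodes into one update, apply this per episode and only then concatenate the results.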
Practical outcome: you can now train a policy in Gymnasium tasks where the action is sampled from a distribution. Even before adding a baseline, you should see learning on simple problems, but it may be noisy and unstable—variance reduction is the next step.
Pure REINFORCE has high variance because returns can vary wildly between episodes, especially in sparse-reward or long-horizon games. The classic fix is to subtract a baseline b(s) that does not depend on the action. The gradient remains unbiased, but variance drops: replace G_t with (G_t - b(s_t)). The best baseline is the state value function V_φ(s), learned by regression to returns.
Implementation pattern (actor-critic style, but still Monte Carlo):
- Fit a value network V_φ(s) by regressing on Monte Carlo returns (MSE loss).
- Compute advantages A_t = G_t - V_φ(s_t).
- Use the detached A_t in place of G_t in the actor loss, and update the critic with its own loss term.
Two engineering details matter. First, stopgrad (detach) the advantage in the actor loss so the critic does not get updated through the policy gradient term. Second, keep critic learning stable: use a separate optimizer or at least separate loss terms and possibly different learning rates. If the critic is too weak, advantages are noisy; if it overfits, it can introduce oscillations.
Advantage normalization is a simple, high-impact stabilizer: within a batch, normalize A to zero mean and unit standard deviation before the actor update. This does not change the solution (it scales the gradient), but it makes learning rates easier to tune and reduces sensitivity to reward scale.
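The batch-level normalization described above is a one-liner; the small epsilon guards against a degenerate batch where every advantage is identical:

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Zero-mean, unit-std advantages within one batch.

    Scales the gradient without changing the optimum, which makes the
    actor learning rate much less sensitive to reward scale.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```

Always normalize over the whole collected batch, not per episode, so the relative ordering of good and bad trajectories is preserved.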
For continuous actions, the most common stochastic policy is a factorized Gaussian: π_θ(a|s) = N(μ_θ(s), σ_θ(s)). Your network outputs a mean vector μ and either a log-standard-deviation vector log σ (state-dependent) or a global parameter log σ that is learned. Sampling uses PyTorch distributions: create Normal(μ, σ), sample (use rsample if you need reparameterized gradients), and compute log_prob of the sampled action.
Gymnasium environments often bound actions (e.g., Box(low=-1, high=1)). A raw Gaussian is unbounded, so you need to handle bounds. Two practical approaches:
- Clamp sampled actions into the valid range before stepping the environment (simple, but it distorts the effective distribution near the bounds).
- Squash actions through tanh and correct log_prob with the change-of-variables term, so the policy natively respects the bounds.
For a first implementation, you can start with clamping to validate the loop, then switch to tanh-squashing once learning works. When computing log_prob for multidimensional actions, sum per-dimension log_probs to get a scalar per timestep. Also, remember to store the log_prob from the distribution used at sampling time; recomputing later with updated parameters breaks the on-policy gradient.
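The per-dimension log_prob sum is easy to sanity-check against a small NumPy reference. This sketch computes the same quantity you would get from summing `Normal(mu, std).log_prob(action)` over action dimensions in PyTorch; the function name is illustrative.

```python
import math
import numpy as np

def gaussian_log_prob(action, mu, log_std):
    """Log-density of a factorized (diagonal) Gaussian policy,
    summed over action dimensions to give one scalar per timestep."""
    action, mu, log_std = (np.asarray(x, dtype=np.float64)
                           for x in (action, mu, log_std))
    std = np.exp(log_std)
    per_dim = (-0.5 * ((action - mu) / std) ** 2
               - log_std
               - 0.5 * math.log(2 * math.pi))
    return float(per_dim.sum())
```

A quick check: at the mean with unit std, each dimension contributes -0.5·log(2π), so a 2-D action gives exactly -log(2π).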
Practical outcome: you can control continuous-action tasks (e.g., Pendulum, MountainCarContinuous, LunarLanderContinuous) with smooth behavior and meaningful exploration.
On-policy policy gradients are sensitive to how much data you collect per update. Updating after every single episode can work, but gradients are noisy. A more reliable pattern is to collect a batch of transitions (e.g., 2k–10k steps) possibly across multiple episodes, then do one or a few optimization epochs on that batch. This gives you better advantage statistics for normalization and more stable critic targets.
Recommended practical workflow:
- Collect a batch of transitions (e.g., 2k–10k steps), possibly spanning several episodes.
- Compute returns and advantages over the whole batch, then normalize the advantages.
- Run one or a few optimization epochs on the batch with gradient clipping, then discard the data (the method is on-policy).
Learning rates: policy gradients often need smaller actor learning rates than you expect (e.g., 3e-4 to 1e-3), while critics may tolerate similar or slightly higher rates. If returns oscillate or collapse, reduce the actor LR first. If the value loss explodes, reduce critic LR and consider clipping returns or rewards.
Gradient clipping is a practical safety net. Clip global norm (e.g., 0.5 to 1.0) for both actor and critic to prevent rare large advantages from causing catastrophic parameter jumps. This is especially important when reward scale is large or when early policies occasionally stumble into very high-return episodes that skew gradients.
Comparison note: DQN benefits from replay buffers and off-policy reuse of data; REINFORCE does not. Accept that policy gradients trade sample efficiency for compatibility with continuous actions and simpler action modeling.
Policy gradient training can fail silently: the code runs, but learning stalls or becomes unstable. Add diagnostics early. Three signals are especially useful: policy entropy, KL drift, and reward/advantage scale.
Entropy measures how random the policy is. For a Gaussian policy, entropy is directly related to log σ. Track mean entropy per batch. If entropy collapses too early (very small σ), exploration ends and the agent may converge to a bad local optimum. If entropy stays extremely high, the policy may never commit. A common remedy is adding an entropy bonus to the actor loss: L_actor = -E[log_prob * A] - β * E[entropy]. Tune β small (e.g., 1e-3 to 1e-2).
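For a factorized Gaussian, entropy has a closed form, so the diagnostic and the bonus are both cheap to compute. The sketch below uses illustrative names and NumPy stand-ins for what would be tensors in a real actor loss:

```python
import math
import numpy as np

def gaussian_entropy(log_std):
    """Entropy of a factorized Gaussian: 0.5*log(2*pi*e) + log_std per dim,
    summed over action dimensions. Depends only on log_std, not the mean."""
    log_std = np.asarray(log_std, dtype=np.float64)
    return float(np.sum(0.5 * math.log(2 * math.pi * math.e) + log_std))

def actor_loss(log_probs, advantages, entropies, beta=1e-3):
    """L_actor = -E[log_prob * A] - beta * E[entropy]."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    entropies = np.asarray(entropies, dtype=np.float64)
    return float(-(log_probs * advantages).mean() - beta * entropies.mean())
```

Because entropy grows with log σ, watching this scalar over training tells you directly whether exploration is collapsing.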
KL drift approximates how far the new policy moved from the old one on the same states. Large KL spikes indicate too-large updates (often from high learning rates or unnormalized advantages). Even without implementing PPO, you can compute KL(π_old || π_new) on the batch as a “speedometer.” If KL is consistently high, reduce actor LR, increase batch size, or tighten gradient clipping.
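For two factorized Gaussian policies evaluated on the same states, the KL divergence also has a closed form, so the "speedometer" can be logged exactly rather than estimated from samples. The function name is illustrative:

```python
import numpy as np

def gaussian_kl(mu_old, log_std_old, mu_new, log_std_new):
    """KL(pi_old || pi_new) for factorized Gaussians, summed over dims.

    Per dimension:
      log(s_new/s_old) + (s_old^2 + (mu_old - mu_new)^2) / (2*s_new^2) - 0.5
    """
    mu_old, mu_new = np.asarray(mu_old), np.asarray(mu_new)
    log_std_old = np.asarray(log_std_old, dtype=np.float64)
    log_std_new = np.asarray(log_std_new, dtype=np.float64)
    var_old, var_new = np.exp(2 * log_std_old), np.exp(2 * log_std_new)
    per_dim = (log_std_new - log_std_old
               + (var_old + (mu_old - mu_new) ** 2) / (2 * var_new)
               - 0.5)
    return float(per_dim.sum())
```

Average this over the batch of states: a value near zero means a small step, while persistent spikes mean the update size needs taming.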
Reward scaling matters because returns directly scale gradients. If rewards are huge, gradients explode; if tiny, learning is slow. Standardize rewards with wrappers, rescale by a constant, or rely on advantage normalization (still track raw episode return as the true metric). Also watch the critic: a rapidly growing value loss often means the reward scale is too large or returns are poorly computed across episode resets.
Practical outcome: you can tell whether a learning problem is exploration (entropy), step size (KL), or signal scaling (reward/advantages). This makes iteration systematic rather than guesswork.
1. Why are policy gradient methods a better fit than value-based methods for many continuous-control game AI tasks?
2. In REINFORCE, what is the role of adding a learned baseline/value function?
3. How does a Gaussian policy typically represent actions in a continuous action space?
4. What is the main purpose of advantage normalization and stabilizers like gradient clipping in policy gradient training?
5. Which statement best captures the chapter’s mental model comparison between policy gradients and DQN?
In earlier chapters you built agents that can learn, but “can learn” is not the same as “learns reliably.” Most RL failures in game environments are not caused by fancy algorithm mistakes; they are caused by engineering mismatches between the agent and the environment: observations that are poorly scaled, action spaces that are awkward for the policy, reward functions that inadvertently incentivize the wrong behavior, and training loops that cannot resume or be compared across runs.
This chapter treats your Gym/Gymnasium environment as a product you are designing. You will use wrappers to make the interface stable and ergonomic for learning, shape rewards without changing the real objective, introduce curricula and difficulty schedules, and build a training pipeline that produces reproducible artifacts: logs, checkpoints, configs, and results you can analyze later.
A good mental model is: the environment is an API. Wrappers are the adapter layer that enforce conventions (dtype, shapes, scaling, clipping) and inject instrumentation (episode statistics, videos, reward components). Once you standardize this layer, your RL algorithms become easier to debug, compare, and extend.
Practice note for Create environment wrappers for observations, actions, and rewards: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Design reward shaping without breaking the objective: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add curriculum learning and difficulty schedules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a scalable training loop with checkpoints and resumes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run controlled hyperparameter sweeps and track results: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Observation preprocessing is the first place wrappers pay off. Many Gym environments expose observations that are technically correct but awkward for learning: pixel frames with large dynamic range, low-level RAM values, or vectors whose components have very different scales. Your job is to present the agent with a consistent, information-rich signal while staying honest about what the agent can observe.
Frame stacking is the classic example. In Atari-style games, a single frame does not reveal velocity; stacking the last k frames makes motion inferable. The wrapper should (1) keep a fixed-size deque of frames, (2) return a stacked tensor with predictable layout (e.g., (k, H, W) for channels-first), and (3) handle episode resets by filling the stack with the initial frame. Common mistakes include forgetting to reset the buffer (leading to leakage across episodes) or stacking after resizing inconsistently (causing subtle shape changes).
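The three requirements above can be sketched as a small wrapper. This is a minimal illustration, not the Gymnasium API: DummyEnv and the simplified reset()/step() signatures are stand-ins so the stacking logic is visible on its own.

```python
from collections import deque
import numpy as np


class DummyEnv:
    """Hypothetical stand-in for a Gym-style environment:
    reset() -> frame, step(a) -> (frame, reward, done)."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros((4, 4), dtype=np.uint8)

    def step(self, action):
        self.t += 1
        return np.full((4, 4), self.t, dtype=np.uint8), 0.0, False


class FrameStack:
    """Keep the last k frames and return a (k, H, W) channels-first stack."""
    def __init__(self, env, k):
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        frame = self.env.reset()
        # Fill the whole buffer with the initial frame: the output shape is
        # fixed from step one, and no stale frames leak across episodes.
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames)

    def step(self, action):
        frame, reward, done = self.env.step(action)
        self.frames.append(frame)  # deque(maxlen=k) drops the oldest frame
        return np.stack(self.frames), reward, done
```

Note that reset() refills the deque rather than appending to it, which is exactly the buffer-reset mistake described above.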
Normalization is equally important for vector observations. Neural networks train faster when inputs are roughly standardized. A practical approach is an online running mean/variance wrapper that normalizes observations to zero mean and unit variance. However, be careful: if you normalize using statistics that include evaluation episodes, you contaminate your metrics. Separate training and evaluation environments, each with its own normalization state (or freeze stats for evaluation). For image observations, normalization often means scaling uint8 pixels to [0,1] or [-1,1] and optionally converting to grayscale and resizing.
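A sketch of the running-statistics core (a streaming variant of Welford's algorithm). In a real wrapper you would call update() only on training observations and skip it during evaluation, which is how you "freeze" the stats:

```python
import numpy as np


class RunningNorm:
    """Online mean/variance normalizer. Call update() during training only;
    skip it at evaluation time so eval episodes cannot contaminate the stats."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, x):
        # Streaming update: equivalent to recomputing the population
        # mean/variance over everything seen so far.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)
```

Giving the evaluation environment its own frozen copy of mean/var is the cleanest way to honor the train/eval separation described above.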
When done well, preprocessing makes your training loop simpler: the model sees a fixed shape and stable numeric ranges across runs, seeds, and environment versions.
Action space design determines what your agent is capable of expressing. In Gymnasium, actions might be discrete (e.g., Discrete(n)) or continuous (e.g., Box(low, high)). Wrappers are the right place to impose constraints or map between representations so the policy outputs remain well-behaved.
For continuous control, clipping is a practical necessity: policies can output values outside the legal bounds, especially early in training. An ActionClipWrapper that clamps actions to [low, high] prevents invalid actions and environment errors. But note the learning implication: if many actions get clipped, the policy receives distorted gradients (it thinks it took action a, but the environment executed clip(a)). A better pattern is to parametrize the policy output with a squashing function (e.g., tanh) and then scale to bounds, so actions are always valid by construction.
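The two patterns side by side, as a sketch. Clipping repairs invalid actions after the fact; tanh squashing makes them valid by construction:

```python
import numpy as np


def clip_action(action, low, high):
    """Last-resort guard: clamp whatever the policy emitted to the legal box.
    If this fires often, gradients are computed for actions never executed."""
    return np.clip(action, low, high)


def squash_to_bounds(raw, low, high):
    """Preferred pattern: squash an unbounded policy output through tanh,
    then rescale [-1, 1] onto [low, high] so the action is always legal."""
    return low + (np.tanh(raw) + 1.0) * 0.5 * (high - low)
```

With squashing, even an extreme raw output lands exactly on the boundary instead of outside it, so the environment never sees an action the policy did not "intend."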
Discretization is tempting when you want to use DQN-like algorithms on a continuous problem: discretize each dimension into bins and treat the Cartesian product as a discrete action space. The tradeoff is combinatorial explosion. Two dimensions with 11 bins each already yield 121 actions; four dimensions yield 14,641, which is often impractical. Discretization can work for simple games or when only a subset of actions matters, but you must measure whether performance saturates due to coarse control.
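A short sketch that makes the explosion concrete: building the full joint-action table for a Box-style space and counting its rows.

```python
import itertools
import numpy as np


def build_action_table(low, high, bins_per_dim):
    """Discretize each continuous dimension into bins_per_dim values and
    enumerate the Cartesian product: one row per joint discrete action."""
    axes = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
    return np.array(list(itertools.product(*axes)))


two_d = build_action_table([-1.0, -1.0], [1.0, 1.0], bins_per_dim=11)
four_d = build_action_table([-1.0] * 4, [1.0] * 4, bins_per_dim=11)
# 11**2 = 121 joint actions in 2D; 11**4 = 14,641 in 4D.
```

The table itself is the DQN action mapping: the network outputs one Q-value per row, and the chosen row index is converted back to a continuous action vector.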
Action wrappers are also a clean place to implement game-specific ergonomics: map a smaller “meta-action” space to complex combos, or enforce constraints like “cannot move left and right simultaneously.” Always document the mapping in your experiment config so results remain interpretable.
Reward design is where you translate “what you want” into a learning signal. In games, sparse rewards (win/lose) often make exploration too hard, but naive shaping can produce agents that maximize shaped reward while ignoring the real objective. The goal is to add guidance without changing what optimal behavior means.
A dependable pattern is potential-based shaping: add a term of the form F(s, s') = gamma * Phi(s') - Phi(s), where Phi is a potential function encoding progress (distance to goal, remaining enemies, etc.). This preserves the optimal policy under standard assumptions while providing dense feedback. In practice, you implement it in a reward wrapper that tracks the previous potential and adds the shaped difference each step.
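The tracking of the previous potential is the only state the wrapper needs. A minimal sketch, with phi supplied by the caller as the potential function:

```python
class PotentialShaping:
    """Reward shaping sketch: r' = r + gamma * Phi(s') - Phi(s).
    `phi` is any user-supplied potential over states (e.g., negative
    distance to the goal); higher potential means closer to success."""
    def __init__(self, phi, gamma=0.99):
        self.phi = phi
        self.gamma = gamma
        self.prev_phi = None

    def reset(self, state):
        # Must be called at episode start so the first step's difference
        # is measured against the initial state, not a stale one.
        self.prev_phi = self.phi(state)

    def shape(self, reward, next_state):
        next_phi = self.phi(next_state)
        shaped = reward + self.gamma * next_phi - self.prev_phi
        self.prev_phi = next_phi
        return shaped
```

With phi as negative distance to a goal, steps toward the goal receive a positive bonus and steps away a negative one, while the optimal policy is unchanged.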
Another safe pattern is reward decomposition for logging. Keep the environment’s original reward as the “objective reward,” and add auxiliary components (e.g., exploration bonus, time penalty) but log each component separately. This helps you diagnose when the agent is “gaming” the auxiliary term.
Finally, remember that wrappers can modify reward but should not hide it. Preserve both: train on the shaped reward if you must, but always evaluate and model-select on the original task reward to avoid self-deception.
Curriculum learning makes hard problems learnable by ordering experiences from easy to difficult. In games, this can mean shorter levels, slower opponents, simplified physics, or starting the agent closer to the goal. The key is to control difficulty with a schedule that is explicit, logged, and reversible—so you can reproduce results and avoid accidentally training on a different task than you evaluate.
A practical implementation is a curriculum wrapper that reads a difficulty parameter and modifies environment settings at reset(). For example, every N episodes increase enemy speed by 5%, or expand the range of starting positions. Another approach is performance-based progression: when average episode return over the last 100 episodes exceeds a threshold, advance to the next difficulty stage. Performance-based schedules can be more data-efficient, but they can also “lock” if metrics are noisy; add patience and smoothing.
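The performance-based variant can be sketched as a small scheduler that the training loop feeds with episode returns; the window size and thresholds here are illustrative:

```python
from collections import deque


class CurriculumSchedule:
    """Advance to the next difficulty stage when the average return over the
    last `window` episodes clears the current stage's threshold. The full
    window must refill after each advance, which acts as built-in patience
    against noisy metrics."""
    def __init__(self, thresholds, window=100):
        self.thresholds = thresholds      # one threshold per stage transition
        self.returns = deque(maxlen=window)
        self.stage = 0

    def record(self, episode_return):
        self.returns.append(episode_return)
        window_full = len(self.returns) == self.returns.maxlen
        more_stages = self.stage < len(self.thresholds)
        if window_full and more_stages:
            avg = sum(self.returns) / len(self.returns)
            if avg >= self.thresholds[self.stage]:
                self.stage += 1
                self.returns.clear()      # re-measure at the new difficulty
        return self.stage
```

The environment wrapper then reads schedule.stage at reset() and applies the matching difficulty settings, so the current stage is always explicit and loggable.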
Domain randomization is a related idea aimed at robustness: randomize textures, spawn locations, wind, friction, or sensor noise. This prevents the policy from overfitting to one narrow configuration and often improves transfer to new levels or seeds. The engineering rule is to randomize factors that are nuisance variables, not the core objective. If you randomize too aggressively early, learning may collapse because the agent cannot form stable associations.
When done carefully, curricula and randomization become knobs you can tune systematically—rather than a pile of ad-hoc environment changes that make experiments impossible to compare.
RL experiments are noisy, compute-heavy, and sensitive to seeds. If you cannot reconstruct what happened, you cannot trust your conclusions. A professional training pipeline treats tracking as a first-class feature: every run writes structured logs, saves configs, stores artifacts, and records the code version.
Start with a single source of truth for configuration (YAML/JSON or argparse) and save it verbatim into the run directory. Include: environment id/version, wrapper stack and parameters, reward shaping coefficients, model architecture, optimizer settings, seed(s), total steps, evaluation cadence, and any curriculum schedule. If you use sweeps, log both the sampled hyperparameters and the sweep definition. This is what makes “controlled hyperparameter sweeps” actually controlled.
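A minimal sketch of the "save verbatim" step. The file name and hashing scheme are illustrative choices, not a standard:

```python
import hashlib
import json
import pathlib
import tempfile


def save_run_config(config, run_dir):
    """Write the full run configuration verbatim into the run directory and
    return a short content hash, so any result can be matched to the exact
    settings that produced it."""
    path = pathlib.Path(run_dir)
    path.mkdir(parents=True, exist_ok=True)
    blob = json.dumps(config, sort_keys=True, indent=2)
    (path / "config.json").write_text(blob)
    # sort_keys makes the hash stable under dict-ordering differences.
    return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

Because the hash is deterministic, two runs with identical configs get identical hashes, which makes accidental config drift between "identical" runs easy to spot.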
For logs, write both step-level training metrics and episode-level metrics. At minimum: episode return (objective and shaped), episode length, success rate (if defined), loss values, Q-value statistics or entropy (depending on algorithm), and wall-clock throughput (steps/sec). For evaluation, run deterministic episodes with fixed seeds and log mean/median return plus confidence intervals across seeds.
Whether you use TensorBoard, Weights & Biases, MLflow, or plain CSV, the principle is the same: produce a complete story of each run so you can compare seeds, run ablations, and explain results months later.
Checkpointing is not just “save weights sometimes.” In RL, a checkpoint must capture enough state to resume training without changing the trajectory. That typically includes: model parameters, target network parameters (for DQN-like methods), optimizer state, replay buffer contents (or a clear policy for rebuilding it), running normalization statistics, environment and RNG states (Python, NumPy, torch), and the current global step/episode counters.
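A sketch of the bundling idea, focused on the part people most often forget: RNG state. The dict layout is an assumption for illustration; a real DQN checkpoint would also carry target-network parameters and the replay buffer (or a rebuild policy):

```python
import random
import numpy as np


def make_checkpoint(model_state, optim_state, step):
    """Bundle enough state to resume training without changing the
    trajectory: parameters, optimizer state, counters, and RNG states."""
    return {
        "model": model_state,
        "optimizer": optim_state,
        "step": step,
        "py_rng": random.getstate(),
        "np_rng": np.random.get_state(),
    }


def restore_rng(ckpt):
    """Restore RNG states so post-resume sampling matches an uninterrupted run."""
    random.setstate(ckpt["py_rng"])
    np.random.set_state(ckpt["np_rng"])
```

If the RNG states are restored correctly, the first random draws after resuming are identical to the draws an uninterrupted run would have produced.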
A robust pattern is to save two checkpoint streams: latest (for resuming after interruption) and best (for model selection). “Best” should be defined using evaluation on the objective reward, not the shaped reward. If you select on shaped return, you may pick a model that exploits the shaping term but performs worse on the real task. Also, keep evaluation separate from training: use a distinct environment instance and disable exploration noise (e.g., greedy policy for DQN, mean action for Gaussian policies) to reduce variance.
Early stopping in RL requires caution because learning curves can be non-monotonic. A practical approach is patience-based stopping on a smoothed evaluation metric: stop if the moving average has not improved for K evaluations, but only after a minimum number of steps to avoid terminating during initial exploration. For hyperparameter sweeps, early stopping can save compute, but ensure you apply the same stopping rule across trials, or your comparison becomes biased.
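The patience-plus-warmup rule can be sketched in a few lines; the EMA smoothing factor and thresholds here are illustrative defaults:

```python
class PatienceStopper:
    """Patience-based early stopping on an exponentially smoothed eval
    metric, with a warmup (`min_steps`) so training is never terminated
    during initial exploration."""
    def __init__(self, patience, min_steps, alpha=0.3):
        self.patience = patience
        self.min_steps = min_steps
        self.alpha = alpha               # EMA smoothing factor
        self.ema = None
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, step, eval_return):
        if self.ema is None:
            self.ema = eval_return
        else:
            self.ema = self.alpha * eval_return + (1 - self.alpha) * self.ema
        if self.ema > self.best:
            self.best = self.ema
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        # Both conditions must hold: past warmup AND patience exhausted.
        return step >= self.min_steps and self.bad_evals >= self.patience
```

Applying this exact object (same patience, same alpha) to every trial in a sweep is what keeps the cross-trial comparison unbiased.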
With checkpointing, careful model selection, and sensible stopping rules, your training pipeline becomes scalable: you can run long jobs, recover from failures, and produce results that stand up to scrutiny.
1. According to the chapter, what is the most common root cause of RL failures in game environments?
2. Which description best matches the chapter’s mental model of an environment and wrappers?
3. What is the key constraint the chapter places on reward shaping?
4. Why does the chapter argue for building a training loop with checkpoints and resume support?
5. How do curriculum learning and difficulty schedules fit into the chapter’s approach to reliability?
Training an RL agent is only half the job. In game AI, the real question is whether your agent performs reliably when the environment changes slightly, when observations are imperfect, and when you run it on a different machine or at a different frame rate. This chapter turns your training code into an evaluation workflow you can trust, and then into a shippable artifact you can deploy.
The key mindset shift is to treat evaluation like a product test suite, not like a training loop with rendering turned on. You will build an evaluation harness that runs across multiple seeds, compares against baselines, and supports ablations (turning off components to verify what actually matters). You will stress-test robustness with noise and environment shifts. Finally, you will optimize inference, package the agent behind a stable interface, and write a concise report with a reproducibility checklist so another engineer (or future you) can re-run the experiment.
A common mistake is to “evaluate” by watching a handful of episodes and trusting intuition. Another is to tune hyperparameters based on the same evaluation episodes, accidentally leaking information from eval into train. The practices in this chapter help you avoid those traps and make your results defensible.
Practice note for Build an evaluation harness with multiple seeds and confidence bounds: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Compare against baselines and run ablations on key components: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Test robustness under observation noise and environment changes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Optimize inference speed and package the agent for deployment: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Write a concise experiment report and reproducibility checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
First, draw a hard line between training and evaluation. Training is allowed to be noisy, exploratory, and adaptive (epsilon-greedy, entropy bonuses, learning rate schedules). Evaluation should be deterministic and comparable across runs: fixed policy (no exploration), fixed environment configuration, fixed preprocessing, and consistent termination rules.
In Gym/Gymnasium code, this usually means creating separate environment instances: env_train and env_eval. Even if they are the “same” environment, they must be reset with different seeds and must not share wrappers that mutate state in subtle ways. For example, a normalization wrapper that updates running mean/variance during evaluation leaks evaluation statistics into the policy. The correct approach is: update statistics during training, then freeze them for evaluation.
Another leakage vector is checkpoint selection. If you evaluate every N steps and then pick the best checkpoint based on eval scores, you have effectively tuned on the eval set. A practical fix is to use three splits: (1) training rollouts, (2) validation rollouts for early stopping/model selection, and (3) a final test evaluation that you run once at the end. In smaller projects you can approximate this by selecting checkpoints based on training proxies (loss curves, Q-values) and only using eval for reporting, but the safest pattern is train/validate/test.
Engineering judgment: if you plan to ship an agent into a game build, your evaluation environment should mimic the runtime as closely as possible (observation resolution, action repeat, frame time). Differences here often explain “it worked in the notebook” failures.
“Average episodic return” is the default RL metric, but games often need additional measures to reflect player-facing outcomes. Choose metrics that map to your design intent, and report more than one so you can detect trade-offs (e.g., a high score strategy that occasionally catastrophically fails).
Start with average return over evaluation episodes (mean and distribution). Then add success rate when the task has a clear binary outcome (level completed, boss defeated). Success rate is easier to interpret than return when reward shaping is heavy or when dense rewards do not align perfectly with “winning.”
Regret is useful when you have a known benchmark, curriculum, or baseline policy. Define regret as the gap between your agent and a reference (optimal if available, or best baseline) per episode or per timestep. In practice, you can compute regret relative to a strong heuristic bot or a scripted policy: regret = return_baseline - return_agent (or the reverse, depending on convention). This helps quantify “how far are we from acceptable play?”
Stability captures reliability across time and across seeds. Report variance (or standard deviation) of returns, the worst-percentile performance (e.g., 5th percentile), and failure modes. In a shipped game agent, the tail matters: one in fifty catastrophic episodes can still be unacceptable if it looks like the AI “breaks.”
Common mistake: averaging returns across episodes with different lengths without also reporting episode length. Some agents “game” the metric by ending episodes early (intentionally or accidentally). Track average episode length and, if relevant, time-to-success.
RL results can vary dramatically with random seeds. A single run is a demo, not evidence. Your evaluation harness should run multiple training seeds and multiple evaluation seeds per checkpoint. A practical baseline is 5 training seeds; in heavier environments you may choose 3, but then be explicit about uncertainty.
Implement an evaluation harness that takes: (1) a trained policy checkpoint, (2) a list of eval seeds, and (3) a fixed number of episodes per seed. Aggregate metrics across all episodes and report confidence intervals (CIs). Bootstrapped CIs are simple and robust: resample episodes with replacement, compute the mean, repeat (e.g., 10,000 times), and take the 2.5/97.5 percentiles for a 95% CI.
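The bootstrap procedure described above is short enough to sketch directly:

```python
import numpy as np


def bootstrap_ci(returns, n_boot=10_000, conf=0.95, seed=0):
    """Bootstrapped confidence interval for mean episodic return:
    resample episodes with replacement, take each resample's mean,
    and read the interval off the percentiles of those means."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    means = np.array([
        rng.choice(returns, size=len(returns), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo_pct = (1.0 - conf) / 2.0 * 100.0          # 2.5 for a 95% CI
    hi_pct = (1.0 + conf) / 2.0 * 100.0          # 97.5 for a 95% CI
    lo, hi = np.percentile(means, [lo_pct, hi_pct])
    return returns.mean(), lo, hi
```

Seeding the bootstrap itself keeps the reported interval reproducible, which matters when the harness output goes into a report.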
When comparing agents (e.g., DQN vs DQN+Double, or with/without target network), use paired comparisons when possible: evaluate both agents on the same set of seeds and episode indices. This reduces noise and makes ablation conclusions sharper. If you report significance, prefer non-parametric tests (like Wilcoxon signed-rank) on per-seed means, but avoid overemphasizing p-values. In engineering terms, effect size and confidence bounds are more actionable than a binary “significant/not.”
During evaluation, run the policy in inference mode (torch.no_grad(), model.eval()) so no gradients are tracked and layers like dropout or batch norm behave deterministically. Common mistake: changing code between runs without versioning. Your harness should record the git commit, dependency versions, and environment IDs. Otherwise, seed control won't save you from silent drift.
After you have a clean eval harness, expand it into a robustness suite. Robustness asks: does the policy still behave reasonably when the world is slightly different from training? In games, this is unavoidable: frame timing changes, textures differ, physics updates vary, and players create states your training distribution rarely saw.
Start with observation perturbations. Add controlled Gaussian noise to continuous observations, salt-and-pepper noise to pixel inputs, or random occlusions (drop a small patch). If you use preprocessing like grayscale, resizing, or normalization, apply perturbations after preprocessing to simulate sensor noise at the policy interface. Measure degradation curves: performance vs noise level.
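A sketch of the perturbation step, applied at the policy interface after preprocessing; the noise model and occlusion scheme are illustrative choices:

```python
import numpy as np


def perturb(obs, noise_std=0.0, occlude_frac=0.0, rng=None):
    """Apply Gaussian noise and/or random elementwise occlusion (zeroing)
    to an already-preprocessed observation. Sweep noise_std / occlude_frac
    to trace a degradation curve: performance vs perturbation level."""
    if rng is None:
        rng = np.random.default_rng()
    out = obs + rng.normal(0.0, noise_std, size=obs.shape)
    if occlude_frac > 0.0:
        mask = rng.random(obs.shape) < occlude_frac
        out = np.where(mask, 0.0, out)
    return out
```

Running the same fixed eval seeds at each perturbation level keeps the degradation curve comparable point to point.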
Next, test environment shifts. Examples include changing enemy speed, gravity, action repeat, or reward scaling. In Gymnasium, you can implement shifts via wrappers or environment parameters. Keep shifts small and interpretable; you want to learn whether the agent relies on brittle timing hacks or memorized trajectories.
Finally, add stress cases: adversarial initial states, rare corner cases, and longer horizon episodes. For example, randomize spawn positions more widely, extend max episode steps, or insert “stuck” states to see whether the agent recovers. Log qualitative traces for failures (state, action, Q-values/probabilities) so debugging is possible.
Engineering judgment: don’t chase robustness blindly. Decide which shifts reflect real deployment variation, and prioritize those. A robustness test suite is most valuable when it mirrors how the game will actually change (different device performance, patched content, or player-driven state diversity).
Shipping an agent means turning “a training script that produced a checkpoint” into “a versioned component that can run fast and predictably.” Begin by saving not just weights, but the full inference contract: observation preprocessing parameters, action space mapping, and any wrappers required to interpret the environment.
For PyTorch-based agents, store: model state_dict, architecture config (layer sizes, frame stack count), and normalization statistics. Consider exporting to TorchScript or ONNX if you need stable, optimized inference in a production runtime. Always verify numerical parity between the exported model and the original within a small tolerance.
Wrap inference in a small class with a single method, e.g., act(obs) -> action. Inside it: apply preprocessing, move tensors to the correct device, run no_grad, and return an environment-valid action. This wrapper is the boundary you will test and benchmark. Include a batch mode (act_batch) if you plan to run many agents or parallel simulations.
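A sketch of that boundary class. The policy_fn and preprocess callables are stand-ins for the real model and wrapper stack; the clip is a final guard so the deployed agent can never emit an invalid action:

```python
import numpy as np


class AgentRunner:
    """Deployment boundary: one act() method that owns preprocessing and
    always returns an environment-valid action. This is the unit you
    benchmark for latency and cover with tests."""
    def __init__(self, policy_fn, preprocess, low, high):
        self.policy_fn = policy_fn      # stand-in for the trained model
        self.preprocess = preprocess    # same transform used in training
        self.low, self.high = low, high

    def act(self, obs):
        x = self.preprocess(obs)
        action = self.policy_fn(x)
        return np.clip(action, self.low, self.high)
```

A real implementation would also move tensors to the target device and run under no_grad inside act(); the point is that all of that lives behind this one method.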
Optimize speed by reducing per-step overhead: avoid Python-side list manipulations, pre-allocate tensors, and keep the model on the target device. For games, latency matters more than throughput; measure wall-clock time per decision under realistic conditions (same hardware, same frame budget). If you use frame stacking, manage a ring buffer to avoid copying full stacks each step.
Common mistake: deploying with training-time wrappers that alter reward or termination. Deployment should only include observation/action transformations needed to run the policy, not training conveniences like reward clipping unless your policy truly expects it.
A strong experiment is one you can explain quickly and reproduce later. Your final deliverable should include: (1) plots that answer the performance question, (2) a short report that explains what you did and why, and (3) a reproducibility checklist. This is the difference between a promising prototype and an engineering-ready result.
For plots, include learning curves with shaded confidence bands across training seeds. Show evaluation metrics over training steps (not just at the end) so readers can see stability and potential overfitting. Add bar charts or tables for robustness suites (performance under noise/shift levels). If you ran ablations, plot them side-by-side: “full agent” vs “without target network” vs “without replay,” etc. Ablations are not busywork—they are how you prove which components are actually buying performance.
Your report can be concise (1–2 pages) if it is structured: environment description, algorithm and key hyperparameters, compute budget, baselines compared, ablation summary, robustness findings, and deployment notes (inference speed, export format). Include failure analysis: one or two representative failure modes and what triggers them.
Common mistake: only reporting the best run. Always report across seeds with uncertainty. Decision-makers care about expected performance and risk, not a single lucky outcome.
1. Why does the chapter recommend treating evaluation like a product test suite rather than a training loop with rendering turned on?
2. What is the main purpose of running evaluation across multiple seeds with confidence bounds?
3. How do baselines and ablations help you interpret an RL agent’s performance?
4. Which practice best avoids accidentally leaking information from evaluation into training?
5. What combination of tasks best reflects the final step of turning a trained agent into something you can ship?